
From click to reasoning: the WebChoreArena benchmark challenges agents with memory-heavy, multi-page tasks

Web automation agents have become an increasingly important focus for AI, largely because they can perform human-like actions in digital environments. These agents interact with websites through a graphical user interface (GUI), mimicking human behaviors such as clicking, typing, and navigating across pages. This approach bypasses the need for dedicated application programming interfaces (APIs), which are often unavailable or limited in many web applications. Instead, the agents can operate universally across web domains, making them suitable for a wide variety of tasks. The evolution of large language models (LLMs) has made these agents capable of far more than simply parsing web content. As their capabilities grow, they need to be evaluated not only on simple browsing tasks but also on more complex, memory-intensive chores. The benchmarks that once challenged early models no longer measure the full extent of what modern agents can do.
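To make this GUI-level action space concrete, the sketch below shows how low-level actions such as navigating, typing, and clicking might be issued through a browser-automation library like Playwright. The URL and CSS selectors are hypothetical placeholders; this is only an illustration of the kind of interaction described above, not the tooling used in the paper.

```python
# Minimal sketch of GUI-level web actions, assuming Playwright is installed
# (pip install playwright && playwright install chromium).
# The URL and selectors below are hypothetical placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate to a page, as a human would enter a URL.
    page.goto("https://example.com/shop")

    # "Type" a query into a search box and "click" the submit button.
    page.fill("input#search", "wireless headphones")
    page.click("button#submit")

    # Read back the rendered text the agent would observe next.
    observation = page.inner_text("body")
    print(observation[:500])

    browser.close()
```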

As these web agents progress, a pressing problem emerges: their ability to handle mundane, memory-intensive, multi-step digital chores remains insufficient. Many tasks humans perform on websites, such as retrieving data from different pages, calculating values based on previous inputs, or applying complex rules, demand substantial cognitive work. These are not merely navigational challenges; they test memory, logic, and long-term planning. Most benchmarks, however, focus on simplified scenarios and do not reflect the kinds of digital chores people usually prefer to avoid. Furthermore, the limitations of these benchmarks become more obvious as agents improve. Ambiguity in task instructions or inconsistencies in expected outputs begin to bias evaluation: when agents produce reasonable but slightly different answers, they are penalized incorrectly because of vague task definitions. This makes it difficult to distinguish genuine model limitations from shortcomings of the benchmark itself.

Previous efforts to evaluate web agents have centered on benchmarks such as WebArena. WebArena gained widespread adoption thanks to its reproducibility and its ability to simulate real-world websites, including Reddit, GitLab, and e-commerce platforms. It provides over 800 tasks designed to test an agent's ability to accomplish web-based goals in these environments. However, those tasks focus mainly on general browsing and do not adequately challenge more advanced agents. Other benchmarks, such as Mind2Web, GAIA, and MMInA, contribute by exploring real web tasks or platform-specific environments (such as ServiceNow), but each comes with trade-offs: some lack interactivity, some do not support reproducibility, and some are too narrow. These limitations leave gaps in measuring areas that require complex decision-making, long-term memory, and precise data processing.

Researchers from the University of Tokyo introduced WebChoreArena. This extended framework builds on the structure of WebArena but substantially increases task difficulty and complexity. WebChoreArena comprises 532 newly curated tasks distributed across the same four simulated websites. These tasks are deliberately more demanding, reflecting scenarios in which the agent must engage in data aggregation, memory recall, and multi-step reasoning. Importantly, the benchmark is built for full reproducibility and standardization, enabling fair comparisons between agents and avoiding the ambiguity found in earlier tools. The inclusion of varied task types and input modalities helps simulate realistic web usage and evaluates agents at a more practical and challenging scale.

WebChoreArena divides its tasks into four main types. A total of 117 Massive Memory tasks require the agent to extract and retain large amounts of information, such as compiling all customer names linked to high-value transactions. The 132 Calculation tasks involve arithmetic operations, such as determining the highest-spending month from multiple data points. The 127 Long-Term Memory tasks test the agent's ability to connect information across pages, such as retrieving pricing rules from one site and applying them on another. The remaining 65 tasks are classified as Others, including operations such as assigning labels in GitLab, which do not fit the traditional task formats. Each task also specifies its input modality: 451 tasks can be solved from any observation type, 69 require text input only, and 12 depend on image input alone.
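As an illustration only, a task instance in this taxonomy could be represented roughly as follows. The field names and example values are hypothetical assumptions for this sketch, not the schema of the actual WebChoreArena release.

```python
# Hypothetical sketch of a WebChoreArena-style task record; field names and
# example values are illustrative assumptions, not the benchmark's real schema.
from dataclasses import dataclass
from typing import Literal

@dataclass
class ChoreTask:
    task_id: int
    site: Literal["shopping", "shopping_admin", "reddit", "gitlab", "cross-site"]
    task_type: Literal["massive_memory", "calculation", "long_term_memory", "other"]
    input_mode: Literal["any", "text_only", "image_only"]
    instruction: str
    expected_answer: str  # ground truth used for answer matching

example = ChoreTask(
    task_id=42,
    site="shopping_admin",
    task_type="calculation",
    input_mode="any",
    instruction="Find the month with the highest total spending in 2022.",
    expected_answer="November",
)
```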

When evaluating the benchmark, the researchers used three prominent large language models: GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro. These were tested together with two advanced web agent frameworks, AgentOccam and BrowserGym. The results highlight the increased difficulty of WebChoreArena relative to earlier benchmarks: GPT-4o, which reaches 42.8% accuracy on WebArena, manages only 6.8% on WebChoreArena. Claude 3.7 Sonnet and Gemini 2.5 Pro performed better, with Gemini achieving the top score of 44.9%. Even this best result reflects a substantial capability gap on the more complex WebChoreArena tasks. The benchmark also proved more sensitive in detecting performance differences between models, making it a valuable tool for tracking continued progress in web agent technology.

Key takeaways from the research include:

  • WebChoreArena includes 532 tasks: 117 Massive Memory, 132 Calculation, 127 Long-Term Memory, and 65 Others.
  • Tasks are distributed across Shopping (117), Shopping Admin (132), Reddit (91), GitLab (127), and cross-site (65).
  • Input type: 451 tasks can be solved from any observation type, 69 require text input, and 12 require image input.
  • GPT-4o scored only 6.8% on WebChoreArena, compared with 42.8% on WebArena.
  • Gemini 2.5 Pro achieved the highest score at 44.9%, still indicating the current limits of handling complex tasks.
  • WebChoreArena provides a clearer performance gradient between models than WebArena, strengthening its value as a benchmark.
  • A total of 117 task templates were used, with approximately 4.5 instances per template, to ensure diversity and reproducibility.
  • The benchmark required over 300 hours of annotation and refinement, reflecting its rigorous construction.
  • Evaluation uses string matching, URL matching, and HTML structure comparison to assess accuracy (a rough sketch follows this list).
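As a rough illustration of these checks, the snippet below sketches how string-match and URL-match scoring might be implemented. The function names and normalization rules are assumptions for illustration, not WebChoreArena's actual evaluation code.

```python
# Hedged sketch of string-match and URL-match checks of the kind listed above;
# function names and normalization choices are illustrative assumptions.
from urllib.parse import urlparse, parse_qsl

def string_match(predicted: str, expected: str) -> bool:
    """Exact match after trimming whitespace and ignoring case."""
    return predicted.strip().lower() == expected.strip().lower()

def url_match(predicted: str, expected: str) -> bool:
    """Compare path and query parameters, ignoring scheme/host differences."""
    p, e = urlparse(predicted), urlparse(expected)
    return (p.path.rstrip("/") == e.path.rstrip("/")
            and sorted(parse_qsl(p.query)) == sorted(parse_qsl(e.query)))

# Example usage with made-up values:
print(string_match("  November ", "november"))                            # True
print(url_match("http://site/a/b?x=1&y=2", "https://host/a/b/?y=2&x=1"))  # True
```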

In summary, this study highlights the gap between general browsing ability and the higher-order cognitive abilities required for web-based chores. The newly introduced WebChoreArena is a rigorous, detailed benchmark designed specifically to push web agents into territory where they must rely on reasoning, memory, and logic. It replaces ambiguity with standardization, and its tasks mimic the digital chores that agents will have to learn to handle in real-world automation.


Check out the paper, the GitHub page, and the project page. All credit for this research goes to the researchers of this project.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that provides in-depth coverage of machine learning and deep learning news in a way that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, demonstrating its popularity among readers.
