This AI Paper Introduces Web-Shepherd: A Process Reward Model for Web Agents with a 40K Dataset and 10x Cost Efficiency

Web navigation focuses on teaching machines how to interact with websites to perform tasks such as searching for information, shopping, or booking services. Building a capable web navigation agent is difficult because it requires understanding the structure of a website, interpreting user goals, and making a series of decisions across multiple steps. These tasks are further complicated by the need for agents to adapt to dynamic web environments, where content can change frequently and multimodal information such as text and images must be understood together.
A key issue in web navigation is the lack of a reliable, fine-grained reward model that can guide agents in real time. Existing methods rely primarily on multimodal large language models (MLLMs) such as GPT-4o and GPT-4o-mini as evaluators, which are expensive, slow, and often inaccurate, especially when handling long sequences of actions in multi-step tasks. These models use prompting-based evaluation or binary success/failure feedback but fail to provide step-level guidance, often leading to errors such as repeated actions or missed critical steps like clicking a specific button or filling in a form field. This limitation reduces the practicality of deploying web agents in real-world scenarios, where efficiency, accuracy, and cost-effectiveness are critical.
A research team from Yonsei University and Carnegie Mellon University has introduced Web-Shepherd, a process reward model (PRM) designed specifically for web navigation tasks. Web-Shepherd is the first model to evaluate web navigation agents at the step level, using structured checklists to guide its assessments. The researchers also developed the WebPRM Collection, a 40,000-instance dataset of web navigation tasks with step-level annotations, and the WebRewardBench benchmark for evaluating PRMs. These resources are designed to enable Web-Shepherd to provide detailed feedback by breaking complex tasks down into smaller, measurable subgoals.
Web-Shepherd generates a checklist for each task from the user’s instruction, with items such as “Search for the product” or “Click on the product page,” and evaluates the agent’s progress toward these subgoals. The model uses next-token prediction to generate feedback and assigns rewards based on checklist completion. This process enables Web-Shepherd to judge the correctness of each step at a fine-grained level. It estimates the reward for each step by combining the probabilities of the “Yes,” “No,” and “In Progress” tokens and averaging these rewards across the checklist. This detailed scoring system gives agents targeted feedback on their progress, improving their ability to navigate complex websites.
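The scoring step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ItemProbs` class, the function names, the example probabilities, and in particular the 0.5 partial-credit weight for the in-progress label are assumptions made for illustration, since the article does not say how the three probabilities are combined.

```python
from dataclasses import dataclass


@dataclass
class ItemProbs:
    """Label probabilities a reward model assigns to one checklist item (hypothetical container)."""
    yes: float
    no: float
    in_progress: float


# Assumption: an in-progress item earns half credit. The article does not state the weight.
IN_PROGRESS_WEIGHT = 0.5


def item_score(p: ItemProbs, weight: float = IN_PROGRESS_WEIGHT) -> float:
    """Score one item as P(yes) + weight * P(in progress), normalised over the three labels."""
    total = p.yes + p.no + p.in_progress
    if total <= 0:
        raise ValueError("label probabilities must sum to a positive value")
    return (p.yes + weight * p.in_progress) / total


def step_reward(items: list[ItemProbs]) -> float:
    """Average the per-item scores over the whole checklist."""
    if not items:
        raise ValueError("checklist must contain at least one item")
    return sum(item_score(p) for p in items) / len(items)


# Illustrative probabilities for a two-item checklist (made-up numbers).
checklist = {
    "Search for the product": ItemProbs(yes=0.90, no=0.05, in_progress=0.05),
    "Click on the product page": ItemProbs(yes=0.20, no=0.20, in_progress=0.60),
}
reward = step_reward(list(checklist.values()))
print(f"step reward: {reward:.4f}")
```

Under these assumptions, a step that has clearly completed the first item and is partway through the second receives a reward between the two item scores, which is the kind of graded signal that binary success/failure feedback cannot express.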
The researchers show that Web-Shepherd significantly outperforms existing models. On the WebRewardBench benchmark, Web-Shepherd achieves a Mean Reciprocal Rank (MRR) of 87.6% and a trajectory accuracy of 55%, compared with 47.5% MRR and 0% trajectory accuracy for GPT-4o-mini without a checklist. When tested on WebArena-lite with GPT-4o-mini as the policy model, Web-Shepherd achieved a 34.55% success rate, 10.9 points higher than using GPT-4o-mini as the evaluator, while being ten times more cost-efficient. In ablation studies, the researchers observed that Web-Shepherd’s performance dropped sharply when the checklist or the feedback was removed, demonstrating their importance for accurate reward assignment. They also show that, surprisingly, multimodal input does not always improve performance and sometimes introduces noise.
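For readers unfamiliar with the ranking metric reported above, the following Python sketch shows how Mean Reciprocal Rank is conventionally computed when a reward model scores several candidate actions per step. The data layout and function names are hypothetical, and the `trajectory_correct` helper reflects an assumed definition of trajectory accuracy (the correct action ranked first at every step), which the article does not spell out.

```python
def reciprocal_rank(scores: list[float], correct_index: int) -> float:
    """Return 1 / rank of the correct candidate, ranking candidates by descending score."""
    target = scores[correct_index]
    # Rank is 1 plus the number of candidates scored strictly higher (ties resolve in favour
    # of the correct candidate, which is a simplifying choice made for this sketch).
    rank = 1 + sum(1 for s in scores if s > target)
    return 1.0 / rank


def mean_reciprocal_rank(steps: list[tuple[list[float], int]]) -> float:
    """Average the reciprocal rank over all (scores, correct_index) steps."""
    if not steps:
        raise ValueError("need at least one step")
    return sum(reciprocal_rank(s, i) for s, i in steps) / len(steps)


def trajectory_correct(steps: list[tuple[list[float], int]]) -> bool:
    """Assumed definition: the correct action is ranked first at every step."""
    return all(reciprocal_rank(s, i) == 1.0 for s, i in steps)


# Made-up reward scores for three steps, each with three candidate actions.
steps = [
    ([0.9, 0.4, 0.1], 0),  # correct action ranked 1st
    ([0.3, 0.8, 0.5], 2),  # correct action ranked 2nd
    ([0.2, 0.1, 0.7], 2),  # correct action ranked 1st
]
mrr = mean_reciprocal_rank(steps)
all_first = trajectory_correct(steps)
print(f"MRR: {mrr:.4f}, trajectory correct: {all_first}")
```

Under this assumed definition, a single misranked step makes the whole trajectory count as incorrect, which would explain how a model can have a moderate MRR and still score very low on trajectory accuracy.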
This study highlights the key role of detailed process-level rewards in building reliable web agents. The team’s work addresses a core challenge of web navigation, evaluating complex multi-step actions, and offers a solution that is both scalable and cost-effective. With Web-Shepherd, agents can now receive accurate feedback during navigation, allowing them to make better decisions and complete tasks more efficiently.
All credit for this research goes to the researchers on this project.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
