
Evaluating Web Agents: Challenges and New Approaches with AgentRewardBench

Web agents that are controlled through natural language and carry out tasks in a web browser are becoming increasingly important. Accurately evaluating their performance, however, is difficult. Traditional rule-based methods quickly reach their limits: they require laborious adaptation to every new task and do not always recognize successful trajectories. Human evaluation is accurate but slow and expensive. Automatic evaluation with large language models (LLMs) looks like a promising alternative, but until now its effectiveness has been unclear.
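
To illustrate why rule-based checks are brittle, the following minimal sketch shows a hypothetical hard-coded success rule for a single shopping task. The class name FinalState, the function rule_based_success, and the URL and page text are invented for illustration; they are not taken from any specific benchmark or from the paper discussed here.

```python
# Minimal sketch (hypothetical, not from any benchmark): a typical rule-based
# success check for one web agent task. It marks a trajectory as successful
# only if the final URL and page text match hard-coded patterns, which is why
# such rules must be rewritten for every new task and can miss trajectories
# that reach the goal by a different route.

from dataclasses import dataclass


@dataclass
class FinalState:
    """Hypothetical summary of where the agent ended up."""
    url: str
    page_text: str


def rule_based_success(state: FinalState) -> bool:
    # Hard-coded expectations for one specific task
    # ("find the cheapest laptop and open its product page").
    return (
        "/product/" in state.url
        and "add to cart" in state.page_text.lower()
    )


if __name__ == "__main__":
    # A trajectory that solved the task via a search results overlay
    # would be scored as a failure even though the user goal was met.
    final = FinalState(
        url="https://shop.example/search?q=laptop",
        page_text="Cheapest laptop: $299 - Buy now",
    )
    print("rule-based verdict:", rule_based_success(final))  # False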

AgentRewardBench: A New Benchmark for Evaluating Web Agents

To investigate how well LLMs can judge web agents, AgentRewardBench, the first benchmark of its kind, was developed. It comprises 1,302 trajectories drawn from five existing benchmarks and produced by four different LLMs. Each trajectory was reviewed by an expert who answered questions about whether the agent succeeded, caused unintended side effects, or fell into repetitive behavior. On this basis, twelve LLM-based judges were evaluated. The results show that no single LLM performs best across all benchmarks. The study also finds that the rule-based evaluation used in common benchmarks tends to underestimate the success rate of web agents, which exposes a weakness of rule-based approaches and underscores the need for more flexible automatic evaluation.
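
The sketch below shows one way such a comparison between judge verdicts and expert annotations could be organized. The data structure AnnotatedTrajectory, the function precision_recall, and the sample labels are assumptions made for illustration; they do not reproduce the benchmark's actual data format or the exact metrics reported in the paper.

```python
# Minimal sketch (assumed structure, not the benchmark's code) of scoring an
# LLM judge against expert annotations: each trajectory carries the expert's
# answers on success, side effects, and repetitive behavior, plus the verdict
# of one LLM judge, and the judge is measured by how well its success
# verdicts agree with the expert.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnnotatedTrajectory:
    trajectory_id: str
    expert_success: bool    # expert's answer: did the agent complete the task?
    has_side_effects: bool  # expert's answer: unintended changes along the way?
    is_repetitive: bool     # expert's answer: did the agent loop on actions?
    judge_success: bool     # verdict produced by one LLM judge


def precision_recall(items: List[AnnotatedTrajectory]) -> Tuple[float, float]:
    """Precision and recall of the judge's success verdicts vs. expert labels."""
    tp = sum(t.judge_success and t.expert_success for t in items)
    fp = sum(t.judge_success and not t.expert_success for t in items)
    fn = sum(not t.judge_success and t.expert_success for t in items)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    sample = [
        AnnotatedTrajectory("t1", True, False, False, True),
        AnnotatedTrajectory("t2", False, True, False, True),   # false positive
        AnnotatedTrajectory("t3", True, False, True, False),   # false negative
    ]
    p, r = precision_recall(sample)
    print(f"judge precision={p:.2f}, recall={r:.2f}")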

The Importance of Accurate Evaluations

Accurate evaluation of web agents is crucial for the further development and improvement of this technology. It lets developers identify the strengths and weaknesses of their agents and optimize them in a targeted way, which in turn increases reliability and efficiency and helps unlock their potential across a wide range of applications. From automating everyday tasks to supporting complex research, dependable web agents can fundamentally change how we interact with the internet.

Future Research and Development

AgentRewardBench provides valuable insights for future research and development on web agents. Its results can guide the development of more robust and accurate automatic evaluation methods. Combining LLM judges with other techniques, such as reinforcement learning, could yield further improvements, as could specialized LLMs trained specifically to evaluate web agents. Research in this area will help make web agents more capable and more reliable, and ultimately an indispensable tool for interacting with the digital space.

The Role of Mindverse

As a German company specializing in AI-powered content creation, research, and analysis, Mindverse is following the developments in the field of web agents with great interest. The development of customized AI solutions, such as chatbots, voicebots, AI search engines, and knowledge systems, is at the heart of Mindverse's work. The findings from AgentRewardBench are of great importance for the development and optimization of these solutions. The accurate evaluation of agent trajectories plays a crucial role in developing robust and reliable AI systems. Mindverse strives to integrate the latest research results into its products and offer its customers innovative and powerful AI solutions.

Bibliography: Xing Han Lù et al. "AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories." arXiv preprint arXiv:2504.08942 (2025).