Traditional AI benchmarks for web agents measure only whether a task is completed successfully, discarding all process information and offering little guidance on how to improve performance. A new approach from researchers at an undisclosed institution aims to change that by introducing process-level evaluation with semantic state tracking.
The work, detailed in a preprint on arXiv, introduces WebStep, a benchmark comprising 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website in the benchmark exposes a deterministic semantic MDP (Markov Decision Process) alongside the graphical user interface: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation.
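To make the idea of a semantic MDP recorded behind the interface more concrete, here is a minimal sketch. It is not the benchmark's actual implementation: the class name `SemanticMDP`, the state and action labels, and the rule that an unrecognized action leaves the state unchanged are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticMDP:
    """Toy deterministic MDP: (state, action) -> next state.

    Hypothetical illustration only; names and behavior are assumptions.
    """
    transitions: dict[tuple[str, str], str]
    start: str
    goal: str
    state: str = field(init=False)
    trace: list[tuple[str, str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.state = self.start

    def step(self, action: str) -> str:
        # Assumption: an action with no defined transition leaves the state unchanged.
        nxt = self.transitions.get((self.state, action), self.state)
        # The environment logs every (state, action, next_state) in the background.
        self.trace.append((self.state, action, nxt))
        self.state = nxt
        return nxt

    @property
    def solved(self) -> bool:
        return self.state == self.goal


# Made-up task: search, filter, open a listing, submit.
env = SemanticMDP(
    transitions={
        ("home", "open_search"): "search",
        ("search", "apply_filter"): "filtered",
        ("filtered", "open_listing"): "listing",
        ("listing", "submit"): "done",
    },
    start="home",
    goal="done",
)
for action in ["open_search", "apply_filter", "open_listing", "submit"]:
    env.step(action)
```

Because every transition is logged as a semantic triple, the resulting `env.trace` can be analyzed after the fact without anyone labeling screenshots or clicks by hand.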
Process Metrics Reveal Hidden Differences
The researchers evaluated three web agents and found that while their success rates clustered within a narrow range of 31–33%, process-level analysis uncovered significant differences. One agent showed superior exploration reach while another excelled in execution accuracy. According to the paper, "process metrics reveal differences invisible to outcome evaluation."
Decomposing performance by skill further characterized these differences. On the Housing website, for example, OpenAI CUA outperformed Qwen3.5 by 23.7% on commit actions yet underperformed by 15.6% on filtering. The per-skill rankings thus run in opposite directions on the same website, pinpointing a concrete skill for each agent to improve.
| Comparison | Skill | Difference (CUA relative to Qwen3.5) |
|---|---|---|
| OpenAI CUA vs Qwen3.5 | Commit actions | +23.7% (CUA better) |
| OpenAI CUA vs Qwen3.5 | Filtering | -15.6% (CUA worse) |
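A per-skill comparison of this kind could be computed along the following lines. This is a sketch under assumptions, not the paper's scoring code: the function names, the `(skill, correct)` record format, and the reading of the gap as a difference in percentage points are all hypothetical, and the data is made up for two placeholder agents.

```python
from collections import defaultdict


def per_skill_accuracy(records):
    """Map each skill to the fraction of its steps that were executed correctly.

    records: iterable of (skill, correct) pairs (hypothetical format).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for skill, correct in records:
        totals[skill] += 1
        hits[skill] += int(correct)
    return {skill: hits[skill] / totals[skill] for skill in totals}


def skill_gap(acc_a, acc_b):
    """Difference a - b in percentage points, for skills both agents attempted."""
    shared = acc_a.keys() & acc_b.keys()
    return {skill: round(100 * (acc_a[skill] - acc_b[skill]), 1) for skill in shared}


# Made-up step records for two placeholder agents (not results from the paper).
agent_a = [("commit", True), ("commit", True), ("commit", True), ("commit", False),
           ("filter", True), ("filter", False), ("filter", False), ("filter", False)]
agent_b = [("commit", True), ("commit", True), ("commit", False), ("commit", False),
           ("filter", True), ("filter", True), ("filter", True), ("filter", False)]

gap = skill_gap(per_skill_accuracy(agent_a), per_skill_accuracy(agent_b))
```

In this toy data, agent A leads on commit actions and trails on filtering, the same kind of sign flip the table shows, even though an aggregate score would hide it.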
Error Localization and Task Difficulty
Bifurcation analysis further localizes the decisive error that causes the agent to fail the task. The researchers report that this error is agent-specific rather than shared, meaning different agents fail on different critical steps. These differences widen as tasks grow harder: success rates are similar on easy tasks but diverge sharply as exploration becomes more demanding.
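The article does not spell out how the decisive error is computed. One plausible reading, sketched here as an assumption rather than as the authors' method, is that it is the first step that moves the agent from a state where the goal is still reachable to one where it no longer is. The function names and the toy transition graph are hypothetical.

```python
from collections import deque


def states_reaching_goal(transitions, goal):
    """Return the set of states from which the goal is reachable (backward BFS)."""
    predecessors = {}
    for (src, _action), dst in transitions.items():
        predecessors.setdefault(dst, set()).add(src)
    seen = {goal}
    queue = deque([goal])
    while queue:
        current = queue.popleft()
        for prev in predecessors.get(current, ()):
            if prev not in seen:
                seen.add(prev)
                queue.append(prev)
    return seen


def decisive_error(trace, transitions, goal):
    """Index of the first step that leaves the recoverable region, or None."""
    recoverable = states_reaching_goal(transitions, goal)
    for index, (src, _action, dst) in enumerate(trace):
        if src in recoverable and dst not in recoverable:
            return index
    return None


# Made-up graph: "logged_out" is a dead end with no path to "done".
transitions = {
    ("listing", "add_to_cart"): "cart",
    ("cart", "checkout"): "done",
    ("cart", "close_session"): "logged_out",
}
failed_trace = [
    ("listing", "add_to_cart", "cart"),
    ("cart", "close_session", "logged_out"),
]
successful_trace = [
    ("listing", "add_to_cart", "cart"),
    ("cart", "checkout", "done"),
]

error_step = decisive_error(failed_trace, transitions, "done")
no_error = decisive_error(successful_trace, transitions, "done")
```

Running the same check over traces from different agents on the same task would show whether they go wrong at the same step or at different ones, which is the comparison the researchers describe.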
Implications for AI Development
The authors conclude that "process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved."
For enterprise technology leaders evaluating AI assistants for tasks such as form filling, data entry, or workflow automation, this benchmark underscores the importance of looking beyond final success rates. Process-level diagnostics could help identify whether an agent struggles with navigation, data extraction, or decision-making, which in turn could guide targeted improvements and vendor selection.
The work was authored by Chung, Jiwan; Byun, JiHyuk; Vineet, Vibhav; and Kim, Seon Joo. Their benchmark and methodology are available on arXiv under a Creative Commons license.