Process-Level Evaluation of Web Agents Reveals Hidden Performance Differences in AI Systems

Researchers introduce WebStep, a benchmark of 1,800 task instances that evaluates web agents at the process level using semantic state tracking. Key findings show that agents with similar success rates have divergent process metrics, with OpenAI CUA outperforming Qwen3.5 on commit actions but underperforming on filtering on the Housing website.

iGEN Editorial

June 16, 2026


Traditional benchmarks for web agents measure only whether a task is completed successfully, discarding all process information and offering little guidance on how to improve performance. A new approach from researchers at an undisclosed institution aims to change that by introducing process-level evaluation with semantic state tracking.

The work, detailed in a preprint on arXiv, introduces WebStep, a benchmark comprising 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website in the benchmark exposes a deterministic semantic Markov decision process (MDP) alongside the graphical user interface. The agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation.
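To make the idea of background state tracking concrete, here is a minimal sketch in Python. It is not WebStep's implementation: the transition table, the state and action names, and the `SemanticTracker` class are all hypothetical, chosen only to show how a deterministic semantic layer could log transitions while an agent clicks through a page.

```python
from dataclasses import dataclass, field

# Hypothetical transition table for a toy housing site:
# (semantic state, semantic action) -> next semantic state.
TRANSITIONS = {
    ("home", "open_search"): "search",
    ("search", "apply_filter"): "filtered",
    ("filtered", "open_listing"): "listing",
    ("listing", "submit_application"): "applied",
}


@dataclass
class SemanticTracker:
    """Records high-level transitions alongside whatever the agent does in the GUI."""
    state: str = "home"
    trace: list = field(default_factory=list)

    def record(self, action: str) -> str:
        # Deterministic lookup; an action with no entry leaves the state unchanged.
        next_state = TRANSITIONS.get((self.state, action), self.state)
        self.trace.append((self.state, action, next_state))
        self.state = next_state
        return next_state


def run_episode(actions):
    """Replay a list of semantic actions and return the tracker with its trace."""
    tracker = SemanticTracker()
    for action in actions:
        tracker.record(action)
    return tracker
```

Because the table is deterministic, the recorded trace can be replayed and analyzed after the episode without a human labeling each step, which is the property the article attributes to the benchmark.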

Process Metrics Reveal Hidden Differences

The researchers evaluated three web agents and found that while their success rates clustered within a narrow range of 31–33%, process-level analysis uncovered significant differences. One agent showed superior exploration reach while another excelled in execution accuracy. According to the paper, "process metrics reveal differences invisible to outcome evaluation."

Decomposing performance by skill further characterized these differences. On the Housing website, for example, OpenAI CUA outperformed Qwen3.5 by 23.7% on commit actions yet underperformed by 15.6% on filtering. This exposes opposite per-skill rankings hidden within the same website, pinpointing a concrete skill to improve even within a single domain.

Comparison (Housing website)	Skill	Performance Difference
OpenAI CUA vs Qwen3.5	Commit actions	+23.7% (CUA better)
OpenAI CUA vs Qwen3.5	Filtering	-15.6% (CUA worse)
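A per-skill comparison of this kind can be sketched as follows. This is an illustration under stated assumptions, not the paper's metric: it assumes each logged step carries a skill label and a success flag, and the function names `skill_accuracy` and `skill_gap` are hypothetical.

```python
from collections import defaultdict


def skill_accuracy(steps):
    """steps: iterable of (skill, succeeded) pairs. Returns {skill: fraction succeeded}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for skill, succeeded in steps:
        totals[skill] += 1
        hits[skill] += int(succeeded)
    return {skill: hits[skill] / totals[skill] for skill in totals}


def skill_gap(steps_a, steps_b):
    """Per-skill accuracy of agent A minus agent B, over skills both agents attempted."""
    acc_a = skill_accuracy(steps_a)
    acc_b = skill_accuracy(steps_b)
    return {skill: acc_a[skill] - acc_b[skill] for skill in acc_a.keys() & acc_b.keys()}
```

A positive gap on one skill and a negative gap on another, for the same pair of agents on the same website, is the pattern the table above reports.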

Error Localization and Task Difficulty

Bifurcation analysis further localizes the decisive error that causes the agent to lose the task. The researchers report that this error is agent-specific rather than shared, meaning different agents fail on different critical steps. These differences widen as tasks grow harder: success rates are similar on easy tasks but separate sharply as exploration becomes more demanding.
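The article does not spell out how the decisive error is found, so the following Python sketch rests on an assumption: in a deterministic state graph, the decisive error is taken to be the first step that moves the agent from a state where the goal is still reachable to one where it is not. The graph format and the function names are hypothetical.

```python
def can_reach_goal(graph, goal):
    """Return the set of states from which `goal` is reachable.

    graph: {state: {action: next_state}} describing a deterministic state graph.
    """
    good = {goal}
    changed = True
    while changed:  # repeat until no new state can be added
        changed = False
        for state, edges in graph.items():
            if state not in good and any(nxt in good for nxt in edges.values()):
                good.add(state)
                changed = True
    return good


def decisive_error(trace, graph, goal):
    """Index of the first step that leaves the goal-reachable set, or None.

    trace: list of (state, action, next_state) tuples from one episode.
    """
    good = can_reach_goal(graph, goal)
    for index, (state, _action, next_state) in enumerate(trace):
        if state in good and next_state not in good:
            return index
    return None
```

Comparing the returned index across agents on the same task would show whether they lose the task at the same step or at different ones.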

Implications for AI Development

The authors present WebStep as a diagnostic tool rather than another leaderboard. They conclude that "process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved."

For enterprise technology leaders evaluating AI assistants for tasks such as form filling, data entry, or workflow automation, this benchmark underscores the importance of looking beyond final success rates. Process-level diagnostics could help identify whether an agent struggles with navigation, data extraction, or decision-making, which in turn could guide targeted improvements and vendor selection.

The work was authored by Chung, Jiwan; Byun, JiHyuk; Vineet, Vibhav; and Kim, Seon Joo. Their benchmark and methodology are available on arXiv under a Creative Commons license.

