Process-Level Evaluation of Web Agents Reveals Hidden Performance Differences in AI Systems

Researchers introduce WebStep, a benchmark of 1,800 task instances that evaluates web agents at the process level using semantic state tracking. The key finding is that agents with similar success rates can have divergent process metrics: on the Housing website, for example, OpenAI CUA outperforms Qwen3.5 on commit actions but underperforms it on filtering.

iGEN Editorial
June 16, 2026
Traditional benchmarks for web agents measure only whether a task is completed successfully, discarding all process information and offering little guidance on how to improve performance. A new approach aims to change that by introducing process-level evaluation with semantic state tracking.

The work, detailed in a preprint on arXiv, introduces WebStep, a benchmark comprising 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website in the benchmark exposes a deterministic semantic MDP (Markov Decision Process) alongside the graphical user interface: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation.
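To make the idea of background state tracking concrete, here is a minimal sketch of a deterministic semantic MDP that logs transitions while an agent acts. It is an illustration only, not WebStep's implementation: the state names, the action names, the `TRANSITIONS` table, and the `SemanticTracker` class are all hypothetical, and the rule that an unrecognized action leaves the state unchanged is an assumption.

```python
from dataclasses import dataclass, field

# Hypothetical semantic states and actions for a toy housing-search site.
# Deterministic: each (state, action) pair maps to exactly one next state.
TRANSITIONS = {
    ("home", "open_search"): "search",
    ("search", "apply_filter"): "filtered",
    ("filtered", "open_listing"): "listing",
    ("listing", "commit"): "done",
}


@dataclass
class SemanticTracker:
    """Records high-level transitions in the background while the agent uses the GUI."""

    state: str = "home"
    log: list = field(default_factory=list)

    def record(self, action: str) -> str:
        # Assumption: an action with no defined transition leaves the state unchanged.
        next_state = TRANSITIONS.get((self.state, action), self.state)
        self.log.append((self.state, action, next_state))
        self.state = next_state
        return next_state


tracker = SemanticTracker()
for action in ["open_search", "open_listing", "apply_filter", "open_listing", "commit"]:
    tracker.record(action)

for before, action, after in tracker.log:
    print(f"{before} --{action}--> {after}")
```

Because every step is logged as a (state, action, next state) triple, wasted or ineffective actions, such as the premature `open_listing` in this toy run, are visible in the log even though the episode ends in success.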

Process Metrics Reveal Hidden Differences

The researchers evaluated three web agents and found that while their success rates clustered within a narrow range of 31–33%, process-level analysis uncovered significant differences. One agent showed superior exploration reach while another excelled in execution accuracy. According to the paper, "process metrics reveal differences invisible to outcome evaluation."

Decomposing performance by skill further characterized these differences. On the Housing website, for example, OpenAI CUA outperformed Qwen3.5 by 23.7% on commit actions yet underperformed by 15.6% on filtering. This exposes opposite per-skill rankings hidden within the same website, pinpointing a concrete skill to improve even within a single domain.

Agents                   Skill            Performance difference
OpenAI CUA vs Qwen3.5    Commit actions   +23.7% (CUA better)
OpenAI CUA vs Qwen3.5    Filtering        -15.6% (CUA worse)
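A per-skill comparison of this kind can be sketched as follows. The article does not say how the paper computes its per-skill scores, so this sketch assumes each logged step is tagged with a skill and a correct/incorrect flag. The function names and the step logs are invented for illustration and do not reproduce the reported figures.

```python
from collections import defaultdict


def per_skill_accuracy(steps):
    """steps: iterable of (skill, correct) pairs. Returns {skill: fraction correct}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for skill, correct in steps:
        totals[skill] += 1
        hits[skill] += int(correct)
    return {skill: hits[skill] / totals[skill] for skill in totals}


def skill_gaps(acc_a, acc_b):
    """Accuracy of agent A minus accuracy of agent B, for each skill both agents attempted."""
    return {skill: acc_a[skill] - acc_b[skill] for skill in acc_a if skill in acc_b}


# Made-up step logs for two hypothetical agents.
agent_a = [("commit", True), ("commit", True), ("filter", False), ("filter", True)]
agent_b = [("commit", True), ("commit", False), ("filter", True), ("filter", True)]

gaps = skill_gaps(per_skill_accuracy(agent_a), per_skill_accuracy(agent_b))
print(gaps)
```

In this toy data, agent A leads on `commit` and trails on `filter`, the same kind of opposite per-skill ranking that a single aggregate score would hide.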

Error Localization and Task Difficulty

Bifurcation analysis further localizes the decisive error that causes the agent to lose the task. The researchers report that this error is agent-specific rather than shared, meaning different agents fail on different critical steps. These differences widen as tasks grow harder: success rates are similar on easy tasks but diverge sharply as exploration becomes more demanding.
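The article does not define bifurcation analysis precisely. One plausible reading, sketched below under that assumption, treats the decisive error as the first step that moves the agent from a state where the goal is still reachable into one where it no longer is. The toy transition table, the state names, and both function names are hypothetical.

```python
from collections import deque


def can_reach_goal(transitions, goal):
    """Return the set of states from which `goal` is reachable (reverse breadth-first search)."""
    reverse = {}
    for (state, _action), nxt in transitions.items():
        reverse.setdefault(nxt, set()).add(state)
    seen = {goal}
    queue = deque([goal])
    while queue:
        current = queue.popleft()
        for prev in reverse.get(current, ()):
            if prev not in seen:
                seen.add(prev)
                queue.append(prev)
    return seen


def decisive_step(trajectory, transitions, goal):
    """Index of the first step leaving the recoverable region, or None if no such step exists."""
    recoverable = can_reach_goal(transitions, goal)
    for i, (state, _action, nxt) in enumerate(trajectory):
        if state in recoverable and nxt not in recoverable:
            return i
    return None


# Toy deterministic MDP with one irreversible wrong branch.
TOY = {
    ("search", "apply_filter"): "filtered",
    ("filtered", "open_listing"): "listing",
    ("listing", "commit"): "done",
    ("filtered", "open_wrong_listing"): "wrong_listing",
    ("wrong_listing", "commit"): "wrong_done",
}

trajectory = [
    ("search", "apply_filter", "filtered"),
    ("filtered", "open_wrong_listing", "wrong_listing"),
    ("wrong_listing", "commit", "wrong_done"),
]

step = decisive_step(trajectory, TOY, goal="done")
print(step)
```

Running the same function over the logged trajectories of several agents would show whether they lose the task at the same step or, as the researchers report, at different ones.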

Implications for AI Development

The authors conclude that "process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved."

For enterprise technology leaders evaluating AI assistants for tasks such as form filling, data entry, or workflow automation, this benchmark underscores the importance of looking beyond final success rates. Process-level diagnostics could help identify whether an agent struggles with navigation, data extraction, or decision-making, which in turn could guide targeted improvements and vendor selection.

The work was authored by Jiwan Chung, JiHyuk Byun, Vibhav Vineet, and Seon Joo Kim. Their benchmark and methodology are available on arXiv under a Creative Commons license.

