Artificial Intelligence #web agents #process-level evaluation
Process-Level Evaluation of Web Agents Reveals Hidden Performance Differences in AI Systems
Researchers introduce WebStep, a benchmark of 1,800 task instances that evaluates web agents at the process level using semantic state tracking. Agents with similar success rates show divergent process metrics: OpenAI CUA outperforms Qwen3.5 on commit actions but underperforms it on filtering on the Housing website.
Jun 16, 2026 1 source