Large language models (LLMs) are being deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs, according to a new paper on arXiv (2606.16262). Until now, however, no controlled benchmark had measured whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. The paper introduces UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges.
UXBench comprises local-first runnable web fixtures spanning ten product-surface families, paired with coverage-gated browser exploration that forces models to collect interaction evidence before reporting. Each judge model produces a structured UX report over seven rubric dimensions; report quality is measured by whether a fixed downstream repair agent can improve the interface based on the critique.
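The two mechanisms described above, gating a report on collected interaction evidence and scoring it by downstream repair outcome, can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `ExplorationLog`, `coverage_gate`, and `repair_lift` are hypothetical, as are the 0.8 threshold, the definition of coverage as the fraction of required interactions visited, and the definition of lift as a simple after-minus-before difference in some interface quality score.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ExplorationLog:
    """Hypothetical record of what a judge model touched in one fixture."""

    required: set[str]                       # interactions the fixture expects to be exercised
    visited: set[str] = field(default_factory=set)

    def coverage(self) -> float:
        # Assumption: coverage = share of required interactions actually visited.
        if not self.required:
            return 1.0
        return len(self.required & self.visited) / len(self.required)


def coverage_gate(log: ExplorationLog, threshold: float = 0.8) -> bool:
    """Allow a report only once enough interaction evidence exists.

    The 0.8 default is an illustrative value, not taken from the paper.
    """
    return log.coverage() >= threshold


def repair_lift(score_before: float, score_after: float) -> float:
    """Assumed lift: change in interface score after the fixed repair agent
    applies the judge's critique. Positive means the critique helped."""
    return score_after - score_before


if __name__ == "__main__":
    log = ExplorationLog(required={"open_menu", "submit_form", "resize", "tab_nav"})
    log.visited.update({"open_menu", "submit_form"})
    print(coverage_gate(log))   # gate stays closed at 50% coverage
    log.visited.update({"resize", "tab_nav"})
    print(coverage_gate(log))   # gate opens at full coverage
```

Holding the repair agent fixed is what makes a lift of this kind attributable to the critique: if only the judge varies, differences in the repaired interface reflect differences in report quality.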
The evaluation combines an automated repair-lift protocol with a blind human validation study. The researchers evaluated eight frontier models under these conditions. Results show that UX judging is neither saturated nor one-dimensional: models differ meaningfully in report actionability, exhibit distinct rubric-level repair signatures, vary in fixture-level reliability, and trade leadership across surface categories.
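To make "trading leadership across surface categories" concrete, the sketch below groups per-fixture lift values by category and reports which model has the highest mean in each. It is an illustration only: the function name `leaders_by_category`, the record layout, and the use of a plain mean are assumptions, and the model names, category labels, and numbers in the usage example are invented placeholders rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean


def leaders_by_category(records):
    """Return {category: model with the highest mean lift}.

    `records` is an iterable of (model, category, lift) tuples.
    Aggregating by mean is an assumption made for this sketch.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for model, category, lift in records:
        grouped[category][model].append(lift)

    leaders = {}
    for category, per_model in grouped.items():
        # Pick the model whose average lift in this category is largest.
        leaders[category] = max(per_model, key=lambda m: mean(per_model[m]))
    return leaders


if __name__ == "__main__":
    # Placeholder inputs, not benchmark data.
    toy_records = [
        ("model_a", "category_1", 0.2),
        ("model_b", "category_1", 0.1),
        ("model_a", "category_2", 0.0),
        ("model_b", "category_2", 0.3),
    ]
    print(leaders_by_category(toy_records))
```

When different categories map to different leaders, as in the toy input, a single overall ranking hides exactly the variation the paper reports.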
For enterprise technology leaders evaluating AI for usability testing, UXBench provides a structured methodology to assess whether LLM-generated critiques are actionable enough to drive real improvements. The benchmark's design, which requires evidence collection before reporting and measures downstream repair success, offers a template for validating AI-assisted quality assurance in software development pipelines.
The findings underscore that no single model leads across all surface categories, suggesting that procurement decisions for UX evaluation tools should consider the specific product surfaces being tested. As LLMs become integrated into continuous integration and deployment workflows, benchmarks like UXBench help ensure that automated critiques translate into measurable interface improvements rather than plausible-sounding but ineffective feedback.