Enterprise finance teams routinely build spreadsheets for modeling, forecasting, and scenario analysis. A new benchmark finds, however, that current LLM agents cannot yet reliably produce professional-quality spreadsheets from scratch.
In a paper published on arXiv titled "MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance," researchers including Yen, Thomson, Poeltl, and others present one of the first evaluations of agents on complete spreadsheet workflows. The benchmark addresses a gap: existing spreadsheet benchmarks focus only on question answering or single-formula edits, not on end-to-end artifact creation.
The Benchmark Design
MBABench targets economically critical finance workflows such as modeling, forecasting, and scenario analysis. Recognizing that spreadsheet deliverables are routinely reviewed by multiple stakeholders, the researchers designed an evaluation taxonomy with three dimensions: Accuracy, Formula, and Format. Each dimension comprises fine-grained criteria reflecting professional standards.
The tasks require agents to produce entire spreadsheets from high-level user instructions, mimicking real-world demands where a deliverable must be readable, accurate, and easy to modify.
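To make the three-dimension taxonomy concrete, here is a minimal sketch of how per-criterion checks might be rolled up into one score per dimension. Only the dimension names (Accuracy, Formula, Format) come from the paper as described above. The criterion labels, the pass/fail representation, and the simple pass-rate aggregation are assumptions made for illustration, not the benchmark's actual rubric or scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    dimension: str  # "Accuracy", "Formula", or "Format" (names from the paper)
    name: str       # hypothetical criterion label, invented for this sketch
    passed: bool    # assumed binary outcome; the real rubric may differ


def dimension_scores(results):
    """Return the fraction of passed criteria for each dimension."""
    totals = {}
    passes = {}
    for r in results:
        totals[r.dimension] = totals.get(r.dimension, 0) + 1
        passes[r.dimension] = passes.get(r.dimension, 0) + (1 if r.passed else 0)
    return {d: passes[d] / totals[d] for d in totals}


# Hypothetical review of one agent-produced spreadsheet.
example = [
    CriterionResult("Accuracy", "final value matches reference", True),
    CriterionResult("Accuracy", "intermediate totals match", False),
    CriterionResult("Formula", "outputs use formulas, not hard-coded values", True),
    CriterionResult("Format", "inputs visually separated from calculations", True),
]

scores = dimension_scores(example)
print(scores)
```

Reporting each dimension separately, rather than a single blended number, reflects the article's point that a spreadsheet can be numerically correct yet still fail a reviewer on formula hygiene or layout.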
Key Findings: Claude Leads, But All Fall Short
In the evaluation, the Claude family of models led the benchmark and produced the most professional-looking outputs in a qualitative review. However, even the strongest agents frequently fell short of professional finance standards. Performance degraded sharply as task difficulty increased beyond a few chained calculations.
The paper states that "current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand." This suggests that while progress has been made, enterprise-grade automation of spreadsheet creation remains out of reach.
Implications for Enterprise Technology Leaders
For CTOs and digital transformation leaders in finance and adjacent sectors like supply chain, the findings highlight a maturity gap. While LLM agents can handle simple formula tasks, they struggle with the multidimensional requirements of professional financial modeling. The benchmark's emphasis on readability and ease of modification underscores that enterprise users expect not just correct outputs, but outputs that can be reviewed and iterated upon by teams.
As frontier AI labs continue to develop agents for end-to-end workflows, MBABench provides a structured way to measure progress. The paper is available on arXiv under a Creative Commons license.
In summary, the Claude family leads, but no current agent reliably meets professional standards, and the gap widens as complexity increases. MBABench offers a concrete yardstick for judging how far AI can replace or augment human spreadsheet work in finance.