New MBABench Evaluates LLM Agents on End-to-End Finance Spreadsheet Tasks

MBABench, a new benchmark described in an arXiv paper, evaluates LLM agents on end-to-end spreadsheet tasks in finance, focusing on modeling, forecasting, and scenario analysis. It scores outputs on accuracy, formula use, and formatting. Claude family models lead but still fall short of professional standards.

iGEN Editorial

June 16, 2026


Enterprise finance teams routinely build spreadsheets for modeling, forecasting, and scenario analysis. A new benchmark indicates that current LLM agents cannot yet reliably produce professional-quality spreadsheets from scratch.

According to a paper published on arXiv titled "MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance," a team of researchers including Yen, Thomson, and Poeltl developed one of the first evaluations of agents on complete spreadsheet workflows. The benchmark addresses a gap: existing spreadsheet benchmarks focus only on question answering or single-formula edits, not on creating a complete artifact end to end.

The Benchmark Design

MBABench targets economically critical financial workflows such as financial modeling, forecasting, and scenario analysis. Recognising that spreadsheet deliverables are routinely reviewed by multiple stakeholders, the researchers designed an evaluation taxonomy with three dimensions: Accuracy, Formula, and Format. Each dimension comprises fine-grained criteria reflecting professional standards.
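To make the three-dimension taxonomy concrete, here is a minimal sketch of how a rubric along those lines could be tallied. Only the dimension names (Accuracy, Formula, Format) come from the paper as reported here; the individual criteria, the pass/fail judgments, and the per-dimension pass-rate aggregation are all hypothetical and are not MBABench's actual scoring method.

```python
from dataclasses import dataclass

# Hypothetical rubric: the three dimension names come from the article;
# the criteria below and the pass-rate aggregation are illustrative assumptions.


@dataclass
class Criterion:
    dimension: str  # "Accuracy", "Formula", or "Format"
    name: str       # hypothetical criterion label
    passed: bool    # reviewer's pass/fail judgment


def dimension_scores(criteria):
    """Return the fraction of passed criteria for each dimension."""
    totals = {}
    passes = {}
    for c in criteria:
        totals[c.dimension] = totals.get(c.dimension, 0) + 1
        passes[c.dimension] = passes.get(c.dimension, 0) + int(c.passed)
    return {d: passes[d] / totals[d] for d in totals}


# Made-up review of a single agent-generated spreadsheet.
review = [
    Criterion("Accuracy", "final output matches reference value", True),
    Criterion("Accuracy", "intermediate totals reconcile", False),
    Criterion("Formula", "outputs are live formulas, not pasted values", True),
    Criterion("Formula", "assumptions referenced by cell, not hardcoded", True),
    Criterion("Format", "inputs visually distinguished from calculations", False),
    Criterion("Format", "units and labels present on every row", False),
]

scores = dimension_scores(review)
print(scores)  # {'Accuracy': 0.5, 'Formula': 1.0, 'Format': 0.0}
```

Reporting each dimension separately, rather than collapsing everything into one number, keeps visible the case the article highlights: a spreadsheet whose formulas are sound but which is still hard for a reviewer to read or modify.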

The tasks require agents to produce entire spreadsheets from high-level user instructions, mimicking real-world demands where a deliverable must be readable, accurate, and easy to modify.

Key Findings: Claude Leads, But All Fall Short

In the evaluation, the Claude family of models led the benchmark and produced the most professional-looking outputs in a qualitative review. However, even the strongest agents frequently fell short of professional finance standards. Performance degraded sharply as task difficulty increased beyond a few chained calculations.

The paper states that "current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand." This suggests that while progress has been made, enterprise-grade automation of spreadsheet creation remains out of reach.

Implications for Enterprise Technology Leaders

For CTOs and digital transformation leaders in finance and adjacent sectors like supply chain, the findings highlight a maturity gap. While LLM agents can handle simple formula tasks, they struggle with the multidimensional requirements of professional financial modeling. The benchmark's emphasis on readability and ease of modification underscores that enterprise users expect not just correct outputs, but outputs that can be reviewed and iterated upon by teams.

As frontier AI labs continue to develop agents for end-to-end workflows, MBABench provides a structured way to measure progress. The paper is available on arXiv under a Creative Commons license.

In summary, the Claude family leads, but no current agent meets professional standards, especially as complexity increases. The benchmark sets a new bar for evaluating whether AI can replace or augment human spreadsheet work in finance.
