
New MBABench Evaluates LLM Agents on End-to-End Finance Spreadsheet Tasks

MBABench, a new benchmark described in an arXiv paper, evaluates LLM agents on end-to-end spreadsheet tasks in finance, focusing on modeling and scenario analysis. The benchmark assesses accuracy, formula use, and formatting. Claude family models lead but still fall short of professional standards.

iGEN Editorial
June 16, 2026

Enterprise finance teams routinely build spreadsheets for modeling, forecasting, and scenario analysis. A new benchmark, however, finds that current LLM agents cannot yet reliably produce professional-quality spreadsheets from scratch.

According to a paper published on arXiv titled "MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance," researchers including Yen, Thomson, Poeltl, and others developed one of the first evaluations of agents on complete spreadsheet workflows. The benchmark addresses a gap: existing spreadsheet benchmarks focus only on question answering or single-formula edits, not on end-to-end artifact creation.

The Benchmark Design

MBABench targets economically critical financial workflows such as financial modeling, forecasting, and scenario analysis. Recognizing that spreadsheet deliverables are routinely reviewed by multiple stakeholders, the researchers designed an evaluation taxonomy with three dimensions: Accuracy, Formula, and Format. Each dimension comprises fine-grained criteria reflecting professional standards.
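To make the three-dimension taxonomy concrete, here is a minimal Python sketch of how per-dimension scores might be tallied from pass/fail criteria. Only the dimension names (Accuracy, Formula, Format) come from the paper as reported here. The `Criterion` class, the `score_by_dimension` function, the example criteria, and the equal weighting of criteria are all hypothetical illustrations, not the benchmark's actual scoring method.

```python
from dataclasses import dataclass

# Dimension names come from the article; everything else in this sketch
# (criteria, equal weighting, pass/fail verdicts) is an illustrative assumption.
DIMENSIONS = ("accuracy", "formula", "format")


@dataclass(frozen=True)
class Criterion:
    dimension: str  # one of DIMENSIONS
    name: str       # hypothetical criterion label
    passed: bool    # verdict from a reviewer or automated checker


def score_by_dimension(criteria):
    """Return the fraction of passed criteria for each dimension."""
    scores = {}
    for dim in DIMENSIONS:
        in_dim = [c for c in criteria if c.dimension == dim]
        # A dimension with no criteria is reported as None rather than zero.
        scores[dim] = sum(c.passed for c in in_dim) / len(in_dim) if in_dim else None
    return scores


# Hypothetical criteria for one generated spreadsheet.
example = [
    Criterion("accuracy", "net income matches reference", True),
    Criterion("accuracy", "scenario totals match reference", False),
    Criterion("formula", "projections use formulas, not hardcoded values", True),
    Criterion("format", "inputs visually distinguished from calculations", True),
]
scores = score_by_dimension(example)
```

Reporting the dimensions separately, rather than as one blended number, mirrors the article's point that a spreadsheet can be numerically correct and still fail review because its formulas or formatting are not up to standard.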

The tasks require agents to produce entire spreadsheets from high-level user instructions, mimicking real-world demands where a deliverable must be readable, accurate, and easy to modify.

Key Findings: Claude Leads, But All Fall Short

In the evaluation, the Claude family of models led the benchmark and produced the most professional-looking outputs in a qualitative review. However, even the strongest agents frequently fell short of professional finance standards. Performance degraded sharply as task difficulty increased beyond a few chained calculations.

The paper states that "current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand." This suggests that while progress has been made, enterprise-grade automation of spreadsheet creation remains out of reach.

Implications for Enterprise Technology Leaders

For CTOs and digital transformation leaders in finance and adjacent sectors like supply chain, the findings highlight a maturity gap. While LLM agents can handle simple formula tasks, they struggle with the multidimensional requirements of professional financial modeling. The benchmark's emphasis on readability and ease of modification underscores that enterprise users expect not just correct outputs, but outputs that can be reviewed and iterated upon by teams.

As frontier AI labs continue to develop agents for end-to-end workflows, MBABench provides a structured way to measure progress. The paper is available on arXiv under a Creative Commons license.

In summary, the Claude family leads but no current agent meets professional standards, especially as complexity increases. This benchmark sets a new bar for evaluating whether AI can truly replace or augment human spreadsheet work in finance.

