MA-ProofBench: New Benchmark Tests LLMs on Formal Theorem Proving in Mathematical Analysis

Researchers introduce MA-ProofBench, the first formal theorem-proving benchmark dedicated to mathematical analysis. It contains 200 theorems across six topics at two difficulty levels. Evaluations show that even the best model, GPT-5.5, achieves only 16% Pass@8 on undergraduate-level problems and 5% on Ph.D.-level problems, highlighting significant limitations of current LLMs in formal mathematical reasoning.

iGEN Editorial

June 16, 2026


Large language models (LLMs) have made notable progress in automated theorem proving, but existing formal benchmarks are limited in mathematical coverage and difficulty, concentrating on algebra and elementary number theory while neglecting deeper fields like mathematical analysis. To address this gap, researchers have introduced MA-ProofBench, according to a paper posted on arXiv (2606.13782). The benchmark is, to the best of the authors' knowledge, the first formal theorem-proving benchmark dedicated to mathematical analysis.

Benchmark Design

MA-ProofBench contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels: an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems). This two-tiered structure is intended to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem was constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review to ensure faithfulness to the original mathematics.

Evaluation Results

The researchers evaluated a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. The results reveal poor performance overall. The best-performing model, GPT-5.5, achieved only 16% Pass@8 on Level I and 5% on Level II. Most models stayed close to 0% on Level II. The following table summarizes the findings:

Model	Level I Pass@8	Level II Pass@8
GPT-5.5	16%	5%
Other models	(not specified)	~0%

Because each level contains 100 problems, GPT-5.5's scores correspond to roughly 16 Level I theorems and 5 Level II theorems proved within eight attempts.
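For readers unfamiliar with the metric, the following Python sketch shows one common way a Pass@k score is computed: the unbiased estimator widely used in code-generation evaluations. The article does not describe the paper's exact sampling protocol, so the function names and the per-problem counts below are hypothetical and chosen only to reproduce a 16% score on a 100-problem level.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled proofs, c of them verified."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one verified proof.
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over the whole benchmark level."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data (not from the paper): 100 problems, 8 attempts each,
# with 16 problems solved at least once and 84 never solved.
level_one = [1] * 10 + [3] * 6 + [0] * 84
score = benchmark_pass_at_k(level_one, n=8, k=8)
print(f"Pass@8 = {score:.0%}")
```

When exactly eight proofs are sampled per problem, the estimator reduces to a simple question: did at least one of the eight attempts pass the proof checker? The score is then the fraction of problems solved at least once.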

Failure Modes and Implications

Further analysis identified two dominant failure modes: Mathlib hallucinations and incomplete proofs. Mathlib hallucinations are instances where the model cites lemmas or definitions that do not exist in Mathlib, the community-maintained mathematics library for the Lean proof assistant. Incomplete proofs are attempts in which the model stops before finishing the reasoning chain. Additionally, an evaluation on a natural-language version of the benchmark exposed a clear gap between informal and formal reasoning, suggesting that LLMs can understand problem statements in plain language but struggle to produce correct formal proofs.
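The two failure modes can be illustrated with a small Lean sketch. It assumes a Lean 4 project with Mathlib installed. The theorem is a textbook triangle inequality chosen for illustration, not a problem from MA-ProofBench, and the lemma name `Real.abs_triangle_sub` is invented here to stand in for a hallucinated reference.

```lean
import Mathlib

-- Failure mode 1 (hallucinated lemma), kept in a comment so the file compiles.
-- `Real.abs_triangle_sub` is a made-up name for illustration. Citing a constant
-- that Mathlib does not define makes Lean reject the proof as an unknown identifier.
--   example (a b c : ℝ) : |a - c| ≤ |a - b| + |b - c| :=
--     Real.abs_triangle_sub a b c

-- Failure mode 2 (incomplete proof): `sorry` leaves the goal open. Lean only
-- warns about it, but a benchmark checker would not count this as a proof.
example (a b c : ℝ) : |a - c| ≤ |a - b| + |b - c| := by
  sorry

-- An accepted proof: it cites a lemma that actually exists in Mathlib.
example (a b c : ℝ) : |a - c| ≤ |a - b| + |b - c| :=
  abs_sub_le a b c
```

The point of the sketch is that a formal checker accepts nothing short of a complete proof built from real library names, whereas a plausible-sounding informal argument faces no such test.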

MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains. The paper's authors are listed as Pu, Lushi; Zhang, Weiming; Xie, Xinheng; Fu, Zixuan; He, Bingxiang; Lyu, Hongya; Li, Zhou, Jie; and Wang, Yudong. The benchmark's code and data are associated with the arXiv paper.

For enterprise technology leaders evaluating AI capabilities, MA-ProofBench highlights the limitations of current LLMs in tasks requiring rigorous formal reasoning. Consumer applications may appear fluent, but these results underscore the gap between informal language understanding and formal verification, a critical consideration for any high-stakes deployment in regulated industries.
