Large language models (LLMs) have made notable progress in automated theorem proving, but existing formal benchmarks are limited in mathematical coverage and difficulty, concentrating on algebra and elementary number theory while neglecting deeper fields like mathematical analysis. To address this gap, researchers have introduced MA-ProofBench, according to a paper posted on arXiv (2606.13782). The benchmark is, to the best of the authors' knowledge, the first formal theorem-proving benchmark dedicated to mathematical analysis.
Benchmark Design
MA-ProofBench contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels: an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems). This two-tiered structure is intended to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem was constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review to ensure faithfulness to the original mathematics.
Evaluation Results
The researchers evaluated a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. The results reveal poor performance overall. The best-performing model, GPT-5.5, achieved only 16% Pass@8 on Level I and 5% on Level II. Most models stayed close to 0% on Level II. The following table summarizes the findings:
| Model | Level I Pass@8 | Level II Pass@8 |
|---|---|---|
| GPT-5.5 | 16% | 5% |
| Most other models | not reported | ~0% |
Only GPT-5.5's scores are broken out here. For the remaining models, the paper's finding is that most stayed close to 0% on Level II.
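For readers unfamiliar with the metric, the following is a minimal sketch of how a Pass@8 score could be computed. It uses the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), which with n = k = 8 reduces to "the problem counts as solved if at least one of 8 attempts is verified". The article does not describe the paper's exact sampling protocol, so this choice is an assumption, and the function names and the per-problem counts below are hypothetical, chosen only to reproduce the reported 16% headline figure.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled proofs, c of them verified."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one verified proof.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem pass@k estimates over the whole benchmark level."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: 100 Level I problems, 16 of which have exactly one
# verified proof among 8 samples, and 84 of which have none.
level_one_counts = [1] * 16 + [0] * 84
level_one_score = benchmark_pass_at_k(level_one_counts, n=8, k=8)
print(f"Level I Pass@8: {level_one_score:.0%}")
```

With more than 8 samples per problem (n > k), the same function gives a lower-variance estimate of Pass@8 than simply taking the first 8 attempts.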
Failure Modes and Implications
Further analysis identified two dominant failure modes: Mathlib hallucinations and incomplete proofs. A Mathlib hallucination occurs when the model cites a lemma, definition, or statement that does not exist in Mathlib, the community-maintained mathematics library for the Lean theorem prover. An incomplete proof is one in which the model stops before finishing the reasoning chain, leaving part of the statement unproved. Additionally, an evaluation on a natural-language version of the benchmark exposed a clear gap between informal and formal reasoning, suggesting that LLMs can understand problem statements in plain language but struggle to produce correct formal proofs.
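The two failure modes can be illustrated with a small Lean 4 sketch. The statement below is a deliberately simple fact chosen for illustration and is not taken from MA-ProofBench; the lemma name in the second attempt is invented to mimic a hallucination and, to the best of our knowledge, does not exist in Mathlib. It is left commented out so that the file still compiles.

```lean
import Mathlib

-- Failure mode 1: incomplete proof.
-- `sorry` leaves the goal open, so a checker would not accept this as a proof.
example (f : ℝ → ℝ) (hf : Continuous f) : ContinuousOn f (Set.Icc 0 1) := by
  sorry

-- Failure mode 2: Mathlib hallucination.
-- The lemma name below is invented for illustration; uncommenting this attempt
-- is expected to fail because Lean cannot resolve the name.
-- example (f : ℝ → ℝ) (hf : Continuous f) : ContinuousOn f (Set.Icc 0 1) := by
--   exact Continuous.continuousOn_of_unit_interval hf

-- For comparison, a proof that uses an existing Mathlib lemma.
example (f : ℝ → ℝ) (hf : Continuous f) : ContinuousOn f (Set.Icc 0 1) :=
  hf.continuousOn
```

Both failures are caught mechanically by the Lean checker, which is what distinguishes formal evaluation from grading a natural-language argument.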
MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains. The paper's authors, listed surname first, include Pu, Lushi; Zhang, Weiming; Xie, Xinheng; Fu, Zixuan; He, Bingxiang; Lyu, Hongya; Li, Zhou, Jie; and Wang, Yudong. The benchmark's code and data are associated with the arXiv paper.
For enterprise technology leaders evaluating AI capabilities, MA-ProofBench highlights the limitations of current LLMs in tasks requiring rigorous formal reasoning. While consumer applications may appear fluent, these results underscore the gap between informal language understanding and formal verification, a critical consideration for any high-stakes deployment in regulated industries.