
MA-ProofBench: New Benchmark Tests LLMs on Formal Theorem Proving in Mathematical Analysis

Researchers introduce MA-ProofBench, the first formal theorem-proving benchmark dedicated to mathematical analysis. It contains 200 theorems across six topics at two difficulty levels. Evaluations show that even the best model, GPT-5.5, achieves only 16% Pass@8 on undergraduate-level problems and 5% on Ph.D.-level problems, highlighting significant limitations of current LLMs in formal mathematical reasoning.

iGEN Editorial
June 16, 2026

Large language models (LLMs) have made notable progress in automated theorem proving, but existing formal benchmarks are limited in mathematical coverage and difficulty, concentrating on algebra and elementary number theory while neglecting deeper fields like mathematical analysis. To address this gap, researchers have introduced MA-ProofBench, according to a paper posted on arXiv (2606.13782). The benchmark is, to the best of the authors' knowledge, the first formal theorem-proving benchmark dedicated to mathematical analysis.

Benchmark Design

MA-ProofBench contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels: an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems). This two-tiered structure is intended to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem was constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review to ensure faithfulness to the original mathematics.

Evaluation Results

The researchers evaluated a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. The results reveal poor performance overall. The best-performing model, GPT-5.5, achieved only 16% Pass@8 on Level I and 5% on Level II. Most models stayed close to 0% on Level II. The following table summarizes the findings:

Model                          Level I Pass@8    Level II Pass@8
GPT-5.5                        16%               5%
Other models (not specified)   not given         ~0%

The paper reports that most models performed poorly, with the best model achieving only 16% on the easier level and 5% on the harder level.
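The article does not say how Pass@8 was computed for this benchmark. As a minimal sketch, the Python below uses the unbiased pass@k estimator that is common in code-generation evaluation, where n attempts are sampled per problem and c of them are correct. The function names and the sample data are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts is correct,
    given n sampled attempts of which c were correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k attempts
        # must include at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: 100 problems, 8 attempts each, 16 problems solved once.
hypothetical = [(8, 1)] * 16 + [(8, 0)] * 84
print(benchmark_pass_at_k(hypothetical, 8))  # 0.16
```

When n equals k, as in this made-up data, the estimator reduces to the share of problems solved by at least one of the 8 attempts, which is the simplest reading of a 16% Pass@8 score on 100 problems.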

Failure Modes and Implications

Further analysis identified two dominant failure modes: Mathlib hallucinations and incomplete proofs. Mathlib hallucinations are cases where the model cites lemmas or definitions that do not exist in Mathlib, the community-maintained mathematics library for the Lean theorem prover. Incomplete proofs are cases where the model stops before finishing the reasoning chain. Additionally, an evaluation on a natural-language version of the benchmark exposed a clear gap between informal and formal reasoning, suggesting that LLMs can understand problem statements in plain language but struggle to produce correct formal proofs.
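To make the two failure modes concrete, here is a small Lean 4 sketch on a toy statement that is not drawn from the benchmark. It assumes a project with Mathlib available. The lemma name `Continuous.restrict_to_interval` is invented for this illustration and stands in for a hallucinated reference; `Continuous.continuousOn` is an existing Mathlib lemma.

```lean
import Mathlib

-- Toy statement: a continuous real function is continuous on [0, 1].

-- (1) Hallucinated reference: `Continuous.restrict_to_interval` is a
-- made-up name, so Lean would reject this proof with an unknown-identifier
-- error. It is left commented out so the file still compiles.
-- example (f : ℝ → ℝ) (hf : Continuous f) : ContinuousOn f (Set.Icc 0 1) :=
--   Continuous.restrict_to_interval hf

-- (2) Incomplete proof: `sorry` leaves the goal open. Lean accepts the file
-- with a warning, but the theorem does not count as proved.
example (f : ℝ → ℝ) (hf : Continuous f) : ContinuousOn f (Set.Icc 0 1) := by
  sorry

-- (3) Complete proof using a lemma that exists in Mathlib.
example (f : ℝ → ℝ) (hf : Continuous f) : ContinuousOn f (Set.Icc 0 1) :=
  hf.continuousOn
```

Under a formal checker, only the third attempt would count as a solved problem, however plausible the first two might read as informal arguments.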

MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains. The paper's authors are listed as Pu, Lushi; Zhang, Weiming; Xie, Xinheng; Fu, Zixuan; He, Bingxiang; Lyu, Hongya; Li, Zhou, Jie; and Wang, Yudong. The benchmark's code and data are associated with the arXiv paper.

For enterprise technology leaders evaluating AI capabilities, MA-ProofBench highlights the limitations of current LLMs in tasks requiring rigorous formal reasoning. While consumer applications may appear fluent, these results underscore the gap between informal language understanding and formal verification, a critical consideration for any high-stakes deployment in regulated industries.

