Artificial Intelligence #large language models #theorem proving
MA-ProofBench: New Benchmark Tests LLMs on Formal Theorem Proving in Mathematical Analysis
Researchers introduce MA-ProofBench, the first formal theorem-proving benchmark dedicated to mathematical analysis. It contains 200 theorems across six topics at two difficulty levels, undergraduate and Ph.D. Evaluations show that even the best-performing model, GPT-5.5, reaches only 16% Pass@8 on undergraduate-level problems and 5% on Ph.D.-level problems, which points to significant limitations of current LLMs in formal mathematical reasoning.
Jun 16, 2026 1 source
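The summary above reports results as Pass@8 but does not say how the metric is computed. The sketch below assumes the common definition: a theorem counts as solved if at least one of k sampled proof attempts is accepted by the proof checker, and the benchmark score is the mean over theorems. The function names and the toy outcomes are hypothetical and are not taken from MA-ProofBench.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one theorem.

    n: number of proof attempts sampled
    c: number of those attempts the proof checker accepted
    k: attempt budget (k = 8 for Pass@8)

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    With n == k this reduces to "1.0 if any attempt passed, else 0.0".
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[list[bool]], k: int) -> float:
    """Average pass@k over theorems; results[i][j] is True if attempt j
    on theorem i was accepted (assumed data layout)."""
    scores = [pass_at_k(len(attempts), sum(attempts), k) for attempts in results]
    return sum(scores) / len(scores)


# Hypothetical outcomes for four theorems, eight attempts each.
toy_results = [
    [False] * 8,           # never proved
    [True] + [False] * 7,  # proved on the first attempt
    [False] * 8,           # never proved
    [False] * 7 + [True],  # proved on the last attempt
]
toy_score = benchmark_pass_at_k(toy_results, k=8)  # 0.5 on this toy data
```

Under this assumed definition, a 16% Pass@8 on a set of theorems would mean that for roughly one theorem in six, at least one of eight attempts produced a proof the checker accepted.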