Mask-Proof: New LLM Pipeline Automates Data Curation for Mathematical Proofs with 96.8% Accuracy

Researchers introduce Mask-Proof, an LLM-based pipeline that turns real mathematical proofs into automatically checkable masked-step tasks. The resulting Mask-ProofBench contains 292 problems. Seventeen models tested show reasoning-enhanced models outperform standard ones by 12-27%, with the evaluator achieving 96.8% agreement with expert annotators.

iGEN Editorial

June 16, 2026

Enterprise technology leaders evaluating AI for complex reasoning tasks face a persistent challenge: how to measure step-level logic in long, multi-step proofs without costly human grading. A team of researchers has introduced Mask-Proof, an automated data curation pipeline that addresses this gap by converting real mathematical proofs into scalable, machine-checkable exercises.

The Problem: Evaluating Step-Level Reasoning in Long Proofs

According to the arXiv paper by Zhang Jierui, Tan Siyuan, Xinhang, Lin, Longzhuangzhi, Dailin, Gu, Chengfeng, Xinping, Hao, Yaxian, Liang, Shengjia, Ren, Yuxiang, and Liu Wenhao (2026), large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs. Yet the research community lacks a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically.

The Mask-Proof Pipeline: Turning Proofs into Masked-Step Tasks

Mask-Proof addresses this by turning real proofs into automatically checkable masked-step tasks. The pipeline masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The result is Mask-ProofBench, a benchmark containing 292 curated problems across diverse research areas. The researchers designed the pipeline to enable faithful, reproducible, and comparable measurement of step-level mathematical reasoning.
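The masking and repeated-vote steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `mask_step`, `majority_vote`, and `toy_judge` are hypothetical, the proof is a made-up three-line example, and the judge here is a simple string comparison standing in for the LLM-based equivalence judge the paper describes.

```python
from collections import Counter
from typing import Callable, List, Tuple

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_step(steps: List[str], index: int) -> Tuple[List[str], str]:
    """Hide one step of a proof; return the masked proof and the hidden step."""
    masked = list(steps)
    answer = masked[index]
    masked[index] = MASK
    return masked, answer


def majority_vote(
    judge: Callable[[str, str], bool],
    reference: str,
    candidate: str,
    n_votes: int = 5,
) -> bool:
    """Query the judge n_votes times and accept only on a strict majority."""
    votes = Counter(judge(reference, candidate) for _ in range(n_votes))
    return votes[True] > votes[False]


def toy_judge(reference: str, candidate: str) -> bool:
    """Stand-in for an LLM equivalence judge (assumption): ignores whitespace only."""
    return "".join(reference.split()) == "".join(candidate.split())


# Made-up example proof, for illustration only.
proof = ["a = b + 1", "a - 1 = b", "2(a - 1) = 2b"]
masked_proof, hidden = mask_step(proof, 1)
accepted = majority_vote(toy_judge, hidden, "a-1 = b")
rejected = majority_vote(toy_judge, hidden, "a + 1 = b")
```

A real LLM judge can return different verdicts on repeated calls, which is why the votes are aggregated; the toy judge is deterministic, so every vote agrees here.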

Benchmark Results and Model Performance

Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. The evaluator achieves 96.8% agreement with expert annotators, according to the paper. The authors state that the benchmark, annotations, and code are available at a URL given in the paper, and that the work is licensed under a Creative Commons Attribution 4.0 International License.
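For readers unfamiliar with the metric, an agreement figure of this kind is commonly computed as the fraction of items on which the automated judge and the human expert give the same verdict. The sketch below assumes that simple definition; the paper may measure agreement differently, the function name `agreement_rate` is hypothetical, and the labels are invented for illustration rather than drawn from the benchmark.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[bool], expert_labels: Sequence[bool]) -> float:
    """Fraction of items where the judge's verdict matches the expert's."""
    if len(judge_labels) != len(expert_labels):
        raise ValueError("label sequences must have the same length")
    if not judge_labels:
        raise ValueError("at least one labeled item is required")
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(judge_labels)


# Invented labels (True = reconstruction judged equivalent to the masked step).
judge_verdicts = [True, True, False, True, False]
expert_verdicts = [True, False, False, True, False]
rate = agreement_rate(judge_verdicts, expert_verdicts)  # 4 of 5 verdicts match
```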

Implications for Enterprise AI Evaluation

While Mask-Proof is designed for mathematical proofs, its methodology (automated step-level verification using LLM-based judges) has broader implications for enterprise AI. CTOs and digital transformation leaders investing in AI for compliance, auditing, or document analysis may find similar approaches applicable to verifying model reasoning in their own contexts. The pipeline's emphasis on reproducibility and automation addresses a critical need: trustworthy AI evaluation that does not depend on expensive human annotators. As enterprises deploy LLMs for complex tasks, tools like Mask-Proof could help ensure that AI-assisted decision-making meets rigorous verification standards.
