Study Reveals Serious Robustness Flaws in Proof Autoformalization for Lean 4

A new arXiv preprint presents the first systematic study on the robustness of proof autoformalization in Lean 4, introducing a benchmark with global and local perturbations. Evaluating seven recent LLM-based models on miniF2F and MATH-500, the study finds all are sensitive to global paraphrasing and mostly fail to faithfully reflect local changes, raising concerns for dependable formal verification.

iGEN Editorial

June 16, 2026

The promise of automatic formal verification, in which human-written mathematical proofs are converted into machine-checkable formal code, has gained momentum with large language models (LLMs). However, a new preprint posted to arXiv reports that current proof autoformalization systems are far from robust, potentially undermining their reliability for critical software verification tasks.

The preprint, authored by Zhengtao Gui, Sheng Yang, and Zhouxing Shi, presents the first evaluation focused specifically on the robustness of proof autoformalization in the Lean 4 theorem prover. The authors argue that a robust autoformalizer must remain faithful even for informal proofs that deviate from the idealized, well-formed examples commonly used in existing benchmarks.

Two Categories of Perturbations

The researchers formulated two distinct types of perturbations to stress-test the models:

Perturbation Type	Description	Expected Behavior
Global perturbation	Paraphrases the entire informal proof in a different writing style	Formalization should remain consistent with the original mathematical content
Local perturbation	Alters a single value, symbol, or proof step, possibly in a counterfactual way	Formalization should faithfully reflect the perturbation, not revert to original or infer a different change

The benchmark was built on two established datasets: miniF2F and MATH-500. The team automatically measured two key metrics: stability of correctness under global perturbations and faithfulness of output under local perturbations.
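As a rough illustration of what these two metrics could look like, the sketch below computes a stability rate and a faithfulness rate over a handful of made-up results. The article does not give the paper's exact definitions, so the names `GlobalCase`, `stability`, `faithfulness`, and the three outcome labels are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical outcome labels for a local perturbation (not taken from the paper).
FAITHFUL = "faithful"      # output reflects the perturbed proof as written
REVERTED = "reverted"      # output matches the original, unperturbed proof
OTHER = "other_change"     # output reflects some different alteration


@dataclass(frozen=True)
class GlobalCase:
    """Correctness of one formalization before and after global paraphrasing."""
    correct_original: bool
    correct_paraphrased: bool


def stability(cases: List[GlobalCase]) -> float:
    """Share of originally correct items that stay correct after paraphrasing.

    Assumed definition: items that were already wrong are excluded, since
    there is no correct behavior for the paraphrase to preserve.
    """
    baseline = [c for c in cases if c.correct_original]
    if not baseline:
        return 0.0
    return sum(c.correct_paraphrased for c in baseline) / len(baseline)


def faithfulness(labels: List[str]) -> float:
    """Share of local-perturbation outputs labeled as faithful."""
    if not labels:
        return 0.0
    return sum(label == FAITHFUL for label in labels) / len(labels)


if __name__ == "__main__":
    # Made-up results, purely to exercise the functions.
    global_cases = [
        GlobalCase(True, True),
        GlobalCase(True, False),
        GlobalCase(False, False),
        GlobalCase(True, True),
    ]
    local_labels = [FAITHFUL, REVERTED, OTHER, FAITHFUL]
    print("stability:", stability(global_cases))
    print("faithfulness:", faithfulness(local_labels))
```

Keeping the two rates separate mirrors the distinction in the table: a global perturbation should leave the output unchanged in substance, while a local perturbation should change it in exactly one way.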

All Models Fail Robustness Tests

"All of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations."

This line from the paper summarizes the results bluntly. The researchers evaluated seven recent models for proof autoformalization, and every one proved sensitive to global paraphrasing of the informal proof. More critically, when a local perturbation was introduced (for example, changing a + to a - or swapping a lemma), the models often either output the original unperturbed formal proof or invented a different alteration altogether, rather than reflecting the actual change.
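A toy Lean 4 example may make the three outcomes concrete. It is not drawn from the benchmark: the arithmetic claim is invented here, and it uses only core Lean 4 natural numbers and the `decide` tactic. Suppose the informal step "5 + 3 = 8" is perturbed to "5 - 3 = 8".

```lean
-- Original informal step: "5 + 3 = 8".
example : 5 + 3 = 8 := by decide

-- Faithful to the perturbation: the statement is kept as written.
-- It is counterfactual, so Lean refutes it rather than proves it.
example : ¬ (5 - 3 = 8) := by decide

-- Failure mode 1 (reverted): the model emits the original statement again.
-- It proves, but it ignores the change from + to -.
example : 5 + 3 = 8 := by decide

-- Failure mode 2 (different alteration): the model "repairs" the result.
-- It proves, but the informal text said 8, not 2.
example : 5 - 3 = 2 := by decide
```

Both failure modes produce statements that Lean accepts, which is why a check based on proof success alone would not catch them.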

The study measures this through automatic evaluation, providing a quantitative framework for future improvements. Code and data are publicly available via the project's GitHub repository (linked in the preprint).

Why This Matters for Enterprise Technology

While the paper targets mathematical proofs, the findings have direct implications for enterprise software that relies on formal verification. Companies in aerospace, autonomous systems, financial trading, and supply chain logistics increasingly depend on formally verified components to ensure safety and correctness. If the tools that automate the creation of those proofs are brittle, the entire verification pipeline becomes suspect.

For CTOs and technology leaders evaluating formal verification tools, this study signals that LLM-based autoformalization is not yet production-ready for mission-critical use. The failure to handle even simple local perturbations means that an autoformalizer can silently diverge from the informal text, yielding formal proofs that pass the Lean checker but do not state what the author actually wrote.

The research also highlights the need for robustness-aware evaluation beyond simple accuracy metrics. The authors' perturbation methodology provides a template that formal verification tool vendors can adopt to stress-test their own systems.

Looking Ahead

As the field of autoformalization matures, addressing these robustness gaps will be essential. The study suggests that current LLM approaches, while promising, lack the deeper mathematical understanding needed to reliably mirror human reasoning under variation. Future work may combine LLMs with rule-based verification or incorporate adversarial training specific to formal proofs.

For now, the message is clear: enterprises should not rely solely on autoformalization from LLMs without human-in-the-loop verification, especially when the cost of a formalization error is high.
