The promise of proof autoformalization, the conversion of human-written mathematical proofs into machine-checkable formal code, has gained momentum with large language models (LLMs). However, a new preprint posted to arXiv reports that current proof autoformalization systems are far from robust, potentially undermining their reliability for critical software verification tasks.
The preprint, authored by Zhengtao Gui, Sheng Yang, and Zhouxing Shi, presents the first evaluation focused specifically on the robustness of proof autoformalization in the Lean 4 theorem prover. The authors argue that a robust autoformalizer must remain faithful even for informal proofs that deviate from the idealized, well-formed examples commonly used in existing benchmarks.
Two Categories of Perturbations
The researchers formulated two distinct types of perturbations to stress-test the models:
| Perturbation Type | Description | Expected Behavior |
|---|---|---|
| Global perturbation | Paraphrases the entire informal proof in a different writing style | Formalization should remain consistent with the original mathematical content |
| Local perturbation | Alters a single value, symbol, or proof step, possibly in a counterfactual way | Formalization should faithfully reflect the perturbation, not revert to original or infer a different change |
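To make the local case concrete, here is a small Lean 4 sketch. The statements are hypothetical illustrations and are not taken from the paper's benchmark; they only show what "faithfully reflecting" a one-symbol change could look like when a `+` in the informal proof becomes a `-`.

```lean
-- Hypothetical informal proof: "Since x = 3, we get 2x + 1 = 2·3 + 1 = 7."
theorem original (x : Nat) (h : x = 3) : 2 * x + 1 = 7 := by
  subst h  -- replace x by 3 in the goal
  rfl      -- 2 * 3 + 1 evaluates to 7

-- Locally perturbed informal proof: "Since x = 3, we get 2x - 1 = 2·3 - 1 = 5."
-- A faithful formalization carries the changed operator and value through.
theorem perturbed (x : Nat) (h : x = 3) : 2 * x - 1 = 5 := by
  subst h
  rfl
```

Under this reading, a model that returned `original` for the perturbed text would be reverting the change, and one that returned some third statement would be inferring a different change.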
The benchmark was built on two established datasets: miniF2F and MATH-500. The team automatically measured two key metrics: stability of correctness under global perturbations and faithfulness of output under local perturbations.
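As a rough illustration of the first metric, the sketch below computes one plausible notion of stability: the share of originally correct formalizations that stay correct after the informal proof is paraphrased. The record type, the function name, and this particular definition are assumptions made for illustration; the paper's exact metric may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalResult:
    """Hypothetical record: correctness before and after a global paraphrase."""
    original_correct: bool
    paraphrase_correct: bool


def stability(results):
    """Share of originally correct items that remain correct after paraphrasing.

    Assumed definition for illustration only, not the paper's formula.
    """
    kept = [r.paraphrase_correct for r in results if r.original_correct]
    return sum(kept) / len(kept) if kept else 0.0


# Toy data: two proofs were formalized correctly at first, one of them
# breaks after paraphrasing; the third was never correct and is ignored.
results = [
    GlobalResult(original_correct=True, paraphrase_correct=True),
    GlobalResult(original_correct=True, paraphrase_correct=False),
    GlobalResult(original_correct=False, paraphrase_correct=False),
]
print(stability(results))
```

Conditioning on the originally correct items keeps the score focused on sensitivity to style rather than on baseline accuracy.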
All Models Fail Robustness Tests
"All of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations."
This line from the paper summarizes the results bluntly. The researchers evaluated seven recent models for proof autoformalization. Every model lost correctness when the informal proof was paraphrased globally. More critically, when a local perturbation was introduced (for example, changing a "+" to a "-" or swapping a lemma), the models often either output the original, unperturbed formal proof or invented a different alteration altogether, rather than reflecting the actual change.
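The three outcomes described above (reflecting the change, reverting to the original, or inventing another change) can be tallied with a toy classifier. This is a minimal sketch under strong assumptions: the function names and labels are hypothetical, and whitespace-normalized string equality stands in for whatever comparison the paper's automatic evaluation actually uses.

```python
from collections import Counter


def normalize(formal: str) -> str:
    """Collapse whitespace so layout differences do not count as changes."""
    return " ".join(formal.split())


def classify(output: str, original: str, expected: str) -> str:
    """Label one model output for a locally perturbed informal proof.

    String equality is a deliberate simplification (an assumption);
    a real checker would need a semantic comparison.
    """
    out = normalize(output)
    if out == normalize(expected):
        return "faithful"
    if out == normalize(original):
        return "reverted"
    return "other_change"


def faithfulness_rate(labels):
    """Share of outputs labeled faithful."""
    counts = Counter(labels)
    total = sum(counts.values())
    return counts["faithful"] / total if total else 0.0


# Hypothetical statements: the perturbation turns "+ 1 = 7" into "- 1 = 5".
original = "theorem t (x : Nat) (h : x = 3) : 2 * x + 1 = 7"
expected = "theorem t (x : Nat) (h : x = 3) : 2 * x - 1 = 5"
outputs = [
    "theorem t (x : Nat) (h : x = 3) :  2 * x - 1 = 5",  # reflects the change
    original,                                            # ignores the change
    "theorem t (x : Nat) (h : x = 3) : 2 * x + 2 = 8",   # invents another change
]
labels = [classify(o, original, expected) for o in outputs]
print(labels, faithfulness_rate(labels))
```

Keeping "reverted" and "other_change" as separate labels mirrors the two failure modes the article describes, instead of collapsing them into a single "wrong" bucket.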
The study measures this through automatic evaluation, providing a quantitative framework for future improvements. Code and data are publicly available via the project's GitHub repository (linked in the preprint).
Why This Matters for Enterprise Technology
While the paper targets mathematical proofs, the findings have direct implications for enterprise software that relies on formal verification. Companies in aerospace, autonomous systems, financial trading, and supply chain logistics increasingly depend on formally verified components to ensure safety and correctness. If the tools that automate the creation of those proofs are brittle, the entire verification pipeline becomes suspect.
For CTOs and technology leaders evaluating formal verification tools, this study signals that LLM-based autoformalization is not yet production-ready for mission-critical use. If a model does not faithfully reflect even a simple local change, the formal proof it produces can pass the proof checker while formalizing something other than what the informal text says, so a translation error may go unnoticed.
The research also highlights the need for robustness-aware evaluation beyond simple accuracy metrics. The authors' perturbation methodology provides a template that vendors of formal verification tools can adopt to stress-test their own systems.
Looking Ahead
As the field of autoformalization matures, addressing these robustness gaps will be essential. The study suggests that current LLM approaches, while promising, lack the deeper mathematical understanding needed to reliably mirror human reasoning under variation. Future work may combine LLMs with rule-based verification or incorporate adversarial training specific to formal proofs.
For now, the message is clear: enterprises should not rely solely on autoformalization from LLMs without human-in-the-loop verification, especially when the cost of a formalization error is high.