A new research benchmark aims to measure how well artificial intelligence systems can complete real-world theorem-proving tasks in Lean, a proof assistant and programming language used for formal verification. The benchmark, called SorryDB, draws tasks from 78 open-source formalization projects on GitHub, according to a paper posted on arXiv.
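In Lean, `sorry` is a placeholder that lets a file compile while a proof is still missing, and the benchmark's name suggests that its tasks are open placeholders of this kind. As a minimal illustration (the theorem names and statement below are made up for this article and are not taken from SorryDB), such a task and its completion might look like this:

```lean
-- Hypothetical example: the statement is illustrative only.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  sorry  -- Lean accepts this with a warning; the proof is still owed

-- A completed version replaces the placeholder with an actual proof.
theorem add_comm_example' (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

Tasks in real projects would typically depend on many project-specific definitions and lemmas, which is the kind of complexity competition-style problems tend to lack.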
The Problem with Existing Benchmarks
Most existing benchmarks for AI theorem provers rely on competition-style problems, which may not reflect the complexity and dependencies encountered in real-world formalization. The researchers state that "hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies." Because its tasks come from a dynamically updating set drawn from live GitHub projects, SorryDB is also intended to mitigate test-set contamination and to measure an agent's ability to contribute to novel formal mathematics projects.
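One way to see why a continuously updated task pool helps against contamination: tasks that appeared after a model's training cutoff cannot have been in its training data. The sketch below illustrates that idea only. The `Task` record, its field names, and the sample data are hypothetical and do not reflect SorryDB's actual schema or tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions for illustration."""
    task_id: str
    repo: str
    added_on: date  # when the open proof obligation first appeared in the repo


def uncontaminated(tasks: list[Task], training_cutoff: date) -> list[Task]:
    """Keep only tasks that appeared strictly after the model's training cutoff."""
    return [t for t in tasks if t.added_on > training_cutoff]


# Made-up sample data, not real benchmark entries.
tasks = [
    Task("t1", "example/repo-a", date(2024, 3, 1)),
    Task("t2", "example/repo-b", date(2025, 6, 15)),
    Task("t3", "example/repo-a", date(2025, 9, 2)),
]

fresh = uncontaminated(tasks, training_cutoff=date(2025, 1, 1))
print([t.task_id for t in fresh])  # ['t2', 't3']
```

A static benchmark has no tasks newer than its release date, so this filter would eventually return nothing for newer models. A live benchmark keeps producing fresh tasks.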
Evaluation on 1,000 Tasks
In their initial evaluation, the team tested a collection of approaches over a selected snapshot of 1,000 tasks from SorryDB. These approaches include generalist large language models, agentic methods, and specialized symbolic provers. The results reveal that current approaches are complementary: "even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics." This suggests that no single method dominates across all problem types, and that combining strategies could yield better overall performance.
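The claim that the best approach is "not strictly better" than the others can be read in terms of sets: an approach is strictly better than another only if it solves everything the other solves, plus more. The sketch below uses made-up task IDs and approach names to illustrate that reading. None of the numbers come from the paper.

```python
# Hypothetical results: the set of task IDs each approach solved (made-up data).
solved = {
    "agentic": {1, 2, 3, 4, 5, 6},
    "generalist_llm": {1, 2, 7},
    "symbolic_prover": {2, 3, 8},
    "tactic_list": {1, 9},
}


def strictly_better(a: str, b: str) -> bool:
    """True if approach a solves every task b solves, plus at least one more."""
    return solved[b] < solved[a]  # proper-subset test on Python sets


# The single approach with the most solved tasks.
best = max(solved, key=lambda name: len(solved[name]))

# Tasks solved by at least one approach.
union = set().union(*solved.values())

# Tasks some other approach solved but the best single approach missed.
missed_by_best = union - solved[best]

print(best)                    # agentic
print(len(solved[best]))       # 6
print(len(union))              # 9
print(sorted(missed_by_best))  # [7, 8, 9]
print(any(strictly_better(best, other) for other in solved if other != best))  # False
```

In this toy data the top approach solves the most tasks, yet each of the others solves at least one task it misses, so a portfolio that tries all four would cover more than any single method.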
Implications for Enterprise AI
For enterprise technology leaders, the research highlights how hard reliable automated reasoning remains in complex environments. Formal verification tools such as Lean can be applied to critical systems, such as smart contracts, aerospace software, and logistics algorithms, where mathematical correctness is essential. The SorryDB benchmark offers a way to track progress in AI-powered theorem proving against tasks drawn from real formalization projects rather than artificial competition problems.
| Approach | Relative performance (qualitative) |
|---|---|
| Agentic (Gemini Flash) | Most performant overall |
| Generalist LLMs | Competitive on some tasks |
| Specialized symbolic provers | Strong on structured proofs |
| Curated Lean tactics | Complementary to others |
The researchers conclude that further work is needed to integrate the strengths of different methods. In supply chain and logistics technology, where routing algorithms, inventory optimization, and trade documentation integrity could benefit from formal proofs, the SorryDB benchmark could help guide investment in AI tools that improve system reliability. However, the paper's results are based on a single snapshot of 1,000 tasks; evaluations on later snapshots will show whether the same pattern holds as the benchmark updates.
Future Directions
Because SorryDB is designed to be continuously updated, it will evolve as new formalization projects appear on GitHub. This creates a moving target for AI systems, forcing researchers to develop methods that generalize beyond static test sets. For organizations deploying AI in trade and supply chain, the ability to formally verify critical logic remains a long-term goal, and benchmarks like SorryDB help measure incremental advances in that direction.