SorryDB Benchmark Tests AI Provers on Real-World Lean Theorem Completion Tasks

Researchers present SorryDB, a benchmark of open Lean tasks drawn from 78 GitHub projects. Evaluating a snapshot of 1,000 tasks, they find that current approaches are complementary: an agentic method based on Gemini Flash performs best overall but is not strictly better than the alternatives.

iGEN Editorial

June 17, 2026

A new research benchmark aims to measure how well artificial intelligence systems can complete real-world theorem-proving tasks in Lean, a proof assistant and programming language used for formal verification. The benchmark, called SorryDB, draws tasks from 78 open-source formalization projects on GitHub, according to a paper posted on arXiv.
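For readers unfamiliar with Lean, the benchmark's name points to the `sorry` keyword, which Lean accepts as a placeholder for a proof that has not yet been written. The snippet below is a minimal illustration of what an open task of this kind can look like and how it might be closed. The theorem and its names are hypothetical and are not taken from SorryDB or from any of the indexed projects.

```lean
-- Hypothetical open task: the statement type-checks,
-- but the proof is left as a `sorry` placeholder.
theorem add_comm_open (a b : Nat) : a + b = b + a := by
  sorry

-- One possible completion, using a lemma from Lean's core library.
theorem add_comm_closed (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

Real tasks in active projects typically depend on many surrounding definitions and lemmas, which is the kind of context a toy example like this one does not capture.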

The Problem with Existing Benchmarks

Most existing benchmarks for AI theorem provers rely on competition-style problems, which may not reflect the complexity and dependencies encountered in real-world formalization. The researchers state that "hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies." By using a dynamically updating set of tasks from live GitHub projects, SorryDB also mitigates test-set contamination and provides a robust metric for an agent's ability to contribute to novel formal mathematics projects.

Evaluation on 1,000 Tasks

In their initial evaluation, the team tested a collection of approaches over a selected snapshot of 1,000 tasks from SorryDB. These approaches include generalist large language models, agentic methods, and specialized symbolic provers. The results reveal that current approaches are complementary: "even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics." This suggests that no single method dominates across all problem types, and that combining strategies could yield better overall performance.
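The distinction between "most performant" and "strictly better" can be made concrete with sets of solved tasks. The sketch below uses invented task IDs and approach names purely for illustration; none of the numbers come from the paper. It treats one approach as strictly better than another only if it solves everything the other solves plus at least one more task.

```python
from typing import Dict, Set


def strictly_better(a: Set[int], b: Set[int]) -> bool:
    """Return True if `a` solves every task in `b` and at least one more."""
    return b < a  # proper-subset test


def best_approach(solved: Dict[str, Set[int]]) -> str:
    """Name of the approach that solves the most tasks."""
    return max(solved, key=lambda name: len(solved[name]))


def combined_coverage(solved: Dict[str, Set[int]]) -> Set[int]:
    """Tasks solved by at least one approach."""
    return set().union(*solved.values())


# Hypothetical results: task IDs solved by each approach (illustrative only).
solved = {
    "agentic": {1, 2, 3, 4, 5, 6},
    "generalist_llm": {1, 2, 3, 7},
    "specialized_prover": {2, 4, 8},
    "tactic_list": {1, 9},
}

best = best_approach(solved)
dominates_all = all(
    strictly_better(solved[best], tasks)
    for name, tasks in solved.items()
    if name != best
)

print(best, len(solved[best]))          # top single approach and its count
print(len(combined_coverage(solved)))   # tasks solved by any approach
print(dominates_all)                    # whether the top approach dominates
```

In this made-up data the top approach solves the most tasks, yet each of the others solves at least one task it misses, so the combined coverage exceeds that of any single approach.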

Implications for Enterprise AI

For enterprise technology leaders, the research highlights the ongoing challenge of achieving reliable automated reasoning in complex environments. Formal verification tools like Lean are increasingly used in critical systems, such as smart contracts, aerospace software, and logistics algorithms, where mathematical correctness is essential. The SorryDB benchmark offers a way to track progress in AI-powered theorem proving on tasks drawn from active formalization projects, rather than on artificial competition problems.

Approach	Performance (relative)
Agentic (Gemini Flash)	Most performant overall
Generalist LLMs	Competitive on some tasks
Specialized symbolic provers	Strong on structured proofs
Curated Lean tactics	Complementary to others

The researchers conclude that further work is needed to integrate the strengths of different methods. For supply chain and logistics technology, where verification of routing algorithms, inventory optimization, and trade documentation integrity can benefit from formal proofs, the SorryDB benchmark could help guide investment in AI tools that improve system reliability. However, the paper's results are based on a single snapshot; continuous updates to the benchmark will provide a more dynamic view of progress.

Future Directions

Because SorryDB is designed to be continuously updated, it will evolve as new formalization projects appear on GitHub. This creates a moving target for AI systems, forcing researchers to develop methods that generalize beyond static test sets. For organizations deploying AI in trade and supply chain, the ability to formally verify critical logic remains a long-term goal, and benchmarks like SorryDB help measure incremental advances in that direction.

