Large language models (LLMs) have demonstrated an ability to handle first-order logic (FOL) reasoning in various domains, yet their effectiveness on complex, multi-step mathematical deductions remains limited. According to a recent paper on arXiv, the model Deepseek-Prover-V2-7B achieves only 4.2% accuracy on a newly proposed theorem proving dataset, highlighting a significant gap between current capabilities and the demands of advanced reasoning tasks.
The Challenge: Multi-Step FOL Reasoning
The researchers note that while LLMs perform competitively on established mathematical reasoning benchmarks, they consistently struggle with multi-step FOL tasks. Deepseek-Prover-V2-7B's 4.2% accuracy on the curated dataset underscores the difficulty. The authors attribute this to two primary issues: limited exploration of diverse proof strategies, and the tendency for early reasoning mistakes to cascade and undermine entire proofs.
Introducing DREAM: Self-Adaptive Reasoning
To address these shortcomings, the paper introduces DREAM, a self-adaptive solution designed to enhance the Diversity and REAsonability of LLMs' generation strategies. DREAM incorporates two key mechanisms:
- Axiom-Driven Strategy Diversification: Promotes varied strategic outcomes by leveraging axioms to guide exploration of different proof paths.
- Sub-Proposition Error Feedback: Enables LLMs to reflect on intermediate steps and correct errors before they propagate.
These mechanisms work together during the inference stage, requiring no additional training data or model modifications.
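The interplay of the two mechanisms can be pictured as a nested loop at inference time. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function `dream_style_search`, the `generate` and `check` callables, the `max_repairs` parameter, and the toy stand-ins are all hypothetical names introduced for illustration. The outer loop tries one strategy per candidate axiom (diversification), and the inner loop feeds the checker's error message back into the next generation attempt (error feedback). The article describes feedback at the level of sub-propositions; this sketch simplifies that to a single error message per attempt.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    axiom: str   # axiom that seeded the strategy
    proof: str   # proof text accepted by the checker


def dream_style_search(
    theorem: str,
    axioms: List[str],
    generate: Callable[[str, str, Optional[str]], str],
    check: Callable[[str, str], Optional[str]],
    max_repairs: int = 2,
) -> Optional[Attempt]:
    """Try one strategy per axiom and repair each using checker feedback.

    `generate` stands in for an LLM call; `check` stands in for a proof
    checker that returns None on success or an error message on failure.
    Both are assumptions made for this sketch.
    """
    for axiom in axioms:                      # diversify: one strategy per axiom
        feedback: Optional[str] = None        # each strategy starts fresh
        for _ in range(max_repairs + 1):      # initial try plus repairs
            proof = generate(theorem, axiom, feedback)
            feedback = check(theorem, proof)
            if feedback is None:              # checker accepted the proof
                return Attempt(axiom, proof)
    return None                               # every strategy failed


# Toy stand-ins so the sketch runs without a model or a proof assistant.
def toy_generate(theorem: str, axiom: str, feedback: Optional[str]) -> str:
    # Pretend the model only finds the right proof after seeing an error.
    if axiom == "induction" and feedback is not None:
        return "induction n; simp"
    return f"apply {axiom}"


def toy_check(theorem: str, proof: str) -> Optional[str]:
    # Pretend exactly one proof text is valid.
    return None if proof == "induction n; simp" else f"step failed: {proof}"


result = dream_style_search(
    "n + 0 = n", ["contradiction", "induction"], toy_generate, toy_check
)
print(result)
```

In this toy run the first strategy exhausts its repair budget, and the second succeeds only after one round of feedback, which is the behavior the two mechanisms are meant to produce together. Because the loop only wraps calls to the model, nothing here requires retraining.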
Performance Gains and Dataset
The proposed solution yields measurable improvements. According to the paper, DREAM boosts performance by 0.6% to 6.4% over baseline methods across different models and configurations. The evaluation was conducted on a curated dataset of 447 mathematical theorems formatted in Lean 4, a proof assistant language.
| Metric | Value |
|---|---|
| Deepseek-Prover-V2-7B accuracy on proposed dataset | 4.2% |
| DREAM performance improvement range | 0.6% to 6.4% |
| Dataset size | 447 theorems |
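To give a sense of what a theorem formatted in Lean 4 looks like, here is a small first-order statement with a two-step proof. It is a hypothetical illustration written for this article, not an entry from the paper's dataset, and the name `forall_imp_example` is made up.

```lean
-- Hypothetical example, not taken from the paper's dataset.
-- If every x with P also has Q, and every x has P, then every x has Q.
theorem forall_imp_example {α : Type} (P Q : α → Prop)
    (h₁ : ∀ x, P x → Q x) (h₂ : ∀ x, P x) : ∀ x, Q x := by
  intro x              -- fix an arbitrary x
  exact h₁ x (h₂ x)    -- apply the implication to the fact P x
```

Even here the proof depends on getting the first step right: a wrong opening tactic leaves a goal that the second step cannot close, which is the kind of cascading error the authors describe.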
The researchers describe their contributions as pioneering work on LLMs' mathematical reasoning through FOL theorem proving, together with a novel inference-stage solution that requires no retraining.
The authors of the study are Chuxue Cao, Mengze, Juntao Dai, Jinluan Yang, Zijian Zhao, Shengyu Zhang, Weijie Shi, Chengzhong Liu, Sirui Han, and Yike Guo. Their work, released on arXiv, represents a focused effort to improve the logical reasoning capabilities of LLMs, a critical step for applications in automated theorem proving and beyond.