LLMs Struggle with Multi-Step Logic: New Framework DREAM Boosts Theorem Proving Performance

Large language models (LLMs) have shown promise in mathematical reasoning but struggle with multi-step first-order logic (FOL) tasks. A new paper introduces DREAM, a self-adaptive solution that improves the diversity and reasonability of LLMs' generation strategies, raising performance by up to 6.4% on a dataset of 447 theorems.

iGEN Editorial

June 16, 2026

Large language models (LLMs) have demonstrated an ability to handle first-order logic (FOL) reasoning in various domains, yet their effectiveness on complex, multi-step mathematical deductions remains limited. According to a recent paper on arXiv, the model Deepseek-Prover-V2-7B achieves only 4.2% accuracy on a newly proposed theorem proving dataset, highlighting a significant gap between current capabilities and the demands of advanced reasoning tasks.

The Challenge: Multi-Step FOL Reasoning

The researchers note that while LLMs perform competitively on established mathematical reasoning benchmarks, they consistently struggle with multi-step FOL tasks. Deepseek-Prover-V2-7B's 4.2% accuracy on the curated dataset underscores the difficulty. The authors attribute this to two primary issues: limited exploration of diverse proof strategies, and the tendency for early reasoning mistakes to cascade and undermine entire proofs.

Introducing DREAM: Self-Adaptive Reasoning

To address these shortcomings, the paper introduces DREAM, a self-adaptive solution designed to enhance the Diversity and REAsonability of LLMs' generation strategies. DREAM incorporates two key mechanisms:

Axiom-Driven Strategy Diversification: Promotes varied strategic outcomes by leveraging axioms to guide exploration of different proof paths.
Sub-Proposition Error Feedback: Enables LLMs to reflect on intermediate steps and correct errors before they propagate.

These mechanisms work together during the inference stage, requiring no additional training data or model modifications.
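To make the interaction of the two mechanisms concrete, here is a minimal Python sketch of an inference-time loop in that spirit. It is not the paper's implementation: the function `prove_with_feedback`, its parameters, and the toy `toy_generate` and `toy_check` stand-ins (for an LLM call and a proof checker) are all hypothetical names introduced for illustration. The outer loop tries one strategy per axiom hint, and the inner loop feeds the checker's error on a failing step back into the next attempt.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures, assumed for this sketch only:
#   generate(theorem, axiom_hint, feedback) -> proof attempt
#   check(theorem, proof) -> (is_valid, error message for the failing step)
Generator = Callable[[str, str, Optional[str]], str]
Checker = Callable[[str, str], Tuple[bool, Optional[str]]]


def prove_with_feedback(
    theorem: str,
    axioms: List[str],
    generate: Generator,
    check: Checker,
    max_rounds: int = 2,
) -> Optional[str]:
    """Return the first proof attempt the checker accepts, or None."""
    for axiom in axioms:  # diversification: one strategy per axiom hint
        feedback: Optional[str] = None  # each strategy starts without feedback
        for _ in range(max_rounds):
            proof = generate(theorem, axiom, feedback)
            ok, error = check(theorem, proof)
            if ok:
                return proof
            feedback = error  # error feedback: report the failing step back
    return None


def toy_generate(theorem: str, axiom: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM call: a 'draft' first, a 'fixed' attempt after feedback."""
    return f"{axiom}:{'fixed' if feedback else 'draft'}"


def toy_check(theorem: str, proof: str) -> Tuple[bool, Optional[str]]:
    """Stand-in for a proof checker: accepts exactly one attempt."""
    if proof == "contraposition:fixed":
        return True, None
    return False, "step 2 does not follow"


if __name__ == "__main__":
    result = prove_with_feedback(
        "p -> q", ["modus_ponens", "contraposition"], toy_generate, toy_check
    )
    print(result)
```

In this toy run, neither attempt under the first hint passes, so the loop moves on to the second hint, where the revised attempt is accepted. With `max_rounds=1` no feedback is ever used and the toy search fails, which illustrates why both mechanisms matter in the sketch.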

Performance Gains and Dataset

The proposed solution yields measurable improvements. According to the paper, DREAM boosts performance by 0.6% to 6.4% over baseline methods across different models and configurations. The evaluation was conducted on a curated dataset of 447 mathematical theorems formatted in Lean 4, a proof assistant language.

Metric	Value
Deepseek-Prover-V2-7B accuracy on proposed dataset	4.2%
DREAM performance improvement range	0.6% to 6.4%
Dataset size	447 theorems
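For readers unfamiliar with Lean 4, the following small theorem shows what a multi-step FOL statement and proof look like in that format. It is an illustrative example written for this article, not an entry from the paper's dataset, and the name `forall_imp_chain` is made up. The proof has to establish an intermediate sub-proposition (`hQ`) before reaching the goal, which is the kind of step where an early mistake would invalidate everything after it.

```lean
-- Illustrative only: not taken from the paper's dataset.
theorem forall_imp_chain {α : Type} (P Q R : α → Prop)
    (hPQ : ∀ x, P x → Q x) (hQR : ∀ x, Q x → R x) :
    ∀ x, P x → R x := by
  intro x hP
  -- Intermediate sub-proposition: Q holds for x.
  have hQ : Q x := hPQ x hP
  -- Final step: chain the second hypothesis.
  exact hQR x hQ
```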

The researchers describe their contribution as advancing LLMs' mathematical reasoning through FOL theorem proving, with an inference-stage solution that requires no retraining.

The authors of the study are Cao, Chuxue; Mengze; Dai, Juntao; Yang, Jinluan; Zhao, Zijian; Zhang, Shengyu; Shi, Weijie; Liu, Chengzhong; Han, Sirui; and Guo, Yike. Their work, released on arXiv, represents a focused effort to improve the logical reasoning capabilities of LLMs, a critical step for applications in automated theorem proving and beyond.

