Large language models often appear strong on symbolic and algorithmic tasks, yet this apparent strength can hide brittle behaviour when problems become longer, harder, or slightly out of distribution, according to a new research paper published on arXiv. To systematically measure this brittleness, researchers introduced RecurrReason, a difficulty-controlled benchmark of four recurrent logic puzzles.
What RecurrReason Tests
RecurrReason consists of four classic puzzles: Tower of Hanoi, River Crossing, Block World, and Checkers Jumping. Each puzzle has a single interpretable difficulty parameter N ranging from 1 to 10, resulting in a total of 10,817 unique puzzles and 285,933 moves. The benchmark uses breadth-first search (BFS) to generate optimal trajectories for each puzzle instance, providing a clear ground truth for evaluating model output.
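As a rough illustration of how BFS yields an optimal trajectory, the sketch below solves a three-peg Tower of Hanoi instance. The state encoding, the peg numbering, the choice of peg 2 as the goal, and the name `hanoi_bfs` are all assumptions made for this example; the article does not describe the benchmark's actual generator.

```python
from collections import deque


def hanoi_bfs(n):
    """Return one shortest move list [(src, dst), ...] for n disks, peg 0 -> peg 2.

    Hypothetical encoding: a state is a tuple of three tuples, each listing
    the disks on a peg from bottom to top (larger number = larger disk).
    """
    tower = tuple(range(n, 0, -1))
    start = (tower, (), ())
    goal = ((), (), tower)

    # parent maps each visited state to (previous_state, move_taken).
    parent = {start: None}
    queue = deque([start])

    while queue:
        state = queue.popleft()
        if state == goal:
            break
        for src in range(3):
            if not state[src]:
                continue  # nothing to move from an empty peg
            disk = state[src][-1]
            for dst in range(3):
                if dst == src:
                    continue
                if state[dst] and state[dst][-1] < disk:
                    continue  # cannot place a larger disk on a smaller one
                pegs = list(state)
                pegs[src] = state[src][:-1]
                pegs[dst] = state[dst] + (disk,)
                nxt = tuple(pegs)
                if nxt not in parent:
                    parent[nxt] = (state, (src, dst))
                    queue.append(nxt)

    # Walk back from the goal to recover the move sequence.
    moves = []
    node = goal
    while parent[node] is not None:
        node, move = parent[node]
        moves.append(move)
    moves.reverse()
    return moves
```

Because BFS explores states in order of distance from the start, the first time it reaches the goal it has found a shortest path, which is what makes the resulting trajectory usable as ground truth for minimality.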
The researchers designed the benchmark to test whether models can produce solutions that are minimal, robust, and stable under controlled difficulty scaling, addressing what they identify as a major limitation of current reasoning benchmarks that primarily check for any valid answer rather than solution quality.
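The difference between "any valid answer" and a minimal one can be made concrete with a small grader. The following is a minimal sketch for Tower of Hanoi only; `apply_moves`, `grade`, and the `optimal_length` argument (assumed to come from a reference solver such as BFS) are hypothetical names, not the paper's evaluation code.

```python
def apply_moves(n, moves):
    """Replay moves on a 3-peg Tower of Hanoi.

    Returns the final pegs (lists, bottom to top), or None if any move is illegal.
    """
    pegs = [list(range(n, 0, -1)), [], []]
    for src, dst in moves:
        if src == dst or not (0 <= src < 3 and 0 <= dst < 3) or not pegs[src]:
            return None
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return None  # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs


def grade(n, moves, optimal_length):
    """Score a proposed solution for validity and for minimality.

    optimal_length is assumed to be supplied by a reference solver.
    """
    final = apply_moves(n, moves)
    solved = final is not None and final[2] == list(range(n, 0, -1))
    return {"valid": solved, "minimal": solved and len(moves) == optimal_length}
```

Under this scheme, a model that solves a one-disk puzzle in two moves (`[(0, 1), (1, 2)]`) is valid but not minimal, which a benchmark that only checks for a correct final state would not distinguish from the one-move solution.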
Model Performance
The paper benchmarks two Transformer families under consistent data splits and evaluation criteria: an encoder-decoder model (T5-style) and a decoder-only model (GPT-2-style). Models are trained on puzzles with difficulty N=1 to 7 and evaluated on both held-out in-distribution instances and harder out-of-distribution instances at N=8 to 10.
| Model | Task | In-Distribution Accuracy | Out-of-Distribution Accuracy |
|---|---|---|---|
| T5 (fine-tuned) | Block World | 97.27% | 81.00% |
| All models | River Crossing | 0.00% | 0.00% |
The fine-tuned, pre-trained T5 achieved 97.27% accuracy on held-out in-distribution (validation) instances and 81.00% out-of-distribution accuracy on Block World, the strongest result in the study. However, all models scored 0.00% on River Crossing under all conditions, indicating a complete failure to learn that task's logic.
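The evaluation protocol in this section (train on N=1 to 7, evaluate on held-out in-distribution instances and on N=8 to 10) can be sketched as a difficulty-based split. The function name, the `"n"` field on each puzzle record, and the 10% hold-out fraction are assumptions for illustration; the article does not state the actual split sizes.

```python
import random


def split_by_difficulty(puzzles, train_max_n=7, holdout_fraction=0.1, seed=0):
    """Split puzzle records into train, in-distribution validation, and OOD test.

    Each puzzle is assumed to be a dict with an integer difficulty under "n".
    holdout_fraction is a placeholder value, not taken from the paper.
    """
    in_dist = [p for p in puzzles if p["n"] <= train_max_n]
    ood = [p for p in puzzles if p["n"] > train_max_n]

    # Shuffle with a fixed seed so the hold-out set is reproducible.
    rng = random.Random(seed)
    rng.shuffle(in_dist)

    n_val = max(1, int(len(in_dist) * holdout_fraction)) if in_dist else 0
    return {
        "train": in_dist[n_val:],
        "val_in_dist": in_dist[:n_val],
        "test_ood": ood,
    }
```

Keeping the harder instances entirely out of training is what lets the out-of-distribution column measure length generalisation rather than memorisation of seen difficulty levels.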
Key Findings on Reasoning Robustness
Failure mode analysis revealed that architecture is a stronger determinant of success than model scale. The researchers also found that pre-training transfers only to puzzles with locally structured transition functions, meaning its benefits are highly specific and do not generalise to all types of reasoning problems.
The dataset and code will be open-sourced upon acceptance of the paper, the authors state. The study involved researchers including Gowrav Mannem, Chowdhury Marzia Mahjabin, Jason Chen, Shivank Garg, and Kevin Zhu.
Implications for Enterprise AI
For enterprise technology leaders deploying AI in critical applications such as supply chain planning, logistics optimisation, or trade document processing, the findings underscore the gap between AI's apparent competence on controlled tasks and its performance on novel or scaled-up problems. The results suggest that careful testing with benchmarks like RecurrReason can reveal hidden weaknesses, including complete failure on certain reasoning types across every model tested. The researchers' emphasis on minimal, robust, and stable solutions aligns with the requirements of production AI systems, where errors carry real costs.