Large language models (LLMs) perform strongly on reasoning tasks, but a new study suggests that this performance may reflect heuristic approximation rather than faithful logical inference. Computer science researchers Olivia Peiyu Wang, Sanna Wong-Toropainen, Daneshvar Amrollahi, Ryan Bai, Tashvi Bansal, Arush Garg, and Leilani H. Gilpin examined this question in the domain of legal entailment. Their findings, published on arXiv, highlight systematic gaps between benchmark accuracy and logical faithfulness, with direct implications for enterprise adoption of AI in high-stakes decision-making.
The Study: Comparing Three Reasoning Paradigms
The research team compared three approaches on a re-annotated subset of ContractNLI, a legal contract entailment dataset. The paradigms included:
- Pure LLM classification – the model directly predicts entailment from text.
- LLM-based Formal Reasoning – the LLM generates a formal representation (code for the Z3 SMT solver) and then reasons over that representation itself.
- Solver-based Formal Reasoning – the Z3 SMT solver, a symbolic engine from Microsoft, performs the reasoning directly from those formal representations.
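To make the solver-based paradigm concrete, here is a minimal sketch of a three-way entailment check. It is not the study's pipeline: a brute-force truth-table search stands in for Z3 so the example runs with the standard library alone, and the clause encodings, variable names, and the label names `Entailment`, `Contradiction`, and `NotMentioned` are assumptions made for illustration.

```python
from itertools import product

# Hypothetical propositional variables for a toy confidentiality clause.
VARIABLES = ["confidential", "may_share", "survives_termination"]

# Hypothetical premises: the information is confidential, and
# confidential information may not be shared.
PREMISES = [
    lambda m: m["confidential"],
    lambda m: (not m["confidential"]) or (not m["may_share"]),
]


def classify(premises, hypothesis, variables):
    """Classify a hypothesis against premises by enumerating assignments.

    Each premise and the hypothesis is a function from an assignment
    (dict of variable name to bool) to bool.
    """
    satisfying = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(p(assignment) for p in premises):
            satisfying.append(assignment)
    if not satisfying:
        raise ValueError("premises are inconsistent")
    if all(hypothesis(m) for m in satisfying):
        return "Entailment"      # true in every model of the premises
    if not any(hypothesis(m) for m in satisfying):
        return "Contradiction"   # false in every model of the premises
    return "NotMentioned"        # the premises leave it undetermined


if __name__ == "__main__":
    print(classify(PREMISES, lambda m: not m["may_share"], VARIABLES))
    print(classify(PREMISES, lambda m: m["may_share"], VARIABLES))
    print(classify(PREMISES, lambda m: m["survives_termination"], VARIABLES))
```

The point of the sketch is that the label is computed by exhaustive search over the formal representation, not predicted from the text. An SMT solver plays the same role at scale, where enumeration is no longer feasible.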
Five different LLMs were evaluated, though the study does not name specific models. Introducing formal structure improved accuracy, and LLM-based Formal Reasoning achieved the highest benchmark performance. However, the researchers caution that this gain does not imply faithful reasoning.
Key Failure Modes Identified
The study identifies three recurring failure modes that undermine trust in LLM-based formal reasoning:
| Failure Mode | Description |
|---|---|
| Scope laundering | The LLM reports solver-inconsistent classifications without actually executing the underlying formal reasoning, producing conclusions that appear logically grounded but are not. This persists across all models tested. |
| Implicit constraint blindness | The model overlooks logical constraints present in the formal representations. |
| Program synthesis failures | The LLM generates incorrect Z3 code despite structured prompting. |
According to the paper, scope laundering is especially concerning because it affects every model evaluated and "raises serious concerns about the faithfulness of LLM-based formal reasoning as a proxy for symbolic execution."
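One way to surface the first and third failure modes in practice is to run the generated program independently and compare its result with the label the LLM reported. The sketch below assumes that both labels have already been collected. The `Record` fields, the example IDs, and the sample data are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    example_id: str
    llm_reported_label: str      # label the LLM says its formal reasoning produced
    solver_label: Optional[str]  # label from actually running the solver;
                                 # None if the generated program failed to run


def audit(records):
    """Group example IDs by whether the reported label matches the solver."""
    report = {"consistent": [], "mismatch": [], "unverifiable": []}
    for record in records:
        if record.solver_label is None:
            # Program synthesis failure: there is nothing to compare against.
            report["unverifiable"].append(record.example_id)
        elif record.solver_label == record.llm_reported_label:
            report["consistent"].append(record.example_id)
        else:
            # The reported label disagrees with the executed program.
            report["mismatch"].append(record.example_id)
    return report


# Hypothetical sample data, for illustration only.
RECORDS = [
    Record("ex-1", "Entailment", "Entailment"),
    Record("ex-2", "Entailment", "NotMentioned"),
    Record("ex-3", "Contradiction", None),
]

if __name__ == "__main__":
    print(audit(RECORDS))
```

A mismatch does not show which side is wrong, because the generated program may itself be faulty. It only marks the cases where the LLM's stated conclusion is not backed by the execution it appears to rest on.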
The Re-Annotation Reveals a Pragmatic Gap
The team re-annotated a subset of ContractNLI to distinguish pragmatic legal interpretation from strict formal entailment. They found a systematic, measurable gap: a substantial proportion of legally sound inferences cannot be formally derived unless additional, unstated assumptions are supplied. This means that even when an LLM appears to reason correctly, it may be relying on pragmatic shortcuts rather than logical deduction.
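A toy case can illustrate the gap between pragmatic and formal entailment. The clause, the hypothesis, and the bridging assumption below are hypothetical and are not drawn from the dataset. As a stand-in for a solver, the check simply searches every truth assignment for a counterexample.

```python
from itertools import product

VARIABLES = ["returns_all", "retains_copies"]


def entails(premises, hypothesis, variables):
    """Return True if no assignment satisfies the premises but not the hypothesis."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(p(assignment) for p in premises) and not hypothesis(assignment):
            return False  # counterexample: premises hold, hypothesis fails
    return True


# Hypothetical clause: "The recipient shall return all confidential information."
clause = lambda m: m["returns_all"]

# Hypothetical hypothesis: "The recipient does not retain copies."
hypothesis = lambda m: not m["retains_copies"]

# Unstated assumption a lawyer might supply: returning everything
# means keeping no copies.
bridging_assumption = lambda m: (not m["returns_all"]) or (not m["retains_copies"])

if __name__ == "__main__":
    print(entails([clause], hypothesis, VARIABLES))
    print(entails([clause, bridging_assumption], hypothesis, VARIABLES))
```

Without the bridging assumption the check finds a counterexample, and with it the inference goes through. That is the pattern the re-annotation describes: the conclusion is legally reasonable but only formally grounded once the hidden premise is made explicit.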
Implications for Enterprise AI
For enterprise technology leaders evaluating AI for legal, compliance, or regulatory applications, these findings underscore the need for verification mechanisms beyond benchmark accuracy. The study demonstrates that adding formal structure (e.g., Z3 code generation) can improve performance metrics without guaranteeing correct logical reasoning. Scope laundering, in particular, is a hidden risk: the model can produce confidently wrong conclusions that pass surface-level inspection.
Organizations deploying LLMs for contract analysis, compliance checking, or legal reasoning should consider integrating symbolic solvers as a separate verification step rather than treating the LLM's own formal reasoning outputs as trustworthy. The researchers note that program synthesis failures occur even with structured prompting, so the generated Z3 code itself may be incorrect and also needs checking.
As LLMs become integrated into supply chain contract management, trade documentation, and regulatory compliance workflows, the distinction between accuracy and faithfulness becomes critical. The study provides a framework for diagnosing these failure modes and calls for more rigorous evaluation before relying on LLMs for logically grounded decisions.