Study Finds LLMs' Legal Reasoning Unfaithful: Scope Laundering and Formalization Flaws Identified

A study comparing LLM classification, LLM-based formal reasoning, and solver-based reasoning on ContractNLI finds that while formal reasoning improves accuracy, it does not guarantee faithfulness. Researchers identify three recurring failure modes: scope laundering, implicit constraint blindness, and program synthesis failures. The findings raise concerns about relying on LLM-based formal reasoning as a proxy for symbolic execution.

iGEN Editorial

June 16, 2026

Large language models (LLMs) perform strongly on reasoning benchmarks, but a new study suggests that this performance may reflect heuristic approximation rather than faithful logical inference. A team of computer science researchers, including Olivia Peiyu Wang, Sanna Wong-Toropainen, Daneshvar Amrollahi, Ryan Bai, Tashvi Bansal, Arush Garg, and Leilani H. Gilpin, examined the question in the domain of legal entailment. Their findings, published on arXiv, highlight systematic gaps between benchmark accuracy and logical faithfulness, with direct implications for enterprise adoption of AI in high-stakes decision-making.

The Study: Comparing Three Reasoning Paradigms

The research team compared three approaches on a re-annotated subset of ContractNLI, a legal contract entailment dataset. The paradigms included:

Pure LLM classification – the model directly predicts entailment from text.
LLM-based Formal Reasoning – the LLM generates formal representations (Z3 SMT solver code) and then reasons over that code itself.
Solver-based Formal Reasoning – the Z3 SMT solver, a symbolic engine from Microsoft, performs the reasoning directly from those formal representations.
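To make the solver-based paradigm concrete, here is a minimal sketch of what "performing the reasoning from a formal representation" amounts to. It is not the study's code: it swaps Z3 for a brute-force truth-table check in plain Python so that it runs without dependencies, and the clause, the variable names, and the three label names are assumptions chosen for illustration.

```python
from itertools import product

def classify(premises, hypothesis, variables):
    """Label a hypothesis against premises by enumerating truth assignments.

    premises and hypothesis are functions that take a dict mapping each
    variable name to a bool. This is a stand-in for an SMT solver call.
    """
    # Keep only the assignments (models) in which every premise holds.
    models = []
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises):
            models.append(env)

    if not models:
        return "InconsistentPremises"
    if all(hypothesis(m) for m in models):
        return "Entailment"      # hypothesis holds in every model
    if not any(hypothesis(m) for m in models):
        return "Contradiction"   # hypothesis holds in no model
    return "NotMentioned"        # premises leave the hypothesis open

# Hypothetical toy clause: "Information marked confidential must not be
# shared with third parties." Nothing is said about survival after termination.
variables = ["confidential", "may_share", "survives_termination"]
premises = [
    lambda e: (not e["confidential"]) or (not e["may_share"]),  # confidential -> not may_share
    lambda e: e["confidential"],                                # the information is confidential
]

print(classify(premises, lambda e: not e["may_share"], variables))      # Entailment
print(classify(premises, lambda e: e["may_share"], variables))          # Contradiction
print(classify(premises, lambda e: e["survives_termination"], variables))  # NotMentioned
```

The point of the sketch is the division of labor: in the solver-based paradigm the label is a by-product of checking every model, whereas in LLM-based formal reasoning the model only claims what such a check would return.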

Five different LLMs were evaluated, though the study does not name specific models. Adding formal structure improved accuracy, and LLM-based Formal Reasoning achieved the highest benchmark performance. However, the researchers caution that this gain does not imply faithful reasoning.

Key Failure Modes Identified

The study identifies three recurring failure modes that undermine trust in LLM-based formal reasoning:

Scope laundering – the LLM reports solver-inconsistent classifications without actually executing the underlying formal reasoning, producing conclusions that appear logically grounded but are not. This persists across all models tested.
Implicit constraint blindness – the model overlooks logical constraints present in the formal representations.
Program synthesis failures – the LLM generates incorrect Z3 code despite structured prompting.

According to the paper, scope laundering is especially concerning because it affects every model evaluated and "raises serious concerns about the faithfulness of LLM-based formal reasoning as a proxy for symbolic execution."
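One way to see why accuracy and faithfulness can come apart is to score them separately. The sketch below is an assumption about how such an audit could look, not the paper's procedure: the `Record` fields, the `audit` function, and the four toy records are all hypothetical, and the solver label is taken to be whatever executing the LLM's own generated program returns.

```python
from dataclasses import dataclass

@dataclass
class Record:
    example_id: str
    gold: str          # reference label from the annotated dataset
    llm_reported: str  # label the LLM says its formal program yields
    solver: str        # label from actually executing that program

def audit(records):
    """Score accuracy and faithfulness separately over a batch of records."""
    n = len(records)
    accurate = sum(r.llm_reported == r.gold for r in records)
    faithful = sum(r.llm_reported == r.solver for r in records)
    # Answers that match the gold label but are not what the program yields:
    # they look logically grounded, yet the formal reasoning does not support them.
    unsupported_correct = [
        r.example_id
        for r in records
        if r.llm_reported == r.gold and r.llm_reported != r.solver
    ]
    return {
        "accuracy": accurate / n,
        "faithfulness": faithful / n,
        "unsupported_correct": unsupported_correct,
    }

# Hypothetical toy data, invented purely to exercise the function.
records = [
    Record("c1", "Entailment", "Entailment", "Entailment"),
    Record("c2", "Entailment", "Entailment", "NotMentioned"),
    Record("c3", "Contradiction", "Contradiction", "Contradiction"),
    Record("c4", "NotMentioned", "Entailment", "NotMentioned"),
]

report = audit(records)
print(report["accuracy"])             # 0.75
print(report["faithfulness"])         # 0.5
print(report["unsupported_correct"])  # ['c2']
```

In this toy batch the LLM scores 75% on accuracy while only half of its reported labels agree with execution, which is the kind of gap a benchmark score alone would hide.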

The Re-Annotation Reveals a Pragmatic Gap

The team re-annotated a subset of ContractNLI, a dataset originally designed for legal entailment, to distinguish pragmatic legal interpretation from strict formal entailment. They found a systematic, measurable gap: a substantial proportion of legally sound inferences are not formally grounded unless additional unstated assumptions are supplied. This means that even when an LLM appears to reason correctly, it may be relying on pragmatic shortcuts rather than logical deduction.
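The gap between pragmatic and strict entailment can be illustrated with a small forward-chaining sketch. Everything in it is hypothetical: the contract wording, the atom names, and the bridging rule are invented for illustration and do not come from ContractNLI or the study.

```python
def derivable(facts, rules, goal):
    """Forward chaining over Horn rules given as (body_atoms, head_atom) pairs."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            # Fire a rule once all of its body atoms are already known.
            if head not in known and body <= known:
                known.add(head)
                changed = True
    return goal in known

# Hypothetical contract text: "The Receiving Party shall not disclose
# Confidential Information to any third party."
facts = {"is_confidential(pricing)"}
rules = [
    ({"is_confidential(pricing)"}, "no_disclosure_to_third_parties(pricing)"),
]

# Hypothesis a reader would accept: pricing may not be disclosed to competitors.
goal = "no_disclosure_to_competitors(pricing)"

# Unstated assumption the reader supplies: competitors are third parties.
bridge = (
    {"no_disclosure_to_third_parties(pricing)"},
    "no_disclosure_to_competitors(pricing)",
)

print(derivable(facts, rules, goal))             # False: not formally grounded
print(derivable(facts, rules + [bridge], goal))  # True: grounded once the assumption is explicit
```

The inference is reasonable to a lawyer either way; what changes is whether the formal representation contains the step that justifies it.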

Implications for Enterprise AI

For enterprise technology leaders evaluating AI for legal, compliance, or regulatory applications, these findings underscore the need for verification mechanisms beyond benchmark accuracy. The study demonstrates that adding formal structure (e.g., Z3 code generation) can improve performance metrics without guaranteeing correct logical reasoning. Scope laundering, in particular, is a hidden risk — the model can produce confidently wrong conclusions that pass surface-level inspection.

Organizations deploying LLMs for contract analysis, compliance checking, or legal reasoning should consider integrating symbolic solvers as a separate verification step rather than treating the LLM's own formal reasoning outputs as trustworthy. The researchers note that program synthesis failures occur even with structured prompting, so the generated Z3 code itself may be incorrect and also needs review.

As LLMs become integrated into supply chain contract management, trade documentation, and regulatory compliance workflows, the distinction between accuracy and faithfulness becomes critical. The study provides a framework for diagnosing these failure modes and calls for more rigorous evaluation before relying on LLMs for logically grounded decisions.
