Large language models (LLMs) are being deployed in clinical settings to answer questions from electronic health records (EHRs), but their reliability on multi-step reasoning is in question. A new arXiv study by Sanjay Basu, "Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering", reports empirical evidence that accuracy declines systematically as the number of reasoning steps increases, and that the decline can be predicted from that step count.
The study pre-specified a hop-count taxonomy that classifies clinical questions by the number of distinct reasoning steps needed to answer them from an EHR. It annotated 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluated 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and in cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, both zero-shot).
Monotone Accuracy Decline Across Models
All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), showed monotone accuracy decline with hop count:
- Claude Sonnet zero-shot fell from 30.6% at hop=1 to 17.6% at hop=4 (Cochran-Armitage z=-2.30, p=0.011; odds ratio per hop 0.72, 95% CI [0.56, 0.92], p=0.008).
- GPT-4o replicated the decline, falling from 37.8% at hop=1 to 14.7% at hop=4 (odds ratio per hop 0.58, 95% CI [0.45, 0.75], p<0.001).
- gpt-5.4-2026-03-05 showed the same pattern, falling from 37.8% at hop=1 to 23.5% at hop=4 (odds ratio per hop 0.80, 95% CI [0.66, 0.98], p=0.027).
| Model (zero-shot) | Accuracy at hop=1 | Accuracy at hop=4 | Odds ratio per hop [95% CI] | p-value |
|---|---|---|---|---|
| Claude Sonnet | 30.6% | 17.6% | 0.72 [0.56, 0.92] | 0.008 |
| GPT-4o | 37.8% | 14.7% | 0.58 [0.45, 0.75] | <0.001 |
| GPT-5.4 | 37.8% | 23.5% | 0.80 [0.66, 0.98] | 0.027 |
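The Cochran-Armitage statistic reported for Claude Sonnet tests for a linear trend in accuracy across ordered hop levels. The sketch below is a minimal standard-library version of that test, not the study's own code. The per-hop counts are hypothetical, since per-level counts are not given here, and the function names are illustrative. The last line only checks that the reported z=-2.30 corresponds to a one-sided normal tail of about 0.011.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def cochran_armitage_z(correct, total, scores=None):
    """Cochran-Armitage trend statistic for binomial counts in ordered groups.

    correct[i] / total[i] is the accuracy at ordered level i;
    scores default to 1, 2, ..., k (here: hop counts).
    """
    if scores is None:
        scores = list(range(1, len(total) + 1))
    n_all = sum(total)
    p_bar = sum(correct) / n_all  # pooled accuracy under the no-trend null
    # Score-weighted deviation of observed from expected correct counts.
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, correct, total))
    sum_ns = sum(n * s for s, n in zip(scores, total))
    sum_ns2 = sum(n * s * s for s, n in zip(scores, total))
    variance = p_bar * (1 - p_bar) * (sum_ns2 - sum_ns ** 2 / n_all)
    return t_stat / math.sqrt(variance)


# HYPOTHETICAL counts for hop levels 1..4 (not taken from the study).
hypothetical_correct = [30, 25, 20, 15]
hypothetical_total = [100, 100, 100, 100]

z = cochran_armitage_z(hypothetical_correct, hypothetical_total)
print(f"hypothetical z = {z:.2f}, one-sided p = {normal_cdf(z):.4f}")

# Sanity check on the reported statistic: lower tail of z = -2.30.
print(f"one-sided p for z = -2.30: {normal_cdf(-2.30):.3f}")
```

A negative z indicates accuracy falling as hop count rises. The one-sided lower tail is used here because it matches the reported p=0.011 for z=-2.30.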
Reasoning Difficulty, Not Data Truncation
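A pre-specified context-sufficiency audit showed that higher-hop questions were not differentially disadvantaged by EHR truncation: answerability ranged from 93% to 95% at hops 2-4 versus 79% at hop=1. Because the required context was available more often for the harder questions than for the easiest ones, truncation would, if anything, have worked against hop=1. This supports the interpretation that the accuracy decline reflects compositional reasoning difficulty rather than missing context.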
Extended Thinking Does Not Flatten the Curve
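Extended thinking, in which the model generates intermediate reasoning before giving its answer, did not significantly flatten the accuracy-depth curve across three reasoning conditions. At the same time, thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement for a question that needs k hops. The model spent more reasoning effort on deeper questions, but that effort did not remove the accuracy penalty.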
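As a rough plausibility check on the reported correlation, the sketch below converts r into a t statistic and an approximate p-value. It assumes one observation per evaluated question (n=301), which is not stated for this analysis, and it uses a normal approximation in place of the exact t distribution. The function names are illustrative.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for testing a Pearson correlation against zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def two_sided_normal_p(stat: float) -> float:
    """Two-sided p-value using a normal approximation (reasonable for large df)."""
    return math.erfc(abs(stat) / math.sqrt(2))


r = 0.31          # reported correlation between thinking tokens and hop count
n_assumed = 301   # ASSUMPTION: one observation per evaluated question

t_stat = correlation_t_statistic(r, n_assumed)
p_approx = two_sided_normal_p(t_stat)
variance_explained = r ** 2

print(f"t = {t_stat:.2f}, approx p = {p_approx:.1e}, r^2 = {variance_explained:.3f}")
```

Under that assumption the result is comfortably below p=0.0001, in line with the reported value. The squared correlation is about 0.10, so in a linear reading hop count accounts for roughly a tenth of the variation in thinking-token usage: a clear but modest relationship.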
Implications for Enterprise AI Deployment
For enterprise technology decision-makers evaluating LLMs for complex document analysis, compliance checks, or multi-step workflows, the study offers a theory-motivated, cross-architecture predictor of error. Hop count can serve as a deployment risk stratification tool: in the three models tested, each additional inferential step lowered the odds of a correct answer, across both providers and both OpenAI generations. The finding is consistent with a limit on transformer compositionality that extended thinking did not overcome in this study.
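One way to read the per-hop odds ratios as a risk stratification aid is to project accuracy forward from the hop=1 figure. The sketch below does this under a simplifying assumption that is not the study's procedure: it anchors on the observed hop=1 accuracy and applies a constant odds ratio per additional hop. The function and variable names are illustrative.

```python
def project_accuracy(base_accuracy: float, odds_ratio_per_hop: float, extra_hops: int) -> float:
    """Project accuracy after `extra_hops` more hops under a constant odds ratio."""
    base_odds = base_accuracy / (1 - base_accuracy)
    projected_odds = base_odds * odds_ratio_per_hop ** extra_hops
    return projected_odds / (1 + projected_odds)


# Hop=1 accuracy and per-hop odds ratio as reported in the article.
reported = {
    "Claude Sonnet (zero-shot)": (0.306, 0.72),
    "GPT-4o": (0.378, 0.58),
    "GPT-5.4": (0.378, 0.80),
}

projections = {}
for model, (hop1_accuracy, odds_ratio) in reported.items():
    # Accuracy projected for hop levels 1 through 4.
    projections[model] = [project_accuracy(hop1_accuracy, odds_ratio, k) for k in range(4)]
    formatted = ", ".join(f"{p:.1%}" for p in projections[model])
    print(f"{model}: {formatted}")
```

The hop=4 projections come out near 14%, 11%, and 24%, against observed values of 17.6%, 14.7%, and 23.5%. The gaps are expected, because a fitted logistic model estimates its own intercept rather than passing exactly through the hop=1 accuracy, so these figures are a rough guide to relative risk, not a reproduction of the study's estimates.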
The study is available on arXiv under a CC BY 4.0 license.