Study: LLM Accuracy Declines Predictably as Reasoning Steps Increase in Clinical AI Tasks

A study on arXiv introduces a hop-count taxonomy to predict LLM failure on clinical question answering. Tests across Claude and GPT models show monotone accuracy decline with reasoning depth, with extended thinking failing to flatten the curve.

iGEN Editorial

June 16, 2026

Large language models (LLMs) are being deployed in clinical settings to answer questions from electronic health records (EHRs), but their reliability on multi-step reasoning is in question. A new study on arXiv, "Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering" by Sanjay Basu, provides empirical evidence that accuracy declines systematically, and predictably, as the number of reasoning steps increases.

The researchers pre-specified a hop-count taxonomy classifying the number of distinct reasoning steps required to answer a clinical question from an EHR. They annotated 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluated 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot).

Monotone Accuracy Decline Across Models

All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), showed monotone accuracy decline with hop count:

Claude Sonnet (claude-sonnet-4-6, zero-shot) fell from 30.6% at hop=1 to 17.6% at hop=4 (Cochran-Armitage z=-2.30, p=0.011; odds ratio per hop 0.72, 95% CI [0.56,0.92], p=0.008).
GPT-4o replicated the decline, from 37.8% at hop=1 to 14.7% at hop=4 (OR 0.58, 95% CI [0.45,0.75], p<0.001).
gpt-5.4-2026-03-05 showed the same pattern, from 37.8% at hop=1 to 23.5% at hop=4 (OR 0.80, 95% CI [0.66,0.98], p=0.027).
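
For readers unfamiliar with the Cochran-Armitage trend test cited above, the following is a minimal sketch of how such a z statistic can be computed from per-hop counts. The counts in the example are hypothetical placeholders (the article does not report per-hop sample sizes), and the function names are illustrative, not taken from the study.

```python
import math


def normal_cdf(z):
    """Standard normal lower-tail probability."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def cochran_armitage_z(correct, totals, scores):
    """Cochran-Armitage trend statistic for binomial proportions.

    correct[i] and totals[i] are the correct answers and questions at
    score level scores[i] (here, the hop count).
    """
    n = sum(totals)
    p_bar = sum(correct) / n  # pooled accuracy under the null hypothesis
    # Score-weighted deviation of observed from expected correct answers.
    t_stat = sum(s * (x - m * p_bar) for s, x, m in zip(scores, correct, totals))
    s1 = sum(m * s for s, m in zip(scores, totals))
    s2 = sum(m * s * s for s, m in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (s2 - s1 * s1 / n)
    return t_stat / math.sqrt(variance)


# HYPOTHETICAL counts for hop levels 1-4; these are not the study's data.
hops = [1, 2, 3, 4]
totals = [100, 80, 70, 51]
correct = [31, 20, 15, 9]

z = cochran_armitage_z(correct, totals, hops)
print(f"z = {z:.2f}, one-sided p (declining trend) = {normal_cdf(z):.4f}")
```

A negative z indicates accuracy falling as hop count rises. As a side check, the lower-tail normal probability at z=-2.30 is about 0.011, so the p-value reported for Claude Sonnet appears consistent with a one-sided test, although the article does not say so explicitly.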

Model	Hop=1 Accuracy	Hop=4 Accuracy	Odds Ratio per Hop (95% CI)	p-value
Claude Sonnet (zero-shot)	30.6%	17.6%	0.72 (0.56-0.92)	0.008
GPT-4o	37.8%	14.7%	0.58 (0.45-0.75)	<0.001
GPT-5.4	37.8%	23.5%	0.80 (0.66-0.98)	0.027
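
To make the "odds ratio per hop" figures concrete, here is a rough sketch of what a constant per-hop odds ratio implies between hop=1 and hop=4. It assumes the odds at hop=1 equal the observed hop=1 accuracy, which is a simplification: the study's fitted model uses all four hop levels and its intercept is not given in this article, so the projections need not match the observed hop=4 values.

```python
def project_accuracy(p_hop1, or_per_hop, hop):
    """Accuracy implied at `hop` if the odds shrink by `or_per_hop` per extra hop.

    Assumption: the curve is anchored on the observed hop=1 accuracy.
    """
    odds_hop1 = p_hop1 / (1 - p_hop1)
    odds = odds_hop1 * or_per_hop ** (hop - 1)
    return odds / (1 + odds)


# (hop=1 accuracy, odds ratio per hop, observed hop=4 accuracy) from the table above.
reported = {
    "Claude Sonnet": (0.306, 0.72, 0.176),
    "GPT-4o": (0.378, 0.58, 0.147),
    "GPT-5.4": (0.378, 0.80, 0.235),
}

for model, (p1, odds_ratio, observed_hop4) in reported.items():
    projected = project_accuracy(p1, odds_ratio, 4)
    print(f"{model}: projected hop=4 {projected:.1%}, observed {observed_hop4:.1%}")
```

Under this anchoring, Claude Sonnet's projected hop=4 accuracy is roughly 14%, somewhat below the observed 17.6%. The gap is expected, since a single fitted slope summarizes the whole curve rather than passing exactly through its endpoints.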

Reasoning Difficulty, Not Data Truncation

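A pre-specified context-sufficiency audit showed that higher-hop questions were not differentially disadvantaged by EHR truncation: answerability was 93-95% at hops 2-4 versus 79% at hop=1. In other words, the questions on which models performed worst were the ones most likely to have sufficient context available. This supports the interpretation that the accuracy decline reflects compositional reasoning difficulty rather than missing data.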

Extended Thinking Does Not Flatten the Curve

Extended thinking, in which the model generates intermediate reasoning before giving its answer, did not significantly flatten the accuracy-depth curve across three reasoning conditions. Moreover, thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement, where k is the number of hops.
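
As an illustration of the kind of correlation reported above, the sketch below computes a Pearson r between hop count and thinking-token usage. The token counts are hypothetical values invented for the example, not the study's data, and the article does not state which correlation coefficient the study used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL per-question data: hop count and thinking tokens used.
hops = [1, 1, 2, 2, 3, 3, 4, 4]
thinking_tokens = [410, 950, 620, 1300, 700, 1800, 1500, 900]

print(f"r = {pearson_r(hops, thinking_tokens):.2f}")
```

A positive but modest r, like the study's 0.31, means token usage tends to rise with hop count while varying widely between questions at the same depth.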

Implications for Enterprise AI Deployment

For enterprise technology decision-makers evaluating LLMs for complex document analysis, compliance checks, or multi-step workflows, the study offers a theory-motivated, cross-architecture predictor of error. Hop count can serve as a deployment risk stratification tool: questions requiring more inferential steps were more likely to produce errors in all three models tested, across both providers and both OpenAI generations. The finding is consistent with a limit on transformer compositionality that extended thinking did not overcome in these experiments.
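
One way to picture hop count as a risk stratification tool is a simple routing rule, sketched below. The tier names, the `review_from_hop` threshold, and the `RoutingDecision` type are all hypothetical choices made for illustration; the study does not prescribe thresholds, and any real cutoff would need validation on the deployment's own data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    hop_count: int
    tier: str
    needs_human_review: bool


def stratify(hop_count, review_from_hop=3):
    """Map an annotated hop count to a risk tier.

    review_from_hop is a HYPOTHETICAL threshold, not a value from the study.
    """
    if hop_count < 1:
        raise ValueError("hop_count must be at least 1")
    if hop_count >= review_from_hop:
        tier = "high"
    elif hop_count == 1:
        tier = "low"
    else:
        tier = "medium"
    return RoutingDecision(hop_count, tier, hop_count >= review_from_hop)


for hops in (1, 2, 3, 4):
    decision = stratify(hops)
    print(hops, decision.tier, decision.needs_human_review)
```

The sketch assumes each question already carries a hop-count annotation; how to assign that label at scale is outside what this article covers.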

The study is available on arXiv under a CC BY 4.0 license.
