EHRNote-ChatQA: New Benchmark Tests LLMs on Multi-Turn Clinical Question Answering

Researchers introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over multiple discharge summaries. Built from MIMIC-IV data, it contains 967 patient-level samples and 16,072 QA pairs, revealing that LLMs struggle more with evidence grounding than content answering and that multi-turn errors compound.

iGEN Editorial

June 16, 2026

Medical experts reviewing discharge summaries must iteratively synthesize information across multiple documents while verifying the evidence that supports each answer. Large language models (LLMs) are increasingly explored for clinical question answering, but existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn QA with limited evaluation of evidence grounding. According to a paper published on arXiv, researchers from multiple institutions have introduced EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over patients' multiple discharge summaries.

Benchmark Construction

The benchmark was built from de-identified MIMIC-IV discharge summaries and contains 967 patient-level multi-turn samples, each spanning one to five notes. These samples include 16,072 medical-expert-verified QA pairs across eight clinical categories: 8,036 content questions, each paired with an evidence-grounding question (8,036 + 8,036 = 16,072). The construction followed an expert-informed pipeline combining a discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation. All QA samples were then reviewed and revised by medical experts, 11 in total.
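To make the sample structure concrete, here is a minimal Python sketch of how one patient-level sample might be represented. The article does not describe the released file format, so the class names (`QATurn`, `PatientSample`), the field names, and the helper `count_qa_pairs` are all hypothetical. Only the constraints come from the text above: one to five notes per patient, and each content question paired with one evidence-grounding question.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QATurn:
    """One turn: a content question paired with an evidence-grounding question.

    Field names are illustrative assumptions, not the benchmark's schema.
    """
    category: str           # one of the eight clinical categories (names not given in the article)
    content_question: str
    content_answer: str
    evidence_question: str
    evidence_answer: str    # e.g. the note passage that supports the content answer


@dataclass
class PatientSample:
    """A patient-level multi-turn sample over one to five discharge summaries."""
    patient_id: str
    notes: List[str]
    turns: List[QATurn] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The article states that samples span one to five notes.
        if not 1 <= len(self.notes) <= 5:
            raise ValueError("expected one to five discharge summaries")


def count_qa_pairs(samples: List[PatientSample]) -> int:
    """Each turn contributes two QA pairs: one content, one evidence-grounding."""
    return 2 * sum(len(sample.turns) for sample in samples)
```

Under this layout, 8,036 paired turns across all samples would yield the reported total of 16,072 QA pairs.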

Key Findings from Benchmarking LLMs

The paper reports benchmarking 22 open- and closed-source LLMs, revealing several challenges:

LLMs struggle more with evidence grounding than with content answering.
Multi-turn errors compound across turns.
Single-turn clinical QA performance does not reliably transfer to this multi-turn, evidence-grounded setting.
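One way to see why multi-turn errors matter is to look at accuracy per turn position and at the chance of getting an entire conversation right. The sketch below is illustrative only: the article does not specify the paper's metrics, and the function names (`accuracy_by_turn`, `prob_all_correct`) and the independence assumption in the second function are assumptions made for this example.

```python
from typing import List, Sequence


def accuracy_by_turn(conversations: Sequence[Sequence[bool]]) -> List[float]:
    """Fraction of correct answers at each turn index.

    Each conversation is a sequence of per-turn correctness flags.
    Conversations may differ in length; a turn index is averaged only
    over the conversations that reach it.
    """
    max_turns = max((len(c) for c in conversations), default=0)
    result = []
    for t in range(max_turns):
        outcomes = [c[t] for c in conversations if len(c) > t]
        result.append(sum(outcomes) / len(outcomes))
    return result


def prob_all_correct(per_turn_accuracy: float, turns: int) -> float:
    """Probability that every turn is correct, ASSUMING independent turns
    with equal accuracy. Real conversations are unlikely to be independent;
    this is only a simplified baseline for intuition."""
    return per_turn_accuracy ** turns
```

Under the simplified independence assumption, a model that is 90% accurate on each turn answers all five turns of a conversation correctly only about 59% of the time (`prob_all_correct(0.9, 5)`). A per-turn accuracy curve that declines with turn index would suggest errors that go beyond this baseline, for example when a wrong earlier answer misleads later turns.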

The authors state that these findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be released through PhysioNet under credentialed access.

Implications for Healthcare AI

For enterprise technology decision-makers in healthcare, EHRNote-ChatQA underscores critical gaps in current LLM capabilities. The benchmark's focus on longitudinal discharge summaries mirrors real-world clinical workflows, where accuracy and evidence provenance are paramount. The demonstrated difficulty with evidence grounding and error compounding suggests that healthcare organizations should carefully validate LLMs before deployment in clinical settings. The benchmark provides a standardized way to compare models and track improvements, aiding procurement decisions.

Component	Count
Patient-level multi-turn samples	967
Total QA pairs	16,072
Content questions	8,036
Evidence-grounding questions (paired)	8,036
Clinical categories	8
LLMs benchmarked	22
Medical expert reviewers	11

The researchers hope this benchmark will drive future work on evidence-grounded, multi-turn reasoning in clinical NLP.
