Medical experts reviewing discharge summaries must iteratively synthesize information across multiple documents while verifying the evidence supporting each answer. Large language models (LLMs) are increasingly explored for clinical question answering, but existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn QA with limited evaluation of evidence grounding. In a paper posted to arXiv, researchers from multiple institutions introduce EHRNote-ChatQA, which they describe as the first benchmark for evidence-grounded multi-turn clinical question answering over patients' multiple discharge summaries.
Benchmark Construction
The benchmark was built from de-identified MIMIC-IV discharge summaries. It contains 967 patient-level multi-turn samples, each spanning one to five notes, with 16,072 medical-expert-verified QA pairs across eight clinical categories. These comprise 8,036 content questions, each paired with one evidence-grounding question (8,036 × 2 = 16,072). Construction followed an expert-informed pipeline that combined a discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation. A team of 11 medical experts reviewed and revised every QA sample.
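To make the pairing concrete, here is a minimal Python sketch of how one patient-level sample could be represented. The class and field names (`QAPair`, `Turn`, `PatientSample`, `category`) are hypothetical and are not the dataset's released schema, which the source does not describe; only the one-to-five note range and the counts come from the paper's summary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class Turn:
    # Hypothetical layout: each turn holds one content question and the
    # evidence-grounding question paired with it.
    category: str  # one of the eight clinical categories (names not given in the source)
    content: QAPair
    evidence: QAPair


@dataclass
class PatientSample:
    patient_id: str
    notes: List[str]  # discharge summaries for one patient
    turns: List[Turn] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The article states that samples span one to five notes.
        if not 1 <= len(self.notes) <= 5:
            raise ValueError("a sample must span one to five notes")

    def qa_pair_count(self) -> int:
        # Every turn contributes a content pair and an evidence pair.
        return 2 * len(self.turns)


# Reported totals: each content question has exactly one paired
# evidence-grounding question, so the overall count is doubled.
CONTENT_QUESTIONS = 8036
TOTAL_QA_PAIRS = 16072
assert 2 * CONTENT_QUESTIONS == TOTAL_QA_PAIRS
```

Keeping the content and evidence questions in the same turn object is one possible design choice; it makes it hard to score an answer without also scoring its grounding.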
Key Findings from Benchmarking LLMs
The paper reports benchmarking 22 open- and closed-source LLMs, revealing several challenges:
- LLMs struggle more with evidence grounding than with content answering.
- Multi-turn errors compound across turns.
- Single-turn clinical QA performance does not reliably transfer to this multi-turn, evidence-grounded setting.
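One way to look for the first two patterns in a model's outputs is to report content and evidence accuracy separately for each turn position. The sketch below assumes each answer has already been judged correct or incorrect; the function name, the input layout, and the toy flags are illustrative assumptions, not the paper's evaluation protocol or its results.

```python
from typing import Dict, List, Tuple

# One conversation = a list of (content_correct, evidence_correct) flags,
# one tuple per turn, in turn order.
Conversation = List[Tuple[bool, bool]]


def accuracy_by_turn(
    conversations: List[Conversation],
) -> Dict[int, Tuple[float, float]]:
    """Return {turn_index: (content_accuracy, evidence_accuracy)}."""
    totals: Dict[int, Tuple[int, int, int]] = {}
    for conv in conversations:
        for turn_idx, (content_ok, evidence_ok) in enumerate(conv, start=1):
            c, e, n = totals.get(turn_idx, (0, 0, 0))
            totals[turn_idx] = (c + content_ok, e + evidence_ok, n + 1)
    return {t: (c / n, e / n) for t, (c, e, n) in sorted(totals.items())}


# Made-up flags purely to exercise the function; not data from the paper.
toy = [
    [(True, True), (True, False), (False, False)],
    [(True, True), (True, True), (True, False)],
]
result = accuracy_by_turn(toy)
# result == {1: (1.0, 1.0), 2: (1.0, 0.5), 3: (0.5, 0.0)}
```

If evidence accuracy sits below content accuracy at a given turn, or both decline at later turns, that would be consistent with the patterns the paper reports. A real evaluation would also need to define what counts as a correct evidence citation.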
The authors state that these findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. They plan to release the dataset through PhysioNet under credentialed access.
Implications for Healthcare AI
For enterprise technology decision-makers in healthcare, EHRNote-ChatQA underscores critical gaps in current LLM capabilities. The benchmark's focus on longitudinal discharge summaries mirrors real-world clinical workflows, where accuracy and evidence provenance are paramount. The demonstrated difficulty with evidence grounding and error compounding suggests that healthcare organizations should carefully validate LLMs before deployment in clinical settings. The benchmark provides a standardized way to compare models and track improvements, aiding procurement decisions.
The benchmark's reported figures are summarized below.
| Component | Count |
|---|---|
| Patient-level multi-turn samples | 967 |
| Total QA pairs | 16,072 |
| Content questions | 8,036 |
| Evidence-grounding questions (paired) | 8,036 |
| Clinical categories | 8 |
| LLMs benchmarked | 22 |
| Medical expert reviewers | 11 |
The researchers hope this benchmark will drive future work on evidence-grounded, multi-turn reasoning in clinical NLP.