Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse? A New Study Evaluates Four Models

A study examined whether instruction-tuned large language models (LLMs) can reliably perform token-level classification of Correct Information Units (CIUs) from aphasic discourse transcripts. Four models—Llama-3.1-8B, Qwen2.5-7B, Mistral-7B, and Phi-3-mini—were tested under zero-shot and few-shot prompting conditions. Results showed that few-shot prompting yielded competitive mean F1 scores between 0.776 and 0.817 for three models, but zero-shot was insufficient and Phi-3-mini was unstable. The authors recommend a human-in-the-loop approach for automated CIU scoring.

iGEN Editorial

June 16, 2026

Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone, according to a study published on arXiv. However, CIU scoring is time-intensive and requires trained raters. The study, by Jason M. Pittman, Yesenia Medina-Santos, Anton Phillips Jr., and Brielle C. Stark, examined whether instruction-tuned large language models (LLMs) can reliably perform token-level CIU classification from aphasic discourse transcripts.

Methodology

The researchers used sixteen picture-description transcripts elicited with the Cat Rescue stimulus and annotated for CIU status according to Nicholas and Brookshire (1993). The sample spanned four severity strata: control, mild, moderate, and severe aphasia. Four publicly available instruction-tuned LLMs were benchmarked: Llama-3.1-8B, Qwen2.5-7B, Mistral-7B, and Phi-3-mini. Each model was tested under one zero-shot and two few-shot prompting conditions across five stratified random seeds. Performance was evaluated against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa.
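To make the evaluation metrics concrete, here is a minimal Python sketch of how token-level predictions could be scored against human labels. It is an illustration only, not the authors' evaluation code: the function name `ciu_metrics`, the 1/0 encoding (1 = CIU, 0 = non-CIU), and the toy label sequences are all assumptions made for this example.

```python
from collections import Counter


def ciu_metrics(gold, pred):
    """Score binary token labels (1 = CIU, 0 = non-CIU) against gold labels.

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")

    # Count each (gold, predicted) pair to build the 2x2 confusion matrix.
    counts = Counter(zip(gold, pred))
    tp, fp = counts[(1, 1)], counts[(0, 1)]
    fn, tn = counts[(1, 0)], counts[(0, 0)]
    n = len(gold)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance agreement comes from each rater's marginal label rates.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Toy example (made-up labels): the "model" over-labels two tokens as CIUs.
gold = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 1, 1, 1, 0, 1, 1, 0]
metrics = ciu_metrics(gold, pred)
```

In this toy case F1 looks moderate while kappa is low, which shows why a chance-corrected agreement measure is worth reporting alongside F1 when one label dominates.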

Results

Zero-shot prompting was insufficient across all models, according to the study. In contrast, few-shot prompting yielded substantial gains and produced competitive performance for three viable models. The following table summarizes the mean few-shot F1 scores:

Model	Mean Few-Shot F1 (range across the three viable models)	Notes
Llama-3.1-8B	0.776–0.817	Viable, high recall, lower precision
Qwen2.5-7B	0.776–0.817	Viable, high recall, lower precision
Mistral-7B	0.776–0.817	Viable, high recall, lower precision
Phi-3-mini	Unstable	Did not yield reliable performance

No significant differences were found between the two few-shot conditions, fixed global and per-chunk local example selection. Viable models showed high recall but lower precision, suggesting systematic over-classification of tokens as CIUs. Performance also varied by discourse severity, with the weakest results in more severe aphasia.
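A short worked example shows how over-classification produces this pattern. The counts are hypothetical and chosen only for illustration; they are not taken from the study. Suppose a transcript contains 60 tokens that human raters marked as CIUs, and a model labels 80 tokens as CIUs, 57 of them correctly. That gives 57 true positives (TP), 23 false positives (FP), and 3 false negatives (FN):

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP} = \frac{57}{57 + 23} \approx 0.713 \\
\text{Recall}    &= \frac{TP}{TP + FN} = \frac{57}{57 + 3} = 0.950 \\
F_1              &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{114}{140} \approx 0.814
\end{align*}
```

Under these assumed counts, the model misses very few true CIUs, yet nearly three in ten of the tokens it flags would be rejected by a human rater, which is the kind of error a reviewer in the loop would need to correct.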

Implications for Automated Discourse Assessment

The study found that few-shot LLM prompting can support automated CIU identification without gradient-based task training, but agreement with human annotation remains insufficient for fully autonomous use. The authors state: "These findings support LLM-based CIU scoring as a promising human-in-the-loop component of discourse assessment systems." For enterprise technology decision-makers in healthcare and clinical applications, this evaluation underscores the need for careful validation and integration strategies when deploying LLMs in diagnostic workflows.

The research highlights that while LLMs can reduce rater workload, they are not yet a replacement for human judgment, especially in cases of severe aphasia. The results also caution against relying on zero-shot performance and emphasize the importance of few-shot examples tailored to the target population. Future work may investigate larger models or fine-tuning approaches to improve reliability.
