Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone. However, CIU scoring is time-intensive and requires trained raters. A study published on arXiv by Jason M. Pittman, Yesenia Medina-Santos, Anton Phillips Jr., and Brielle C. Stark examined whether instruction-tuned large language models (LLMs) can reliably perform token-level CIU classification from aphasic discourse transcripts.
Methodology
The researchers used sixteen picture-description transcripts elicited with the Cat Rescue stimulus, annotated for CIU status according to Nicholas and Brookshire (1993). The sample spanned four severity strata: control, mild, moderate, and severe aphasia. Four publicly available instruction-tuned LLMs were benchmarked: Llama-3.1-8B, Qwen2.5-7B, Mistral-7B, and Phi-3-mini. They were tested under zero-shot and two few-shot prompting conditions across five stratified random seeds. Performance was evaluated against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa.
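The agreement metrics listed above can be computed directly from paired token labels. The following is a minimal sketch, assuming binary labels (1 for CIU, 0 for non-CIU) that are already aligned token by token. The function name `ciu_metrics` and the toy labels are illustrative assumptions and are not taken from the study.

```python
from typing import Dict, Sequence


def ciu_metrics(human: Sequence[int], model: Sequence[int]) -> Dict[str, float]:
    """Token-level agreement between human and model labels (1 = CIU, 0 = not CIU)."""
    if len(human) != len(model) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and the same length")

    # Confusion-matrix counts, treating the human labels as the reference.
    tp = sum(1 for h, m in zip(human, model) if h == 1 and m == 1)
    fp = sum(1 for h, m in zip(human, model) if h == 0 and m == 1)
    fn = sum(1 for h, m in zip(human, model) if h == 1 and m == 0)
    tn = sum(1 for h, m in zip(human, model) if h == 0 and m == 0)
    n = tp + fp + fn + tn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_both_ciu = ((tp + fn) / n) * ((tp + fp) / n)
    p_both_not = ((tn + fp) / n) * ((tn + fn) / n)
    p_chance = p_both_ciu + p_both_not
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Toy labels (hypothetical): the model marks two non-CIU tokens as CIUs.
human_labels = [1, 1, 1, 0, 0, 1, 0, 1]
model_labels = [1, 1, 1, 1, 0, 1, 1, 1]
scores = ciu_metrics(human_labels, model_labels)
```

In this toy case recall is 1.0 while precision is about 0.71, giving an F1 of about 0.83 but a kappa of only about 0.38. This illustrates why a chance-corrected statistic is worth reporting alongside F1 when a classifier over-labels tokens as CIUs.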
Results
Zero-shot prompting was insufficient across all models, according to the study. In contrast, few-shot prompting yielded substantial gains and produced competitive performance for three viable models. The following table summarizes the mean few-shot F1 scores; the 0.776–0.817 range is listed identically for all three viable models rather than as a separate value per model:
| Model | Mean Few-Shot F1 | Notes |
|---|---|---|
| Llama-3.1-8B | 0.776–0.817 | Viable, high recall, lower precision |
| Qwen2.5-7B | 0.776–0.817 | Viable, high recall, lower precision |
| Mistral-7B | 0.776–0.817 | Viable, high recall, lower precision |
| Phi-3-mini | Unstable | Did not yield reliable performance |
No significant differences were found between fixed global and per-chunk local example selection. Viable models showed high recall but lower precision, suggesting systematic over-classification of tokens as CIUs. Performance also varied by discourse severity, with the weakest results in more severe aphasia.
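The difference between fixed global and per-chunk local example selection can be sketched in a few lines. This summary does not describe how the authors chose their examples, so everything below is an assumption made for illustration: the helper names, the word-overlap similarity used for local selection, the prompt wording, and the toy transcripts and labels.

```python
from typing import List, Tuple

# (transcript chunk, token labels written as text); toy values, not real annotations.
Example = Tuple[str, str]


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two chunks."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def select_global(pool: List[Example], k: int) -> List[Example]:
    """Fixed global selection: the same k examples are reused for every chunk."""
    return pool[:k]


def select_local(pool: List[Example], chunk: str, k: int) -> List[Example]:
    """Per-chunk local selection: the k pool examples most similar to this chunk."""
    ranked = sorted(pool, key=lambda ex: token_overlap(ex[0], chunk), reverse=True)
    return ranked[:k]


def build_prompt(examples: List[Example], chunk: str) -> str:
    """Assemble a few-shot prompt: instruction, labeled examples, then the target."""
    parts = ["Label each token as CIU (1) or not CIU (0)."]
    for text, labels in examples:
        parts.append(f"Transcript: {text}\nLabels: {labels}")
    parts.append(f"Transcript: {chunk}\nLabels:")
    return "\n\n".join(parts)


pool: List[Example] = [
    ("the um cat is in the tree", "1 0 1 1 1 1 1"),
    ("the man climbs the ladder", "1 1 1 1 1"),
    ("a dog is barking at the man", "1 1 1 1 1 1 1"),
]
chunk = "the dog uh barking at tree"

global_prompt = build_prompt(select_global(pool, 2), chunk)
local_prompt = build_prompt(select_local(pool, chunk, 2), chunk)
```

Here the global condition always uses the first two pool entries, while the local condition picks the dog and cat examples because they share the most words with the target chunk. Any other similarity measure could be substituted in `select_local` without changing the rest of the sketch.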
Implications for Automated Discourse Assessment
The study found that few-shot LLM prompting can support automated CIU identification without gradient-based task training, but agreement with human annotation remains insufficient for fully autonomous use. The authors state: "These findings support LLM-based CIU scoring as a promising human-in-the-loop component of discourse assessment systems." For enterprise technology decision-makers in healthcare and clinical applications, this evaluation underscores the need for careful validation and integration strategies when deploying LLMs in diagnostic workflows.
The research highlights that while LLMs can reduce rater workload, they are not yet a replacement for human judgment, especially in cases of severe aphasia. The results also caution against relying on zero-shot performance and emphasize the importance of few-shot examples tailored to the target population. Future work may investigate larger models or fine-tuning approaches to improve reliability.