Qualitative coding of academic texts is a cornerstone of social science, but expert annotation is notoriously difficult to scale. A team of researchers has demonstrated that frontier large language models (LLMs) can achieve high reliability on a challenging interpretive task: detecting whether authors of Bayesian cognitive science papers adopt a realist or instrumentalist stance toward Bayesian models. The study, posted on arXiv by Eyup Engin Kucuk, Tarik Kelestemur, and Ömer Dağlar Tanrikulu, presents an expert-led framework that combines theory-driven codebooks with diagnostic-gated prompt optimization.
The work addresses a nuanced construct. According to the study, realism treats Bayesian models as descriptions of mental and neural mechanisms, while instrumentalism views them as useful mathematical tools. The researchers built a codebook and secured expert-coded reference annotations, then ran a diagnostic-gated prompt-optimization search that yielded a single zero-shot prompt for three frontier LLMs: GPT-5.1 (OpenAI), Claude Sonnet 4.6 (Anthropic), and Gemini 3 Pro Preview (Google).
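One way to picture a diagnostic-gated search is as a loop in which a candidate prompt is only eligible if it passes every diagnostic check, and the eligible candidate with the best reliability wins. The sketch below is an illustration under that assumption, not the authors' procedure: the function `gated_search`, the candidate names, the scores, and the single diagnostic are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def gated_search(
    candidates: Iterable[str],
    reliability: Callable[[str], float],
    diagnostics: Iterable[Callable[[str], bool]],
) -> Tuple[Optional[str], float]:
    """Return the best-scoring candidate among those passing all diagnostics.

    Hypothetical helper: 'gating' is interpreted here as a hard filter that
    is applied before reliability scores are compared.
    """
    checks = list(diagnostics)
    best_prompt, best_score = None, float("-inf")
    for prompt in candidates:
        # Gate: a candidate failing any diagnostic is never considered.
        if not all(check(prompt) for check in checks):
            continue
        score = reliability(prompt)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score


# Toy stand-ins (assumptions): in practice the reliability function would
# compare LLM outputs with expert-coded reference annotations.
toy_scores = {"prompt_a": 0.61, "prompt_b": 0.72, "prompt_c": 0.68}
fails_diagnostic = {"prompt_b"}  # pretend prompt_b fails one check

winner, score = gated_search(
    toy_scores,
    reliability=toy_scores.__getitem__,
    diagnostics=[lambda p: p not in fails_diagnostic],
)
print(winner, score)
```

In this toy run the highest-scoring candidate is rejected by the gate, so a lower-scoring but diagnostically clean prompt is selected.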
Methodology: Combining Expert Codes with LLM Prompts
The method involved a held-out test set to validate the prompt. The final prompt achieved a held-out combined reliability score of 0.76, computed as the harmonic mean of ICC (intraclass correlation coefficient) of 0.79 and Cronbach's alpha of 0.74. All diagnostic checks were satisfied, indicating the prompt produced consistent and interpretable outputs. The researchers then deployed the prompt on a corpus of 6,858 quotes drawn from 210 articles in Bayesian cognitive science.
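The combined score can be checked directly from the two reported coefficients. The short sketch below recomputes it as a harmonic mean; the function name `combined_reliability` is an illustrative choice, and the quote-level inputs are the ICC and alpha values reported for the full corpus.

```python
def combined_reliability(icc: float, alpha: float) -> float:
    """Harmonic mean of two positive reliability coefficients."""
    if icc <= 0 or alpha <= 0:
        raise ValueError("harmonic mean is only defined here for positive inputs")
    return 2 * icc * alpha / (icc + alpha)


# Held-out figures reported in the article.
held_out = combined_reliability(0.79, 0.74)
# Quote-level figures reported for the 6,858-quote corpus.
quote_level = combined_reliability(0.80, 0.76)

print(round(held_out, 2), round(quote_level, 2))  # 0.76 0.78
```

Because the harmonic mean is pulled toward the smaller input, a prompt cannot reach a high combined score by doing well on only one of the two coefficients.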
Results: High Reliability and Domain Insights
The three LLMs reached substantial quote-level agreement: ICC = 0.80 and alpha = 0.76, yielding a combined quote-level score of 0.78. At the article level, rank stability was near-perfect, with Spearman correlations of r = 0.96 to 0.97 across rater pairs. The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single stance band, while 59.5% spanned four or more bands.
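The agreement statistics above can be illustrated on a small example. The sketch below assumes that alpha refers to the Cronbach's alpha named for the held-out set, treats each LLM as one "item", and computes article-level rank stability as Spearman correlations between per-rater article means. All data, the 0 to 100 scale, and the helper names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean, variance

# Hypothetical toy data: (article id, scores from three raters) per quote.
QUOTES = [
    ("A", (70, 65, 72)), ("A", (55, 60, 50)),
    ("B", (30, 35, 28)), ("B", (45, 40, 50)),
    ("C", (80, 85, 78)), ("C", (60, 62, 66)),
]


def cronbach_alpha(rows):
    """Cronbach's alpha with raters as items; rows are per-quote score tuples."""
    k = len(rows[0])
    item_var = sum(variance(col) for col in zip(*rows))
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var / total_var)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)  # undefined if either rater gives constant scores


def article_means(quotes):
    """One list per rater holding that rater's mean score for each article."""
    grouped = defaultdict(list)
    for article, scores in quotes:
        grouped[article].append(scores)
    n_raters = len(quotes[0][1])
    return [
        [mean(row[r] for row in grouped[a]) for a in sorted(grouped)]
        for r in range(n_raters)
    ]


alpha = cronbach_alpha([scores for _, scores in QUOTES])
per_rater = article_means(QUOTES)
pairwise_rho = {
    (i, j): spearman(per_rater[i], per_rater[j])
    for i, j in combinations(range(len(per_rater)), 2)
}
print(f"alpha = {alpha:.2f}")
for pair, rho in pairwise_rho.items():
    print(pair, round(rho, 2))
```

The ICC is omitted here because it depends on which variance-components model is chosen, a detail the article does not specify.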
A striking domain-specific finding emerged: articles focused on low-level perception and motor processes scored 8.8 Realism points higher than those on high-level cognition (p < .001, Cohen's d = 0.60). The authors note that this quantifies a long-held qualitative intuition in the field.
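For readers less familiar with the effect size, the sketch below shows the usual pooled-standard-deviation form of Cohen's d. Whether the study used exactly this variant is an assumption, and the two toy samples are hypothetical. Under that assumption, an 8.8-point gap with d = 0.60 implies a pooled standard deviation of roughly 14.7 points.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Hypothetical toy samples, chosen so the result is easy to verify by hand:
# means 4 and 3, both sample variances 4, so pooled SD = 2 and d = 0.5.
toy_d = cohens_d([2, 4, 6], [1, 3, 5])

# Back-of-envelope check on the reported figures (assumes the pooled form).
implied_pooled_sd = 8.8 / 0.60

print(toy_d, round(implied_pooled_sd, 1))  # 0.5 14.7
```

A d of 0.60 is conventionally read as a medium effect, meaning the two groups of articles overlap considerably even though their means differ reliably.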
The following table summarises key reliability metrics from the study:
| Metric | Value |
|---|---|
| Held-out combined reliability (harmonic mean of ICC and α) | 0.76 |
| Held-out ICC | 0.79 |
| Held-out Cronbach's α | 0.74 |
| Quote-level ICC (on 6,858 quotes) | 0.80 |
| Quote-level α | 0.76 |
| Article-level rank stability (Spearman r) | 0.96–0.97 |
Implications for AI-Assisted Content Analysis
The researchers present their work as an expert-led case study, stressing that the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis. For enterprise technology decision-makers, the method offers a template for using LLMs to scale interpretive coding, for example when analyzing customer feedback, policy documents, or technical support tickets where nuanced stances must be extracted consistently. Diagnostic-gated prompt optimization and multi-rater reliability analysis provide a measurable check on whether outputs can be trusted, even when the target construct is abstract.
While the study focuses on scientific discourse, its reliance on a zero-shot prompt means that an organization with a well-defined codebook and expert-coded reference annotations could adapt the approach. The three LLMs tested are all available via commercial APIs, making deployment feasible. The researchers did not disclose the exact prompt, but the diagnostic framework is described in sufficient detail for replication.
For now, the case study stands as evidence that LLMs can augment human experts in classifying subtle theoretical positions, a capability that may extend beyond academia to comparably demanding coding tasks where scale and consistency are required.