Enterprises deploying large language model (LLM) products rely on continuous evaluation pipelines where a strong LLM judge scores every interaction. When scores drift downward, teams are paged. But the judge itself is a model behind an API — a silent version bump or scoring-prompt update can change how it scores, creating an ambiguity: is the product worse or is the judge just stricter?
A new paper by Li Yitao, "Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines" (arXiv), presents a method to resolve this ambiguity. It introduces a fixed, human-labeled anchor set that the current judge re-scores at a steady interleave. A second betting e-process, running alongside the monitor on production scores, tracks the judge-versus-human gap on those anchors, and a guard-window rule returns a verdict in {none, system, judge}.
Key Technical Components
- Anchor set: A fixed set of items labeled by humans. The current judge re-scores them on a schedule.
- E-process: A betting e-process tracks the judge-versus-human gap on the anchors and signals when the judge's scoring behavior changes.
- Guard-window rule: A mechanism that returns a verdict: none (no drift), system (product drift), or judge (judge drift).
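The three components can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the class `BettingEProcess`, the helpers `first_fire` and `verdict`, the fixed `bet_fraction`, the gap range of [-1, 1], and the zero baseline are all hypothetical choices. The guard-window logic in particular is only one plausible reading of the rule, and it ignores the case where system and judge drift together.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class BettingEProcess:
    """Two-sided betting e-process for H0: E[x_t | past] == baseline.

    Assumes every observation and the baseline lie in [lo, hi].
    """
    baseline: float
    lo: float = -1.0
    hi: float = 1.0
    bet_fraction: float = 0.5      # share of the largest safe bet, in (0, 1)
    alpha: float = 0.05            # fire when the e-value reaches 1 / alpha
    log_up: float = 0.0            # log-wealth of the bet "the mean rose"
    log_down: float = 0.0          # log-wealth of the bet "the mean fell"
    t: int = 0
    fired_at: Optional[int] = None

    def update(self, x: float) -> bool:
        if not self.lo <= x <= self.hi:
            raise ValueError("observation outside [lo, hi]")
        self.t += 1
        # |x - baseline| <= hi - lo, so each wealth factor is >= 1 - bet_fraction > 0.
        lam = self.bet_fraction / (self.hi - self.lo)
        d = x - self.baseline
        self.log_up += math.log1p(lam * d)
        self.log_down += math.log1p(-lam * d)
        if self.fired_at is None and self.e_value() >= 1.0 / self.alpha:
            self.fired_at = self.t
        return self.fired_at is not None

    def e_value(self) -> float:
        # An equal mixture of two nonnegative martingales is itself one.
        if max(self.log_up, self.log_down) > 700.0:
            return math.inf        # avoid overflow in math.exp
        return 0.5 * (math.exp(self.log_up) + math.exp(self.log_down))


def first_fire(gaps, baseline=0.0):
    """Feed judge-minus-human gaps to an e-process; return the firing step or None."""
    ep = BettingEProcess(baseline=baseline)
    for g in gaps:
        ep.update(g)
    return ep.fired_at


def verdict(system_fired_at, anchor_fired_at, now, guard_width=300):
    """Hypothetical guard-window rule (an assumption, not the paper's exact rule)."""
    if anchor_fired_at is not None:
        return "judge"             # only the judge can move the anchors
    if system_fired_at is None:
        return "none"
    # A production alarm is pending: give the anchors guard_width steps to fire.
    return "system" if now - system_fired_at >= guard_width else "none"


# Synthetic anchor gaps (judge score minus human label), not data from the paper.
stable = [0.1, -0.1] * 100         # judge agrees with humans on average
stricter = [-0.3] * 100            # judge now scores 0.3 below the human labels
anchor_fired_at = first_fire(stable + stricter)
print(anchor_fired_at, verdict(system_fired_at=240, anchor_fired_at=anchor_fired_at, now=540))
```

In this toy stream the anchor e-process stays quiet while the judge tracks the human labels and fires some steps after the judge turns stricter, so a production alarm raised in the same period is attributed to the judge rather than to the product.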
The paper proves three properties of the method: anytime-validity, one-way identification (only the judge can move the anchors), and process orthogonality.
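For readers unfamiliar with the term, anytime-validity has a standard meaning for e-processes, stated here in generic notation (the symbols $E_t$ and $\alpha$ are assumptions for illustration, and the paper's own statement may be phrased differently). If $E_t$ is a nonnegative process whose expected value is at most 1 at every stopping time under the no-drift hypothesis $H_0$, then Ville's inequality bounds the chance of ever raising an alarm:

$$
\Pr_{H_0}\left(\exists\, t \ge 1 : E_t \ge \frac{1}{\alpha}\right) \le \alpha
$$

The bound holds over the whole stream at once, so the monitor can be checked after every item without inflating the false-alarm rate.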
Experimental Results
The method was evaluated on two real judge changes. A silent version bump was attributed to judge drift in 60 of 60 runs, with zero judge-to-system misattribution. A contaminating strict-prompt change was correctly attributed in 110 of 120 runs at a guard width of 300. In contrast, the industry-default rolling z-test raised false alarms on 75% of drift-free streams.
Every experiment was replicated on a second domain (TL;DR summarization) with nothing re-tuned. The strict-prompt change shifted scores harder on that domain, so the anchors fired faster and attribution became perfect: 240 of 240 correct.
| Scenario | Method | Correct Attribution | Not Correctly Attributed | False Alarm Rate |
|---|---|---|---|---|
| Silent version bump | New method | 60/60 (100%) | 0 | - |
| Strict-prompt change (width 300) | New method | 110/120 (91.7%) | 10 | - |
| Strict-prompt change (TL;DR) | New method | 240/240 (100%) | 0 | - |
| Drift-free streams | Rolling z-test | - | - | 75% |
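The high false-alarm rate of a rolling z-test on drift-free data is a general consequence of re-testing at every step. The simulation below is a hedged sketch of that effect, not a reproduction of the paper's baseline or of its 75% figure: the function names, the window of 50, the stream length of 2000, and the Gaussian scores with known mean and standard deviation are all assumptions made for illustration.

```python
import math
import random


def rolling_z_alarm(scores, window=50, mu0=0.0, sigma0=1.0, z_crit=1.96):
    """Return True if any rolling-window z-test rejects along the stream."""
    total = 0.0
    for i, s in enumerate(scores):
        total += s
        if i >= window:
            total -= scores[i - window]    # drop the score that left the window
        if i >= window - 1:
            z = (total / window - mu0) / (sigma0 / math.sqrt(window))
            if abs(z) > z_crit:
                return True
    return False


def false_alarm_rate(n_streams=200, length=2000, seed=0):
    """Share of drift-free streams on which the rolling z-test alarms at least once."""
    rng = random.Random(seed)
    alarms = 0
    for _ in range(n_streams):
        stream = [rng.gauss(0.0, 1.0) for _ in range(length)]  # no drift anywhere
        alarms += rolling_z_alarm(stream)
    return alarms / n_streams


print(f"streams with at least one false alarm: {false_alarm_rate():.0%}")
```

Each individual test has a nominal 5% error rate, but a stream of 2000 items contains dozens of nearly independent windows, so the chance that at least one of them trips is far higher. An e-process avoids this because its error bound covers the entire stream rather than a single look.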
Cost Efficiency
According to the paper, the monitor runs at approximately 0.64 times the cost of strong-judging every item, or 0.21 times that cost in a cheaper-but-deafer regime.
Implications for Enterprise AI Pipelines
For technology leaders deploying LLM-based systems in critical domains like supply chain analytics or customer-facing chatbots, this research provides a principled way to maintain evaluation integrity. Attributing drift correctly helps engineering teams respond to genuine product degradation rather than to changes in the judge, which cuts unnecessary pages and false alarms.