Enterprises increasingly rely on large language models (LLMs) to evaluate other AI systems, a practice known as LLM-as-a-judge. However, according to a new paper on arXiv (arXiv:2606.15610) by Hiroyasu Usami, Keisuke Hara, Ayato Tsuboi, and Naohiko Matsuda, these judge LLMs often carry hidden biases that distort assessments. The researchers argue that a judge should be characterized and reported as a measurement instrument, not summarized by a single accuracy or win-rate number.
The team introduces a Judge Datasheet protocol that measures several key psychometric properties:
- Dark current: response under true-vacuum inputs (e.g., empty prompts)
- Stable cross-sensitivity: variation due to same-quality surface changes
- Positional false preference: bias toward answers in a certain position
- Target sensitivity: response to controlled quality differences (a "ladder" of quality)
- Criterion or operating point: induced by tie-breaking instructions
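As a rough illustration of the first and third properties, the sketch below estimates a dark-current rate and a first-slot share from a list of verdicts. The verdict labels (`"A"`, `"B"`, `"tie"`), the function names, and the sample data are assumptions made for this example; the paper's exact definitions and estimators may differ.

```python
from collections import Counter


def dark_current(verdicts):
    """Fraction of non-tie verdicts when both inputs are content-free.

    Assumption: a well-behaved judge should report a tie when there is
    nothing to compare, so any preference here counts as dark current.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return (counts["A"] + counts["B"]) / len(verdicts)


def first_slot_share(verdicts):
    """Among non-tie verdicts, the share that picked the first slot ("A").

    A value far from 0.5 on same-quality pairs hints at positional bias.
    Returns None when every verdict is a tie.
    """
    counts = Counter(verdicts)
    non_tie = counts["A"] + counts["B"]
    if non_tie == 0:
        return None
    return counts["A"] / non_tie


# Hypothetical verdicts from a judge shown two empty answers.
vacuum_verdicts = ["tie", "A", "tie", "A", "B", "tie", "tie", "A"]

print("dark current:", dark_current(vacuum_verdicts))
print("first-slot share:", first_slot_share(vacuum_verdicts))
```

In practice the verdict list would come from running the judge on a fixed set of null prompts, so that the rate can be compared across models.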
The protocol also performs a direction-stability decomposition to determine whether an apparent preference on pairs with no quality gap (Delta0) comes from a stable response to surface features or from disguised positional bias.
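One plausible way to implement such a decomposition is to judge every same-quality pair twice, once in each presentation order, and classify the two verdicts together. The sketch below assumes that design; the category names, the `"first"`/`"second"`/`"tie"` labels, and the sample data are hypothetical and are not taken from the paper.

```python
def decompose(paired_verdicts):
    """Classify same-quality pairs judged in both presentation orders.

    Each element is (forward, swapped), where each verdict names the
    winning slot: "first", "second", or "tie". The swapped run shows
    the same two answers with their positions exchanged.
    """
    if not paired_verdicts:
        raise ValueError("need at least one pair")
    counts = {"stable": 0, "positional": 0, "tie": 0, "mixed": 0}
    for forward, swapped in paired_verdicts:
        if forward == "tie" and swapped == "tie":
            counts["tie"] += 1
        elif "tie" in (forward, swapped):
            # Preference in one order only: neither clearly stable nor positional.
            counts["mixed"] += 1
        elif forward == swapped:
            # Same slot wins in both orders, so the verdict follows position.
            counts["positional"] += 1
        else:
            # The winning slot flips with the swap, so the same answer wins twice.
            counts["stable"] += 1
    total = len(paired_verdicts)
    return {name: count / total for name, count in counts.items()}


# Hypothetical verdicts for five same-quality pairs.
pairs = [
    ("first", "second"),   # same answer preferred in both orders
    ("first", "first"),    # first slot preferred regardless of content
    ("tie", "tie"),
    ("second", "second"),  # second slot preferred regardless of content
    ("first", "tie"),
]
shares = decompose(pairs)
print(shares)
```

Under this reading, a large "stable" share points to sensitivity to surface features of one answer, while a large "positional" share points to slot bias that a single-order win rate would hide.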
Case Study: Three Open-Weight Models
In a case study of three open-weight LLMs, the authors found stark differences:
- Llama-3.1-8B shows high dark current and presentation-conflicted Delta0 behavior: it responds even to null inputs, and its preferences shift with surface formatting.
- Qwen2.5-14B is described as "vacuum-clean" (low dark current) and target-sensitive, but it mixes stable and positional over-discrimination.
- Qwen2.5-32B is also vacuum-clean with low stable cross-sensitivity and low positional false preference.
| Model | Dark Current | Stable Cross-Sensitivity | Positional False Preference | Target Sensitivity |
|---|---|---|---|---|
| Llama-3.1-8B | High | High (conflicted) | High | Low |
| Qwen2.5-14B | Low (vacuum-clean) | Mixed | Over-discrimination | High |
| Qwen2.5-32B | Low (vacuum-clean) | Low | Low | Moderate |
Tie Instructions and Criterion Shift
The study also examined how tie-breaking instructions affect results. A strict tie criterion eliminates Qwen2.5-32B's Delta0 false preference but absorbs marginal Delta1 target signals into ties, while preserving sensitivity to larger quality gaps (Delta5). The authors conclude that "prompting moves the criterion, not the resolution": instructions change the threshold at which the judge declares a tie rather than a preference, but they do not sharpen its ability to discriminate fine quality differences.
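The criterion-versus-resolution distinction can be illustrated with a toy signal-detection model. This is an assumption made for illustration, not the paper's analysis: the judge perceives the true quality gap plus Gaussian noise, and declares a tie whenever the perceived gap falls inside a band of half-width `criterion`. The mapping of Delta0, Delta1, and Delta5 to gaps of 0, 1, and 5 noise units is likewise hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def verdict_probs(gap, criterion, noise=1.0):
    """Verdict probabilities under the toy model.

    perceived = gap + Normal(0, noise)
    perceived >  criterion  -> prefers the better answer
    perceived < -criterion  -> prefers the worse answer
    otherwise               -> tie
    `noise` plays the role of resolution; `criterion` is the tie threshold.
    """
    p_better = 1.0 - normal_cdf((criterion - gap) / noise)
    p_worse = normal_cdf((-criterion - gap) / noise)
    return {
        "prefers_better": p_better,
        "prefers_worse": p_worse,
        "tie": 1.0 - p_better - p_worse,
    }


# Same noise (resolution) in both settings; only the tie threshold moves.
for label, criterion in [("lenient", 0.5), ("strict", 2.0)]:
    for gap in (0.0, 1.0, 5.0):
        probs = verdict_probs(gap, criterion)
        print(
            f"{label:8s} gap={gap:.0f} "
            f"better={probs['prefers_better']:.3f} "
            f"worse={probs['prefers_worse']:.3f} "
            f"tie={probs['tie']:.3f}"
        )
```

In this model, raising the criterion turns most zero-gap and small-gap verdicts into ties while large gaps still produce a clear preference. The noise term never changes, which is one way to read the claim that prompting does not improve resolution.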
The researchers explicitly state they do not claim confirmation of the downstream mechanism hypothesis that motivated the work. Instead, the contribution is a metrological protocol for measuring the measuring device before downstream claims are made.
Implications for Enterprise AI Procurement
For CTOs and technology leaders deploying LLM-based evaluation in areas like supply chain AI, customer service bots, or document processing, this protocol offers a way to vet evaluator models before trusting their outputs. High dark current or positional bias could lead to incorrect vendor selection or flawed model performance benchmarks. The Judge Datasheet provides a standardised, reproducible method to assess bias, which matters more as AI evaluation becomes a core enterprise function.