Psychometric Datasheet Reveals 'Dark Current' Bias in LLM-as-a-Judge Evaluation Systems

Researchers introduce a Judge Datasheet protocol to measure biases in LLM-as-a-judge systems, including dark current under vacuum inputs and positional false preference. A case study of three open-weight models reveals stark differences in measurement reliability, with implications for enterprise AI evaluation.

iGEN Editorial

June 16, 2026

Enterprises increasingly rely on large language models (LLMs) to evaluate other AI systems, a practice known as LLM-as-a-judge. However, according to a new arXiv paper (arXiv:2606.15610) by Hiroyasu Usami, Keisuke Hara, Ayato Tsuboi, and Naohiko Matsuda, these judge models often carry hidden biases that distort assessments. The researchers argue that a judge should be characterized and reported as a measurement instrument, not reduced to a single accuracy or win-rate figure.

The team introduces a Judge Datasheet protocol that measures several key psychometric properties:

Dark current: the judge's response under true-vacuum inputs (e.g., empty prompts)
Stable cross-sensitivity: variation in verdicts caused by surface changes that leave quality unchanged
Positional false preference: bias toward whichever answer occupies a particular position
Target sensitivity: response to controlled quality differences (a "ladder" of quality)
Criterion or operating point: the tie-versus-preference threshold induced by tie-breaking instructions
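
As a rough illustration of how the first and third of these quantities could be tallied, the sketch below assumes a judge emits one of three verdicts per trial: "A" (the first-listed answer), "B" (the second-listed answer), or "tie". The function names, the verdict encoding, and the sample data are hypothetical and are not taken from the paper, which may define these statistics differently.

```python
from collections import Counter


def non_tie_rate(verdicts):
    """Share of trials on which the judge picked a side instead of a tie.

    Applied to vacuum trials (both answers empty), this is one simple
    way to summarize dark current. Assumed definition, not the paper's.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return (counts["A"] + counts["B"]) / len(verdicts)


def position_skew(verdicts):
    """Signed preference for the first slot: P(A) - P(B).

    Zero means no positional lean; positive favors the first-listed
    answer. Only meaningful when the two answers are of equal quality.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return (counts["A"] - counts["B"]) / len(verdicts)


# Made-up verdicts from eight vacuum trials, for illustration only.
vacuum_verdicts = ["A", "tie", "A", "tie", "B", "tie", "tie", "A"]

dark_current = non_tie_rate(vacuum_verdicts)
skew = position_skew(vacuum_verdicts)
print(dark_current, skew)
```

In this toy data the judge expresses a preference on half of the vacuum trials, and the first slot wins more often than the second, so both readings would be flagged on a datasheet.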

The protocol also performs a direction-stability decomposition to determine whether an apparent preference between equal-quality answers (Delta0) reflects a stable response to surface features or a disguised positional bias.
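
One plausible way to read such a decomposition is to judge each equal-quality pair twice, once in each presentation order, and check whether the winner follows the content or the slot. The sketch below works under that assumption; the labels, function names, and sample trials are illustrative and do not reproduce the authors' procedure.

```python
from collections import Counter

LABELS = ("stable", "positional", "tie_or_mixed")


def classify_pair(forward, backward):
    """Classify one equal-quality pair judged in both presentation orders.

    forward  -- winning slot when answer X is shown first
    backward -- winning slot when answer Y is shown first
    Each verdict is "first", "second", or "tie".
    """
    if "tie" in (forward, backward):
        return "tie_or_mixed"
    if forward == backward:
        # The same slot won both times, so the verdict followed position.
        return "positional"
    # Different slots won, so the same answer won in both orders.
    return "stable"


def decompose(pairs):
    """Return the share of pairs falling under each label."""
    if not pairs:
        raise ValueError("need at least one pair")
    counts = Counter(classify_pair(f, b) for f, b in pairs)
    return {label: counts[label] / len(pairs) for label in LABELS}


# Made-up (forward, backward) verdicts for five pairs.
trials = [
    ("first", "second"),   # X wins in both orders: stable
    ("first", "first"),    # first slot wins twice: positional
    ("second", "second"),  # second slot wins twice: positional
    ("tie", "first"),      # a tie in one order: tie_or_mixed
    ("second", "first"),   # Y wins in both orders: stable
]
shares = decompose(trials)
print(shares)
```

A judge whose non-tie verdicts land mostly in the "stable" bucket is reacting to something in the text itself, while a large "positional" share suggests the apparent preference is an artifact of ordering.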

Case Study: Three Open-Weight Models

In a case study of three open-weight LLMs, the authors found stark differences:

Llama-3.1-8B shows high dark current and presentation-conflicted Delta0 behavior — meaning it responds even to null inputs and its preferences shift with surface formatting.
Qwen2.5-14B is described as "vacuum-clean" (low dark current) and target-sensitive, but it mixes stable and positional over-discrimination.
Qwen2.5-32B is also vacuum-clean with low stable cross-sensitivity and low positional false preference.

Model	Dark Current	Stable Cross-Sensitivity	Positional False Preference	Target Sensitivity
Llama-3.1-8B	High	High (conflicted)	High	Low
Qwen2.5-14B	Low (vacuum-clean)	Mixed	Over-discrimination	High
Qwen2.5-32B	Low (vacuum-clean)	Low	Low	Moderate

Tie Instructions and Criterion Shift

The study also examined how tie-breaking instructions affect results. A strict tie criterion eliminates Qwen2.5-32B's Delta0 false preference but also absorbs marginal Delta1 target signals into ties, while preserving sensitivity to larger quality gaps (Delta5). The authors conclude that "prompting moves the criterion, not the resolution": instructions change the threshold at which the judge declares a preference rather than a tie, but they do not sharpen its ability to discriminate fine quality differences.
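
The criterion-versus-resolution distinction can be illustrated with a toy signal-detection model. This is an assumption of this article's sketch, not the model used in the paper: the judge's internal evidence is taken to be the quality gap plus Gaussian noise, and a tie is declared whenever the evidence falls inside a band of half-width `criterion`. The gaps 0, 1, and 5 stand in for Delta0, Delta1, and Delta5, and all parameter values are invented.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def preference_rate(gap, criterion, sensitivity=1.0, noise_sd=1.0):
    """Probability that the judge declares a preference instead of a tie.

    Evidence ~ Normal(sensitivity * gap, noise_sd). A tie is reported
    when |evidence| < criterion. Tie instructions are modeled as
    changing `criterion` only; `sensitivity` and `noise_sd` (the
    resolution) stay fixed.
    """
    mean = sensitivity * gap
    upper = normal_cdf((criterion - mean) / noise_sd)
    lower = normal_cdf((-criterion - mean) / noise_sd)
    return 1.0 - (upper - lower)


# Hypothetical lenient and strict tie bands applied to the same judge.
rates = {
    (name, gap): preference_rate(gap, criterion)
    for name, criterion in (("lenient", 0.2), ("strict", 2.0))
    for gap in (0, 1, 5)
}

for (name, gap), rate in rates.items():
    print(f"{name:8s} gap={gap}  P(preference)={rate:.3f}")
```

Under these invented parameters the strict band removes almost all preferences at gap 0 and most at gap 1, yet leaves gap 5 nearly untouched. The ratio of signal to noise never changes, which is the sense in which the threshold moves while the resolution stays put.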

The researchers explicitly state they do not claim confirmation of the downstream mechanism hypothesis that motivated the work. Instead, the contribution is a metrological protocol for measuring the measuring device before downstream claims are made.

Implications for Enterprise AI Procurement

For CTOs and technology leaders deploying LLM-based evaluation in areas like supply chain AI, customer service bots, or document processing, this protocol offers a way to vet evaluator models before trusting their outputs. High dark current or positional bias could lead to incorrect vendor selection or flawed model performance benchmarks. The Judge Datasheet provides a standardised, reproducible method for assessing such bias, which matters more as AI evaluation becomes a core enterprise function.

