judge

3 stories

Artificial Intelligence #llm #artificial intelligence

Psychometric Datasheet Reveals 'Dark Current' Bias in LLM-as-a-Judge Evaluation Systems

Researchers introduce a Judge Datasheet protocol to measure biases in LLM-as-a-judge systems, including dark current under vacuum inputs and positional false preference. A case study of three open-weight models reveals stark differences in measurement reliability, with implications for enterprise AI evaluation.

Jun 16, 2026 1 source

Technology

Artificial Intelligence #llm #judge

Metric Match: New Subset Selection Method Improves LLM Judge Reliability Evaluation, Cuts Annotation Costs by 32.5%

Researchers developed Metric Match, a subset selection method that reduces costly human annotations needed to evaluate LLM judge reliability. The approach achieves a 0.838 win-rate over random selection, cuts estimation error by 18.7%, and reduces annotation needs by 32.5%. A medical case study showed $1,041.67 in savings.

Jun 16, 2026 1 source

Technology

Artificial Intelligence #judge #lawyers

Judge Kicks Lawyers Off Case After Both Sides Used AI to Generate Hallucinated Legal Citations

Senior US District Judge Sharion Aycock sanctioned four lawyers after discovering they used AI to produce legal citations that did not exist. The judge disqualified all lawyers from the case, barred two from the district for two years, and imposed a total fine of $8,000, setting a precedent that ignorance of AI hallucinations is not a viable defense.

Jun 16, 2026 1 source