Topic
judge
Psychometric Datasheet Reveals 'Dark Current' Bias in LLM-as-a-Judge Evaluation Systems
Researchers introduce a Judge Datasheet protocol to measure biases in LLM-as-a-judge systems, including dark current under vacuum inputs and positional false preference. A case study of three open-weight models reveals stark differences in measurement reliability, with implications for enterprise AI evaluation.
Metric Match: New Subset Selection Method Improves LLM Judge Reliability Evaluation, Cuts Annotation Costs by 32.5%
Researchers developed Metric Match, a subset selection method that reduces the number of costly human annotations needed to evaluate LLM judge reliability. The approach achieves a 0.838 win rate over random selection, cuts estimation error by 18.7%, and reduces annotation needs by 32.5%. A medical case study showed $1,041.67 in savings.
Judge Kicks Lawyers Off Case After Both Sides Used AI to Generate Hallucinated Legal Citations
Senior US District Judge Sharion Aycock sanctioned four lawyers after discovering they used AI to produce legal citations that did not exist. The judge disqualified all lawyers from the case, barred two from the district for two years, and imposed a total fine of $8,000, setting a precedent that ignorance of AI hallucinations is not a viable defense.