Topic

evaluation

32 stories

Artificial Intelligence #eeg#foundation models

EEG Foundation Models Show Promise for Burst-Suppression Detection in ICU Without Patient-Specific Calibration

A new study on arXiv evaluates three EEG foundation models—REVE-base, LUNA-large, and LuMamba-Tiny—for automatic burst-suppression detection in ICU patients, finding REVE-base achieves the highest event-based F1-score (0.868) and reduces burst-per-minute error by 52.1% compared to a task-specific EEGNet baseline.

Jul 8, 2026 1 source

Benchmarking Agentic Review Systems: AI Peer Review Achieves 83% Pairwise Accuracy but Falls Short on Error Detection

Technology

Artificial Intelligence #benchmarking#agentic

A study by Nguyen et al. benchmarks two open-source and one proprietary AI review system on peer review tasks. The best configuration (OpenAIReview + GPT-5.5) achieves 83.0% pairwise accuracy in tracking paper quality but only 71.6% recall in detecting injected errors. User feedback shows a positive-to-negative vote ratio of 1.44:1, with common complaints about false positives. The research highlights both the potential and limitations of current AI agents in evaluation tasks.

Jul 8, 2026 1 source

New StaminaBench Benchmark Reveals Coding Agents Fail After 5-6 Turns

Technology

Artificial Intelligence #staminabench#coding agents

Researchers introduce StaminaBench, a benchmark that measures how many consecutive interaction turns coding agents can handle. Testing six harnesses and seven open-source LLMs over 100-turn scenarios, they find that all models fail within 5-6 turns. Providing test feedback increases the number of passed turns by up to 12x, highlighting the importance of iterative testing.

Jun 22, 2026 2 sources

New Benchmark Reveals AI Agents Leak Private Data Even When Focused on Tasks

Technology

Artificial Intelligence #benchmark#privacy

A new benchmark called TRAP evaluates the trade-off between task accuracy and privacy leakage in AI agents handling sensitive documents. Testing 22 models, the study finds non-trivial privacy leakage across all model families, with instruction-following ability correlating with leakage rate. The authors propose structural private field isolation using hash keys to prevent leakage without sacrificing task performance.

Jun 21, 2026 1 source

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

Technology

Artificial Intelligence #large language models#essay scoring

A study by Zuo et al. systematically analyzes hidden representations of eight LLMs across three essay datasets, finding that essay quality information is linearly decodable, emerges progressively across layers, and is robust to prompting strategies. The research identifies individual 'essay scoring neurons' and shows that their distribution shifts with essay length, offering insights into interpretability of LLM-based automated essay scoring systems.

Jun 20, 2026 1 source

How Learner-Based Concept Drift Detection Maintains Accuracy in Evolving Data Streams

Technology

Artificial Intelligence #concept drift#machine learning

A new study from arXiv reviews concept drift detection algorithms for machine learning in evolving streaming environments. The authors examine theoretical characteristics and evaluate detectors on synthetic and real-world datasets, focusing on abrupt and gradual drift types.

Jun 20, 2026 1 source

New Benchmark Reveals Remote Sensing AI Models Fail at Negation Comprehension

Technology

Artificial Intelligence #ai#multimodal

A new study introduces RS-Neg, the first benchmark to evaluate negation comprehension in remote sensing multimodal large language models. The evaluation reveals that advanced models exhibit hallucinations and performance degradation when handling negation. The proposed NeFo method, using about 5% unlabeled test samples, significantly improves negation understanding, with implications for critical applications like emergency response and logistics.

Jun 20, 2026 1 source

Beyond Static Leaderboards: Predictive Validity for Evaluating LLM Agents in Enterprise AI

Technology

Artificial Intelligence #llm#agents

A new paper on arXiv proposes replacing static aggregate-score leaderboards with predictive validity—correlation between in-sample and out-of-sample rank—for evaluating LLM agents. The authors argue that current benchmarks underspecify deployed-agent evaluation, based on fourteen parallel implementation studies and seven prior agent benchmarks. They introduce a twelve-tier measurement apparatus and falsifiable out-of-distribution criteria.

Jun 20, 2026 1 source

New Study Challenges Prior Claims on Scaling Context Length in Imitation Learning

Technology

Artificial Intelligence #training#evaluation

Researchers evaluated diffusion policies for robotic imitation learning across varying context lengths, challenging prior claims that long-context scaling is fragile. They propose a training algorithm that jointly trains policies at multiple context lengths, reducing sample complexity.

Jun 17, 2026 1 source

WorkflowPerturb Benchmark Offers Calibrated Stress Tests for Multi-Agent Workflow Metrics

Technology

Artificial Intelligence #stress tests#multi-agent

WorkflowPerturb introduces 4,973 golden workflows and 44,757 perturbed variants with three perturbation types at severity levels 10%, 30%, and 50%, enabling calibrated interpretation of workflow evaluation metrics for change management in multi-agent systems.

Jun 17, 2026 1 source

MedAI Study Evaluates TxAgent's Therapeutic Reasoning in NeurIPS CURE-Bench Competition

Technology

Artificial Intelligence #medai#txagent

A MedAI study evaluated TxAgent, an agentic AI system for therapeutic reasoning, in the NeurIPS CURE-Bench 2025 Challenge. TxAgent uses a fine-tuned Llama-3.1-8B model with iterative retrieval-augmented generation and a unified biomedical tool suite. The work was awarded the Excellence Award in Open Science.

Jun 17, 2026 2 sources

New JE-IRT Framework Reveals Multidimensional Abilities of Large Language Models

Technology

Artificial Intelligence #joint embedding#item response theory

Standard LLM evaluation compresses diverse abilities into single scores. JE-IRT, a geometric item-response framework, embeds both LLMs and questions in a shared space, where direction encodes semantics and norm encodes difficulty. The approach reveals topical specialization, explains out-of-distribution behavior, and uncovers cross-subject ability directions like an arithmetic axis, offering a more interpretable lens for model evaluation.

Jun 17, 2026 1 source

TERMS-Bench Diagnoses LLM Negotiation Agents Beyond Deal Rate for Enterprise Procurement

Technology

Artificial Intelligence #llm#negotiation

A new benchmark called TERMS-Bench goes beyond deal rate to diagnose why LLM negotiation agents fail, evaluating 13 frontier models on surplus extraction, cue use, belief calibration, and compliance. For enterprise procurement and trade, this offers actionable insights into AI agent weaknesses.

Jun 17, 2026 1 source

Study Reveals Binary Classifiers That Excel Under Extreme Imbalance Without Rebalancing

Technology

Artificial Intelligence #binary classifiers#class imbalance

A new study from arXiv systematically evaluates binary classifiers under class imbalance without rebalancing techniques. Results show that advanced models such as TabPFN and boosting-based ensembles maintain high performance even as minority class size shrinks, while traditional classifiers deteriorate. The research offers guidance for model selection in imbalanced learning tasks.

Jun 17, 2026 1 source

New Benchmark and Method Address Occlusion in Vision-Language-Action Models for Robotics

Technology

Artificial Intelligence #vision-language-action#occlusion

Researchers introduced LIBERO-Occ, an occlusion-oriented benchmark for Vision-Language-Action (VLA) models, and proposed Viewpoint Imagination (VIM), a method that generates a complementary view from an occluded primary observation to condition action prediction. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion, and VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment.

Jun 16, 2026 1 source

BRITE Benchmark Reveals Critical Gaps in Text-to-Video Models' Object-Action Binding and Audio-Visual Sync

Technology

Artificial Intelligence #benchmark#text-to-video

A new benchmark called BRITE provides the first unified framework for evaluating text-to-video (T2V) models on implausible prompts, audio-visual consistency, and interpretable QA-based assessment. Testing five state-of-the-art models including Sora 2 and Veo 3.1, BRITE reveals that while models excel at static object composition, they show significant degradation in object-action binding and audio-visual synchronization.

Jun 16, 2026 1 source

New Framework TRACED Evaluates LLM Reasoning Using Geometric Stability and Progress

Technology

Artificial Intelligence #llms#reasoning

A new research framework called TRACED evaluates LLM reasoning quality by analyzing geometric progress and stability of reasoning traces. It distinguishes correct reasoning from hallucinations based on trajectory patterns, offering a more robust evaluation method than scalar probabilities.

Jun 16, 2026 1 source

New EEG Benchmark Promises Standardized Evaluation of Foundation Models

Technology

Artificial Intelligence #eeg#foundation models

A new benchmark called EEG-FM-Bench aims to standardize evaluation of electroencephalography foundation models (EEG-FMs). It integrates 14 datasets across 10 paradigms and provides tools for gradient and representation analysis. Early experiments reveal critical insights about multi-task learning, pre-training efficiency, and model scaling.

Jun 16, 2026 1 source

TuneJury: Open Metric Improves Music Generation Preference Alignment

Technology

Artificial Intelligence #music generation#preference alignment

Researchers introduce TuneJury, an open metric for improving music generation preference alignment. The model predicts preference scores from text prompts and audio clips, trained on diverse human-preference labels, and supports data filtering and post-hoc calibration.

Jun 16, 2026 1 source

SkillsBench Benchmark Measures How Agent Skills Boost LLM Performance Across Diverse Tasks

Technology

Artificial Intelligence #skillsbench#benchmarking

Researchers introduce SkillsBench, a benchmark with 87 tasks across 8 domains to measure whether agent skills improve LLM performance. Curated skills raised the average pass rate from 33.9% to 50.5%, with focused skills of at most three modules outperforming larger bundles. Smaller models with skills can match larger models without them.

Jun 16, 2026 1 source

New MBABench Evaluates LLM Agents on End-to-End Finance Spreadsheet Tasks

Technology

Artificial Intelligence #llm agents#artificial intelligence

MBABench, a new benchmark, evaluates LLM agents on end-to-end spreadsheet tasks in finance, focusing on modeling and scenario analysis. The benchmark assesses accuracy, formula use, and formatting. Claude family models lead but still fall short of professional standards.

Jun 16, 2026 1 source

P3B3 Benchmark Reveals Strong Brazilian Portuguese Bias in Large Language Models

Technology

Artificial Intelligence #llm#benchmark

A new research paper introduces P3B3, an expert-curated benchmark for measuring bias between European and Brazilian Portuguese in large language models. Experiments show that most LLMs strongly prefer Brazilian Portuguese, underscoring the need for more balanced representation of language varieties in conversational AI.

Jun 16, 2026 1 source

Security Analysis of Long-Horizon Agentic AI Systems: Threats, Evaluation, and Framework Development

Technology

Artificial Intelligence #security#ai

A recent arXiv paper by Almalki and Masud provides a structured analysis of security challenges in long-horizon agentic AI systems. It reviews existing threats, evaluation approaches, attack propagation mechanisms, and security frameworks, and proposes a taxonomy of threats and a framework for analyzing attack propagation to support future research.

Jun 16, 2026 1 source

UXBench: Measuring the Actionability of LLM-Generated UX Critiques

Technology

Artificial Intelligence #llms#ux

UXBench evaluates LLM-generated UX critiques for actionability. It uses web fixtures over ten product-surface families and measures whether repair agents can improve interfaces. Results show models vary significantly in reliability.

Jun 16, 2026 1 source

New LLM-Based Simulator Evaluates Deliberative Polling Information Systems Against Strategic Attacks

Technology

Artificial Intelligence #agentic simulator#deliberative polling

Researchers introduce the LLM-based Agentic Bipolar Argumentation Simulator (ABAS) to evaluate information systems for deliberative polling. ABAS simulates autonomous agents voting and submitting justifications, measuring coverage of the reason space. Experiments show that a tag-flood attack collapses coverage, while a reversed-PageRank weighting resists it markedly better than uniform weights.

Jun 16, 2026 1 source

Psychometric Datasheet Reveals 'Dark Current' Bias in LLM-as-a-Judge Evaluation Systems

Technology

Artificial Intelligence #llm#artificial intelligence

Researchers introduce a Judge Datasheet protocol to measure biases in LLM-as-a-judge systems, including dark current under vacuum inputs and positional false preference. A case study of three open-weight models reveals stark differences in measurement reliability, with implications for enterprise AI evaluation.

Jun 16, 2026 1 source

Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse? A New Study Evaluates Four Models

Technology

Artificial Intelligence #llms#aphasia

A study examined whether instruction-tuned large language models (LLMs) can reliably perform token-level classification of Correct Information Units (CIUs) from aphasic discourse transcripts. Four models—Llama-3.1-8B, Qwen2.5-7B, Mistral-7B, and Phi-3-mini—were tested under zero-shot and few-shot prompting conditions. Few-shot prompting yielded competitive mean F1 scores between 0.776 and 0.817 for three models, but zero-shot prompting was insufficient and Phi-3-mini was unstable. The authors recommend a human-in-the-loop approach for automated CIU scoring.

Jun 16, 2026 1 source

MMLongEmbed Benchmark Reveals Limitations in Long-Context Multimodal Embedding Models

Technology

Artificial Intelligence #multimodal#embedding

MMLongEmbed is the first comprehensive benchmark for evaluating multimodal embedding models (MEMs) in long-context scenarios. It comprises four retrieval tasks covering text, document, and video modalities. The evaluation reveals that current MEMs rely heavily on superficial feature matching and struggle with deep semantic and structural dependencies, with performance degrading systematically based on context length and key information placement.

Jun 16, 2026 1 source

Risk-Aware LLM Agents for Geospatial Data Retrieval: New Framework Passes Adversarial Tests

Technology

Artificial Intelligence #llm#geospatial

Researchers present a risk-aware LLM agent framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. The system integrates Guardrail, General-QA, and Recommender-Analyst agents to convert user intent into structured API calls. Preliminary adversarial evaluation shows prompt-level safety instructions improve robustness, though rare high-impact failures persist.

Jun 16, 2026 1 source

New OSGuard Benchmark Evaluates Safety of Computer-Use Agents for Enterprise AI Deployment

Technology

Artificial Intelligence #ai safety#benchmark

Researchers introduce OSGuard, a benchmark suite for evaluating safety in computer-use agents. It includes action-level guardrail decisions and a risk-augmented execution suite to detect unsafe completions that satisfy nominal task objectives. Early tests show current multimodal guardrails perform well on isolated action judgments but reveal gaps in end-to-end safety.

Jun 16, 2026 1 source

RecourseBench: Modular Framework Promises Reproducible Evaluation of AI Recourse Methods

Technology

Artificial Intelligence #algorithmic recourse#machine learning

A new framework called RecourseBench aims to standardize and validate algorithmic recourse methods—counterfactual explanations that show individuals how to reverse an AI's decision. It decomposes the evaluation pipeline into five decoupled layers and integrates 28 state-of-the-art methods, with automated tests to verify reproducibility.

Jun 16, 2026 1 source

Metric Match: New Subset Selection Method Improves LLM Judge Reliability Evaluation, Cuts Annotation Costs by 32.5%

Technology

Artificial Intelligence #llm#judge

Researchers developed Metric Match, a subset selection method that reduces costly human annotations needed to evaluate LLM judge reliability. The approach achieves a 0.838 win-rate over random selection, cuts estimation error by 18.7%, and reduces annotation needs by 32.5%. A medical case study showed $1,041.67 in savings.

Jun 16, 2026 1 source