Topic

benchmarking

18 stories

Benchmarking Agentic Review Systems: AI Peer Review Achieves 83% Pairwise Accuracy but Falls Short on Error Detection

Artificial Intelligence #benchmarking #agentic

A study by Nguyen et al. benchmarks two open-source AI review systems and one proprietary system on peer review tasks. The best configuration (OpenAIReview + GPT-5.5) achieves 83.0% pairwise accuracy in tracking paper quality but only 71.6% recall in detecting injected errors. User feedback shows a positive-to-negative vote ratio of 1.44:1, with common complaints about false positives. The research highlights both the potential and the limitations of current AI agents in evaluation tasks.

Jul 8, 2026 1 source

ROSE Benchmark Reveals Perception-to-Action Gap in Multimodal AI Models

Artificial Intelligence #ai #multimodal

The ROSE benchmark measures how reliably multimodal large language models (MLLMs) convert visual evidence into context-appropriate actions. Testing nine recent models, researchers found performance drops of up to 44.5 percentage points from counting to region-conditioned action, while humans achieve 98.8% accuracy.

Jun 22, 2026 3 sources

CRAX Benchmark Delivers 100x Speedup for Safe Reinforcement Learning Research

Artificial Intelligence #reinforcement learning #safe rl

Researchers have introduced CRAX (Constrained RL Accelerated with JAX), a fast safe reinforcement learning benchmark that leverages hardware acceleration to achieve up to 100x speedups over CPU-based alternatives. Built on MuJoCo XLA, it includes six environment suites and three agent-specific tasks across three difficulty levels. Evaluation of six popular safe RL methods reveals trade-offs between performance and safety, with curriculum learning improving results.

Jun 20, 2026 1 source

New Benchmark BIM-Edit Reveals Large Language Models Struggle with IFC-Based Building Information Model Editing

Artificial Intelligence #bim #llm

Researchers introduced BIM-Edit, a benchmark for evaluating large language models (LLMs) on natural-language editing of Building Information Models (BIM) in IFC format. The best-performing LLM achieved only a 49.5% average score across geometric, semantic, and topological metrics, and no model fully solved more than 3.4% of tasks, highlighting a substantial gap between current LLM capabilities and structured engineering design needs.

Jun 20, 2026 1 source

QMFOL Benchmark Reveals LLM Reasoning Degrades with Logical Complexity, New Framework Enables Precise Evaluation

Artificial Intelligence #llm #reasoning

A new automated framework called QMFOL generates deductive reasoning tasks with quantifiable logical complexity, enabling precise evaluation of LLM reasoning. The associated benchmark, QMFOLBench, comprises 2,880 instances across 960 configurations. Evaluations on six large reasoning models (LRMs) and two LLMs show performance degrades and computational overhead increases with rising logical complexity, with models performing better on True-labeled tasks than False or Unknown ones.

Jun 20, 2026 1 source

SLUM-i: AI Semi-Supervised Learning Maps Informal Settlements with Benchmark Dataset

Artificial Intelligence #semi-supervised learning #urban mapping

A new AI framework called SLUM-i uses semi-supervised learning to map informal settlements in cities such as Lahore, Karachi, and Mumbai. It introduces a benchmark dataset and improves mIoU by up to 5.9 percentage points over existing methods.

Jun 17, 2026 1 source

Study Reveals Binary Classifiers That Excel Under Extreme Imbalance Without Rebalancing

Artificial Intelligence #binary classifiers #class imbalance

A new study posted on arXiv systematically evaluates binary classifiers under class imbalance without rebalancing techniques. Results show that advanced models such as TabPFN and boosting-based ensembles maintain high performance even as the minority class shrinks, while traditional classifiers deteriorate. The research offers guidance for model selection in imbalanced learning tasks.

Jun 17, 2026 1 source

DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents

Artificial Intelligence #technology #ai

Researchers present DualGauge, an automated framework for jointly evaluating correctness and security of code generated by LLMs from natural-language specifications. A benchmark of 307 tasks across three languages shows that even the strongest models achieve under 15% joint security-functionality success, while factors like scale and instruction tuning do not reliably improve outcomes. Three leading agentic coding systems also show no advantage over direct generation.

Jun 16, 2026 1 source

SpatialWorld Benchmark Reveals Multimodal Agents Struggle with Interactive Spatial Reasoning

Artificial Intelligence #spatial reasoning #multimodal agents

Researchers introduced SpatialWorld, a benchmark for evaluating interactive spatial understanding of multimodal agents in real-world tasks. Testing 15 advanced agents, the strongest model (GPT-5) achieved only 17.4% task success rate, highlighting challenges in active exploration and long-horizon planning.

Jun 16, 2026 1 source

SkillsBench Benchmark Measures How Agent Skills Boost LLM Performance Across Diverse Tasks

Artificial Intelligence #skillsbench #benchmarking

Researchers introduce SkillsBench, a benchmark with 87 tasks across 8 domains to measure whether agent skills improve LLM performance. Curated skills raised the average pass rate from 33.9% to 50.5%, with focused skills of at most three modules outperforming larger bundles. Smaller models with skills can match larger models without them.

Jun 16, 2026 1 source

New MBABench Evaluates LLM Agents on End-to-End Finance Spreadsheet Tasks

Artificial Intelligence #llm agents #artificial intelligence

MBABench is a new benchmark that evaluates LLM agents on end-to-end spreadsheet tasks in finance, focusing on modeling and scenario analysis. It assesses accuracy, formula use, and formatting. Claude family models lead but still fall short of professional standards.

Jun 16, 2026 1 source

AI Safety Monitors May Fail After Model Updates, New Benchmarking Study Finds

Artificial Intelligence #ai safety #model monitoring

A new research paper presents the first systematic test of whether activation monitors remain reliable after common model updates such as quantization and fine-tuning. The study finds that while quantization largely preserves performance, fine-tuning frequently makes monitors stale, with privacy monitors most affected. Degradation is predictable, enabling triaged revalidation.

Jun 16, 2026 1 source

New Attack FragFuse Exploits LLM Agent Memory to Bypass Access Controls

Artificial Intelligence #ai agents #large language models

Researchers introduce FragFuse, a novel attack that bypasses access control in large language model agents by fragmenting prohibited queries across interactions and storing them in long-term memory, later reconstructing them without triggering defenses. The attack achieves an 86.3% average bypass success rate across multiple agent settings and exposes a critical vulnerability in memory-based AI systems.

Jun 16, 2026 4 sources

LabOSBench: New Benchmark Tests AI Agents on Complex Scientific Instrument Control

Artificial Intelligence #labosbench #benchmarking

LabOSBench is a new benchmark designed to evaluate computer-use agents on scientific instrument control. It features 96 subtasks across eight simulated instruments, testing agents on sample loading, alignment, parameter tuning, data acquisition, and result inspection. Early results show that while agents handle structured GUI tasks well, they struggle with feedback-driven operations and long-horizon workflows.

Jun 16, 2026 1 source

CoffeeBench: New Benchmark Evaluates LLM Agents in Multi-Agent Economic Simulations

Artificial Intelligence #llm #agents

Researchers introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy. The 90-day simulation features farmers, roasters, and retailers, with models controlling one roaster. All models outperformed a passive baseline, but Claude Haiku 4.5 showed an idle-drift failure mode.

Jun 16, 2026 1 source

UrbanWell Benchmark Puts Multimodal LLMs to Test on Spatio-Temporal Urban Wellbeing Analytics

Artificial Intelligence #multimodal #large language models

Researchers introduce UrbanWell, a large-scale benchmark for evaluating multimodal large language models on spatio-temporal urban wellbeing analytics. The benchmark covers 38 cities, multiple years, and diverse indicators including environment, accessibility, urban form, vitality, and subjective perception. Testing 15 state-of-the-art MLLMs in zero-shot settings reveals substantial performance variations across heterogeneous indicators.

Jun 16, 2026 1 source

RetailBench Benchmark Tests LLM Agents on Long-Horizon Retail Decisions

Artificial Intelligence #retailbench #llm

Researchers introduced RetailBench, a simulation benchmark for evaluating LLM agents in single-store supermarket management over 180 days. In tests on seven models, only a subset completed the full horizon, and even the best fell far behind an oracle policy because of incomplete evidence acquisition and the lack of a consistent strategy.

Jun 16, 2026 2 sources

ToolMenuBench: New Benchmark Evaluates Tool-Menu Filtering for Reliable and Efficient LLM Agents

Artificial Intelligence #toolmenubench #benchmarking

ToolMenuBench is a new benchmark that evaluates how tool-menu filtering strategies affect LLM agent reliability and efficiency. In tests across seven model backends, causal minimal tool filtering improved task success from 32.1% to 85.7% while reducing token usage by roughly 98%.

Jun 16, 2026 2 sources