iGEN

Topic

evaluation

4 stories
Risk-Aware LLM Agents for Geospatial Data Retrieval: New Framework Passes Adversarial Tests

Technology | Artificial Intelligence | #llm #geospatial

Researchers present a risk-aware LLM agent framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. The system integrates Guardrail, General-QA, and Recommender-Analyst agents to convert user intent into structured API calls. Preliminary adversarial evaluation shows prompt-level safety instructions improve robustness, though rare high-impact failures persist.

Jun 16, 2026 1 source
New OSGuard Benchmark Evaluates Safety of Computer-Use Agents for Enterprise AI Deployment

Technology | Artificial Intelligence | #ai safety #benchmark

Researchers introduce OSGuard, a benchmark suite for evaluating safety in computer-use agents. It includes action-level guardrail decisions and a risk-augmented execution suite that detects unsafe completions which still satisfy the nominal task objective. Early tests show that current multimodal guardrails perform well on isolated action judgments but leave gaps in end-to-end safety.

Jun 16, 2026 1 source
RecourseBench: Modular Framework Promises Reproducible Evaluation of AI Recourse Methods

Technology | Artificial Intelligence | #algorithmic recourse #machine learning

A new framework called RecourseBench aims to standardize and validate algorithmic recourse methods, which generate counterfactual explanations that show individuals how to reverse an AI's decision. It decomposes the evaluation pipeline into five decoupled layers and integrates 28 state-of-the-art methods, with automated tests to verify reproducibility.

Jun 16, 2026 1 source
Metric Match: New Subset Selection Method Improves LLM Judge Reliability Evaluation, Cuts Annotation Costs by 32.5%

Technology | Artificial Intelligence | #llm #judge

Researchers developed Metric Match, a subset selection method that reduces costly human annotations needed to evaluate LLM judge reliability. The approach achieves a 0.838 win-rate over random selection, cuts estimation error by 18.7%, and reduces annotation needs by 32.5%. A medical case study showed $1,041.67 in savings.

Jun 16, 2026 1 source