Topic
benchmarking
Artificial Intelligence #retailbench #llm
RetailBench Benchmark Tests LLM Agents on Long-Horizon Retail Decisions
Researchers introduced RetailBench, a simulation benchmark that evaluates LLM agents on managing a single supermarket over a 180-day horizon. In tests on seven models, only a subset completed the full horizon, and even the best fell far behind an oracle policy because of incomplete evidence acquisition and the lack of a consistent strategy.
Jun 16, 2026 2 sources
Artificial Intelligence #toolmenubench #benchmarking
ToolMenuBench: New Benchmark Evaluates Tool-Menu Filtering for Reliable and Efficient LLM Agents
ToolMenuBench is a new benchmark that evaluates how tool-menu filtering strategies affect the reliability and efficiency of LLM agents. In tests across seven model backends, causal minimal tool filtering raised task success from 32.1% to 85.7% while reducing token usage by roughly 98%.
Jun 16, 2026 2 sources