CoffeeBench: New Benchmark Evaluates LLM Agents in Multi-Agent Economic Simulations

Researchers introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy. The 90-day simulation features farmers, roasters, and retailers, with models controlling one roaster. All models outperformed a passive baseline, but Claude Haiku 4.5 showed an idle-drift failure mode.

iGEN Editorial
June 16, 2026
Evaluating large language model (LLM) agents in economic systems presents challenges that existing benchmarks do not address. Most current evaluations test a single agent interacting with a passive environment, but real-world economies are multi-agent: autonomous agents must communicate, negotiate, and transact over extended periods. To fill this gap, a research team including Sugiura, Issa, Hattori, Daichi, Araragi, Kazuo, and others introduced CoffeeBench, a benchmark that assesses LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms, according to the paper published on arXiv.

How CoffeeBench Works

CoffeeBench simulates a coffee supply chain over 90 days. The simulation includes two farmers, two roasters, and two retailers, each operating autonomously. Each firm aims to maximize its cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated LLM controls one coffee roaster, while the remaining five firms are controlled by fixed reference agents. This setup tests an agent's ability to sustain economic interaction, including negotiation and strategic planning, over the full horizon.
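The setup described above can be pictured with a small sketch. This is not the CoffeeBench code or its API: the `Firm` class, the `build_economy` and `record_day` helpers, and the field names are all hypothetical, chosen only to illustrate the roster (six firms, one evaluated roaster, five reference agents) and the scoring quantity (cumulative net income over 90 days) that the article reports.

```python
from dataclasses import dataclass

SIM_DAYS = 90  # horizon reported in the article


@dataclass
class Firm:
    """Hypothetical firm record; the real benchmark's data model may differ."""
    name: str
    role: str                 # "farmer", "roaster", or "retailer"
    controller: str           # "evaluated" (model under test) or "reference"
    net_income: float = 0.0   # cumulative net income, the scoring quantity


def build_economy() -> list[Firm]:
    """Create two firms per role, all driven by reference agents at first."""
    firms = [
        Firm(name=f"{role}_{i}", role=role, controller="reference")
        for role in ("farmer", "roaster", "retailer")
        for i in (1, 2)
    ]
    # Hand exactly one roaster to the model being evaluated.
    first_roaster = next(f for f in firms if f.role == "roaster")
    first_roaster.controller = "evaluated"
    return firms


def record_day(firm: Firm, revenue: float, costs: float) -> None:
    """Accumulate one day's net income (illustrative accounting only)."""
    firm.net_income += revenue - costs


economy = build_economy()
evaluated = [f for f in economy if f.controller == "evaluated"]
reference = [f for f in economy if f.controller == "reference"]
```

In a sketch like this, a run would loop over `SIM_DAYS`, let each controller act, call `record_day` for every firm, and report the evaluated roaster's final `net_income`.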

Key Findings from the Evaluation

The researchers tested several recent open-weight and proprietary LLMs. According to the paper, all models outperformed a passive baseline that takes no actions, with most achieving positive net income. Analysis of agent behavior revealed substantial differences in long-horizon economic interaction. Higher-performing models communicated more actively with other firms. In contrast, Claude Haiku 4.5 exhibited an "idle-drift failure mode," repeatedly choosing inaction despite producing coherent assessments and plans. This finding highlights a critical gap between an agent's reasoning capabilities and its ability to execute economically productive actions.
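One rough way to look for the kind of inactivity described above in an agent trajectory is to count days on which the agent took no action. The sketch below is an assumption for illustration only: the article does not say how the paper measures idle drift, and the `idle_stats` function and the per-day action counts are hypothetical, not taken from the released trajectories.

```python
def idle_stats(actions_per_day: list[int]) -> tuple[float, int]:
    """Return (share of days with zero actions, longest run of idle days).

    `actions_per_day[i]` is the number of actions (messages, orders,
    price changes, ...) the agent took on day i. Hypothetical format.
    """
    idle_days = 0
    longest_streak = 0
    current_streak = 0
    for count in actions_per_day:
        if count == 0:
            idle_days += 1
            current_streak += 1
            longest_streak = max(longest_streak, current_streak)
        else:
            current_streak = 0
    idle_rate = idle_days / len(actions_per_day) if actions_per_day else 0.0
    return idle_rate, longest_streak


# Made-up ten-day trace: active early, then mostly idle.
trace = [2, 1, 0, 0, 0, 1, 0, 0, 0, 0]
idle_rate, longest_streak = idle_stats(trace)
```

A passive baseline that never acts would score an idle rate of 1.0 under this measure, so a high rate with a long trailing streak would suggest an agent drifting toward that baseline even if its written plans stay coherent.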

Implications for Enterprise AI

While CoffeeBench is a research benchmark, its methodology has direct relevance for enterprise technology leaders evaluating AI for supply chain and logistics automation. The need for agents that can autonomously handle procurement, pricing, and inventory management over long horizons is growing. The benchmark provides a structured way to compare models on these capabilities, revealing that communication frequency and consistent execution are as important as raw reasoning power. The researchers have released the code and agent trajectories to support future research, enabling organizations to test their own models against the CoffeeBench environment.

