CoffeeBench: New Benchmark Evaluates LLM Agents in Multi-Agent Economic Simulations

Researchers introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy. The 90-day simulation features farmers, roasters, and retailers, with the evaluated model controlling one of the roasters. All models outperformed a passive baseline, but Claude Haiku 4.5 showed an idle-drift failure mode.

iGEN Editorial

June 16, 2026

Evaluating large language model (LLM) agents in economic systems poses challenges that existing benchmarks do not address. Most current evaluations test a single agent interacting with a passive environment, but real-world economies are multi-agent: autonomous agents must communicate, negotiate, and transact over extended periods. To fill this gap, a research team including Issa Sugiura, Daichi Hattori, Kazuo Araragi, and others introduced CoffeeBench, a benchmark that assesses LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms, according to the paper published on arXiv.

How CoffeeBench Works

CoffeeBench simulates a coffee supply chain over 90 days. The simulation includes two farmers, two roasters, and two retailers, each operating autonomously. Each firm aims to maximize its cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated LLM controls one coffee roaster, while the remaining five firms are controlled by fixed reference agents. This setup tests an agent's ability to sustain economic interaction, including negotiation and strategic planning.
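To make the setup concrete, here is a minimal Python sketch of the structure described above: six firms, one evaluated roaster, five reference-controlled firms, and a 90-day loop that accumulates net income. It is an illustration only, not the benchmark's released code. The names `Firm`, `build_economy`, `run_simulation`, and `toy_step` are hypothetical, and `toy_step` is a placeholder that stands in for the real market dynamics, which the article does not describe.

```python
from dataclasses import dataclass, field
from typing import Callable, List

DAYS = 90  # simulation length stated in the article


@dataclass
class Firm:
    name: str
    role: str        # "farmer", "roaster", or "retailer"
    controller: str  # "evaluated" (the LLM under test) or "reference"
    daily_net_income: List[float] = field(default_factory=list)

    def cumulative_net_income(self) -> float:
        # The objective each firm tries to maximize over the run.
        return sum(self.daily_net_income)


def build_economy() -> List[Firm]:
    """Two firms per role; exactly one roaster is the evaluated agent."""
    firms = []
    for role in ("farmer", "roaster", "retailer"):
        for index in (1, 2):
            is_evaluated = role == "roaster" and index == 1
            controller = "evaluated" if is_evaluated else "reference"
            firms.append(Firm(f"{role}_{index}", role, controller))
    return firms


def run_simulation(
    firms: List[Firm],
    step: Callable[[Firm, int], float],
    days: int = DAYS,
) -> List[Firm]:
    """Advance every firm one day at a time, recording daily net income."""
    for day in range(1, days + 1):
        for firm in firms:
            firm.daily_net_income.append(step(firm, day))
    return firms


def toy_step(firm: Firm, day: int) -> float:
    # Placeholder economics (an assumption, not CoffeeBench's rules):
    # a flat daily amount so the loop has something to accumulate.
    return 1.0 if firm.controller == "evaluated" else 0.5


firms = run_simulation(build_economy(), toy_step)
evaluated = next(f for f in firms if f.controller == "evaluated")
print(evaluated.name, evaluated.cumulative_net_income())
```

In a real harness, `step` would be replaced by the agents' negotiation, purchasing, and pricing decisions; the sketch only shows who controls which firm and what quantity is being scored.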

Key Findings from the Evaluation

The researchers tested several recent open-weight and proprietary LLMs. According to the paper, all models outperformed a passive baseline that takes no actions, with most achieving positive net income. Analysis of agent behavior revealed substantial differences in long-horizon economic interaction. Higher-performing models communicated more actively with other firms. In contrast, Claude Haiku 4.5 exhibited an "idle-drift failure mode," repeatedly choosing inaction despite producing coherent assessments and plans. This finding highlights a critical gap between an agent's reasoning capabilities and its ability to execute economically productive actions.
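One way to look for this kind of pattern in an agent's trajectory is to measure how often, and for how long in a row, the agent chooses to do nothing. The sketch below assumes a hypothetical log format, a list with one action label per day, and a made-up `"wait"` label for inaction. The article does not say how the paper quantified idle drift, so this is an illustrative diagnostic rather than the authors' metric.

```python
from typing import List

NO_OP = "wait"  # hypothetical label for a day on which the agent took no action


def idle_rate(actions: List[str]) -> float:
    """Fraction of days on which the agent chose inaction."""
    if not actions:
        return 0.0
    return sum(action == NO_OP for action in actions) / len(actions)


def longest_idle_streak(actions: List[str]) -> int:
    """Length of the longest run of consecutive no-op days."""
    longest = current = 0
    for action in actions:
        current = current + 1 if action == NO_OP else 0
        longest = max(longest, current)
    return longest


# Made-up eight-day trajectory, for illustration only.
example = ["buy", "wait", "wait", "sell", "wait", "wait", "wait", "message"]
print(idle_rate(example))            # 0.625
print(longest_idle_streak(example))  # 3
```

A high idle rate paired with long streaks would be consistent with the behavior described above, although it says nothing about whether the agent's stated plans were coherent.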

Implications for Enterprise AI

While CoffeeBench is a research benchmark, its methodology is relevant to enterprise technology leaders evaluating AI for supply chain and logistics automation. Demand is growing for agents that can autonomously handle procurement, pricing, and inventory management over long horizons. The benchmark provides a structured way to compare models on these capabilities, and its results suggest that active communication and consistent execution matter alongside raw reasoning ability. The researchers have released the code and agent trajectories to support future research, enabling organizations to test their own models against the CoffeeBench environment.