Evaluating large language model (LLM) agents in economic systems presents challenges that existing benchmarks do not address. Most current evaluations test a single agent interacting with a passive environment, but real-world economies are multi-agent: autonomous agents must communicate, negotiate, and transact over extended periods. To fill this gap, a team of researchers including Issa Sugiura, Daichi Hattori, Kazuo Araragi, and others introduced CoffeeBench, a benchmark that assesses LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms, according to the paper published on arXiv.
How CoffeeBench Works
CoffeeBench simulates a coffee supply chain over 90 days. The simulation includes two farmers, two roasters, and two retailers, each operating autonomously. Each firm aims to maximize its cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated LLM controls one coffee roaster, while the remaining five firms are controlled by fixed reference agents. This setup tests an agent's ability to sustain economic interaction, including negotiation and strategic planning, over the full horizon.
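To make the setup concrete, the structure described above can be sketched in a few lines of Python. This is an illustration only, not the CoffeeBench code: the names `Firm`, `build_economy`, `run`, and the `daily_income` callback are hypothetical, and the real environment also models messaging, cash, inventory, and pricing, which are omitted here.

```python
from dataclasses import dataclass, field

SIM_DAYS = 90  # from the article: the simulation runs for 90 days


@dataclass
class Firm:
    name: str
    role: str        # "farmer", "roaster", or "retailer"
    controller: str  # "evaluated" (the LLM under test) or "reference"
    net_income_by_day: list = field(default_factory=list)

    def cumulative_net_income(self) -> float:
        # The objective each firm tries to maximize.
        return sum(self.net_income_by_day)


def build_economy() -> list:
    """Two firms per role; the evaluated model controls one roaster."""
    firms = []
    for role in ("farmer", "roaster", "retailer"):
        for i in (1, 2):
            firms.append(Firm(name=f"{role}_{i}", role=role, controller="reference"))
    for firm in firms:
        if firm.name == "roaster_1":  # which roaster is evaluated is an assumption
            firm.controller = "evaluated"
    return firms


def run(firms, daily_income):
    """Step every firm through each day and return cumulative net income.

    `daily_income(firm, day)` is a hypothetical stand-in for whatever the
    firm earns that day after its actions and trades are resolved.
    """
    for day in range(SIM_DAYS):
        for firm in firms:
            firm.net_income_by_day.append(daily_income(firm, day))
    return {firm.name: firm.cumulative_net_income() for firm in firms}
```

Calling `run(build_economy(), lambda firm, day: 0.0)` would correspond loosely to an economy in which no firm earns anything, which is one way to picture a do-nothing reference point.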
Key Findings from the Evaluation
The researchers tested several recent open-weight and proprietary LLMs. According to the paper, all models outperformed a passive baseline that takes no actions, with most achieving positive net income. Analysis of agent behavior revealed substantial differences in long-horizon economic interaction. Higher-performing models communicated more actively with other firms. In contrast, Claude Haiku 4.5 exhibited an "idle-drift failure mode," repeatedly choosing inaction despite producing coherent assessments and plans. This finding highlights a critical gap between an agent's reasoning capabilities and its ability to execute economically productive actions.
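One simple way to picture how such inaction might be spotted in an agent's trajectory is to count how often, and for how long in a row, the agent chooses to do nothing. The sketch below is an assumption for illustration: the article does not say how the paper measures idle drift, and the function name `idle_stats` and the `"wait"` action label are hypothetical.

```python
def idle_stats(actions, idle_label="wait"):
    """Return (share of idle actions, longest consecutive idle streak).

    `actions` is a list of action names, one per decision step.
    `idle_label` is a hypothetical name for the do-nothing action.
    """
    idle_count = 0
    longest_streak = 0
    current_streak = 0
    for action in actions:
        if action == idle_label:
            idle_count += 1
            current_streak += 1
            longest_streak = max(longest_streak, current_streak)
        else:
            current_streak = 0
    share = idle_count / len(actions) if actions else 0.0
    return share, longest_streak


# Example with a made-up five-step trajectory:
share, streak = idle_stats(["message", "wait", "wait", "buy", "wait"])
print(share, streak)  # 0.6 2
```

A high idle share combined with long streaks, alongside coherent written plans, would match the pattern the article describes, though the thresholds that count as "drift" are left open here.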
Implications for Enterprise AI
While CoffeeBench is a research benchmark, its methodology is relevant for enterprise technology leaders evaluating AI for supply chain and logistics automation. The need for agents that can autonomously handle procurement, pricing, and inventory management over long horizons is growing. The benchmark provides a structured way to compare models on these capabilities, and its results suggest that active communication and consistent execution matter alongside raw reasoning ability. The researchers have released the code and agent trajectories to support future research, which allows organizations to test their own models in the CoffeeBench environment.