Enterprises deploying large language models as autonomous agents face a critical challenge: as tool libraries grow, presenting every available tool overwhelms both the model and the user. A new benchmark called ToolMenuBench, introduced in a preprint on arXiv, systematically evaluates how different strategies for filtering which tools are visible to an LLM agent impact reliability, efficiency, and safety.
According to the paper, causal minimal tool filtering achieves the strongest overall tradeoff, reducing visible tools, wrong-tool calls, premature actions, and risky-tool exposure relative to unfiltered exposure, lexical filtering, state-aware filtering, and broader causal-path baselines.
The Problem: Tool-Menu Overload in Multi-Step Agents
Tool-augmented LLM agents must decide which tool to call at each step of a multi-step task. Existing benchmarks focus on whether a model can call a tool correctly, but not on how the visible tool menu itself shapes agent behaviour. ToolMenuBench addresses this gap by providing a framework to study the agent-interface problem: which tools should be visible, when they should be visible, and under what cost or risk constraints.
Key Metrics Measured
The benchmark reports both filter-level and downstream agent metrics:
- Visible-tool count – number of tools presented to the agent
- Risky-tool exposure – whether unsafe or undesirable tools are visible
- Task success – whether the agent completes the assigned task
- Wrong-tool calls – number of times the agent selects an inappropriate tool
- Premature actions – actions taken before sufficient information is gathered
- Token usage – total tokens consumed during agent operation
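To make these definitions concrete, the following is a minimal sketch of how the six metrics might be computed from a logged agent episode. The `Step` and `Episode` records, their field names, and the sample tool names are hypothetical and chosen for illustration; the article does not describe ToolMenuBench's actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One agent step (hypothetical schema, not the benchmark's own)."""
    visible_tools: list[str]    # menu shown to the agent at this step
    called_tool: str            # tool the agent actually called
    expected_tool: str          # tool a reference solution would call here
    prerequisites_met: bool     # False if the call came before needed information was gathered
    tokens: int                 # tokens consumed by this step


@dataclass
class Episode:
    steps: list[Step]
    succeeded: bool
    risky_tools: set[str] = field(default_factory=set)


def summarize(episode: Episode) -> dict:
    """Aggregate the six metrics listed above for a single episode."""
    n = len(episode.steps)
    return {
        "mean_visible_tools": sum(len(s.visible_tools) for s in episode.steps) / n if n else 0.0,
        "risky_tool_exposed": any(episode.risky_tools & set(s.visible_tools) for s in episode.steps),
        "task_success": episode.succeeded,
        "wrong_tool_calls": sum(s.called_tool != s.expected_tool for s in episode.steps),
        "premature_actions": sum(not s.prerequisites_met for s in episode.steps),
        "total_tokens": sum(s.tokens for s in episode.steps),
    }


# Illustrative episode: the agent tries to refund before looking anything up.
menu = ["search", "refund", "delete_account"]
episode = Episode(
    steps=[
        Step(menu, "refund", "search", False, 420),
        Step(menu, "search", "search", True, 380),
        Step(menu, "refund", "refund", True, 400),
    ],
    succeeded=True,
    risky_tools={"delete_account"},
)
print(summarize(episode))
```

In this toy episode the first call counts as both a wrong-tool call and a premature action, and the risky `delete_account` tool is exposed at every step even though it is never called.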
Evaluation Setup
The evaluation spans four dimensions:
| Parameter | Options |
|---|---|
| Model backends | 7 (including various LLMs) |
| Tool-menu sizes | 3 (small, medium, large) |
| Filtering methods | 6 (e.g., unfiltered, lexical, state-aware, causal minimal) |
| Evaluation settings | 7 (varying distractor type, state-dependent task structure, risk exposure) |
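As a rough sense of scale, the sketch below enumerates the experimental grid these counts imply. It assumes a full cross-product of all four dimensions, which the article does not confirm. Apart from the size and filter labels named in the table, every label is a placeholder.

```python
from itertools import product

# Counts come from the table; labels are placeholders unless the table names them.
model_backends = [f"model_{i}" for i in range(1, 8)]    # 7 backends (names not given)
menu_sizes = ["small", "medium", "large"]               # 3 sizes
filters = ["unfiltered", "lexical", "state_aware",
           "causal_minimal", "filter_5", "filter_6"]    # 6 methods, 4 named in the table
settings = [f"setting_{i}" for i in range(1, 8)]        # 7 settings (names not given)

# Assumption: every combination is run; the paper may evaluate only a subset.
grid = list(product(model_backends, menu_sizes, filters, settings))
print(len(grid))  # 7 * 3 * 6 * 7 = 882
```

Under that assumption, each filtering method would be exercised in 147 of the 882 configurations.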
Dramatic Improvements with Causal Minimal Filtering
The paper reports that Causal Minimal Tool Filtering (CMTF) substantially outperformed the other methods. Compared with exposing all tools, CMTF improved task success from 32.1% to 85.7%, a gain of 53.6 percentage points. Average token usage fell by roughly 98% at the same time.
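The headline figures can be checked with a few lines of arithmetic. The success rates and the 98% reduction come from the article; the 10,000-token baseline is a hypothetical number used only to show what a 98% reduction would mean in practice.

```python
baseline_success = 32.1   # % task success with all tools exposed (from the article)
cmtf_success = 85.7       # % task success with CMTF (from the article)

point_gain = round(cmtf_success - baseline_success, 1)      # absolute gain in percentage points
relative_gain = round(cmtf_success / baseline_success, 2)   # multiplicative improvement

token_reduction = 0.98                 # "roughly 98%" (from the article)
hypothetical_baseline_tokens = 10_000  # assumed for illustration only
tokens_after = round(hypothetical_baseline_tokens * (1 - token_reduction))

print(point_gain)      # 53.6
print(relative_gain)   # 2.67
print(tokens_after)    # 200
```

Read this way, the reported success rate is about 2.67 times the unfiltered baseline, and a 98% token reduction corresponds to roughly a fifty-fold drop in tokens per episode.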
Beyond these headline numbers, CMTF reduced:
- Visible tools (fewer irrelevant options)
- Wrong-tool calls (more accurate selections)
- Premature actions (better sequencing of steps)
- Risky-tool exposure (improved safety)
The benchmark also tested alternative approaches such as lexical filtering, state-aware filtering, and broader causal-path baselines. None matched the overall tradeoff achieved by CMTF.
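The differences between these filter families can be illustrated with a toy sketch. This is not the paper's CMTF algorithm, which the article does not describe. It assumes a hypothetical tool model in which each tool declares the facts it `requires` and `provides`, and it approximates a "minimal" filter by backward chaining from the goal. All tool names and descriptions are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    requires: frozenset[str]   # facts that must hold before the call makes sense
    provides: frozenset[str]   # facts the call establishes


def lexical_filter(tools, query):
    """Keep tools whose description shares at least one word with the query."""
    words = set(query.lower().split())
    return [t for t in tools if words & set(t.description.lower().split())]


def state_aware_filter(tools, state):
    """Keep tools whose requirements are already satisfied by the current state."""
    return [t for t in tools if t.requires <= state]


def minimal_filter(tools, state, goal):
    """Keep enabled tools that supply a still-missing fact on a path to the goal."""
    needed = set(goal)
    changed = True
    while changed:  # backward-chain from the goal through tool requirements
        changed = False
        for t in tools:
            if t.provides & (needed - state) and not t.requires <= needed:
                needed |= t.requires
                changed = True
    missing = needed - state
    return [t for t in state_aware_filter(tools, state) if t.provides & missing]


tools = [
    Tool("lookup_order", "look up an order by id", frozenset(), frozenset({"order_details"})),
    Tool("issue_refund", "issue a refund for an order",
         frozenset({"order_details"}), frozenset({"refund_issued"})),
    Tool("delete_account", "delete a customer account", frozenset(), frozenset({"account_deleted"})),
    Tool("send_survey", "send a satisfaction survey", frozenset(), frozenset({"survey_sent"})),
]
query, goal = "refund my order", {"refund_issued"}

print([t.name for t in lexical_filter(tools, query)])            # ['lookup_order', 'issue_refund']
print([t.name for t in state_aware_filter(tools, set())])        # ['lookup_order', 'delete_account', 'send_survey']
print([t.name for t in minimal_filter(tools, set(), goal)])      # ['lookup_order']
print([t.name for t in minimal_filter(tools, {"order_details"}, goal)])  # ['issue_refund']
```

In this toy setup the lexical filter exposes `issue_refund` before the order has been looked up, the state-aware filter still exposes the risky `delete_account` tool, and the goal-directed filter shows a single tool at each step. That mirrors the kinds of failure the benchmark measures, though the real methods may work quite differently.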
Implications for Enterprise AI
For enterprise technology leaders evaluating LLM agents for automation, whether in customer support, code generation, or supply chain logistics, ToolMenuBench highlights the importance of deliberate tool-menu design. In the benchmark, simply making all tools available degraded task success and inflated token costs, while a focused, causally informed filtering strategy improved both success rates and operational efficiency.
The authors, Babu, Rahul Suresh, and Laxmipriya Ganesh Iyer, have released ToolMenuBench as a reusable evaluation framework. Enterprises building custom agent systems can adopt similar metrics to benchmark their own tool-menu strategies, especially in domains with large, heterogeneous tool libraries.
As agent-based systems become more common in production, benchmarks like ToolMenuBench provide a systematic way to balance capability with cost and safety, a pressing need for any organisation deploying AI at scale.