ToolMenuBench: New Benchmark Evaluates Tool-Menu Filtering for Reliable and Efficient LLM Agents

ToolMenuBench, a new benchmark introduced in an arXiv preprint, evaluates how tool-menu filtering strategies affect LLM agent reliability and efficiency. In tests across seven model backends, causal minimal tool filtering improved task success from 32.1% to 85.7% while reducing token usage by roughly 98%.

iGEN Editorial

June 16, 2026

Enterprises deploying large language models as autonomous agents face a critical challenge: as tool libraries grow, presenting every available tool overwhelms both the model and the user. A new benchmark called ToolMenuBench, introduced in a preprint on arXiv, systematically evaluates how different strategies for filtering which tools are visible to an LLM agent impact reliability, efficiency, and safety.

According to the paper, causal minimal tool filtering achieves the strongest overall tradeoff, reducing visible tools, wrong-tool calls, premature actions, and risky-tool exposure relative to unfiltered exposure, lexical filtering, state-aware filtering, and broader causal-path baselines.

The Problem: Tool-Menu Overload in Multi-Step Agents

Tool-augmented LLM agents must decide which tool to call at each step of a multi-step task. Existing benchmarks focus on whether a model can call a tool correctly, but not on how the visible tool menu itself shapes agent behaviour. ToolMenuBench addresses this gap by providing a framework to study the agent-interface problem: which tools should be visible, when they should be visible, and under what cost or risk constraints.
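To make the interface question concrete, here is a minimal sketch of two tool-menu filters. It assumes a hypothetical `Tool` record with a description, a set of required state facts, and a risk flag. None of these names or the toy tool list come from the paper, and the two functions only illustrate the general ideas of lexical and state-aware filtering, not the benchmark's own implementations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tool:
    # Hypothetical tool record; field names are illustrative, not from the paper.
    name: str
    description: str
    requires: frozenset = field(default_factory=frozenset)  # facts needed before the call
    risky: bool = False


def lexical_filter(tools, query):
    """Keep tools whose description shares at least one word with the query."""
    query_words = set(query.lower().split())
    return [t for t in tools if query_words & set(t.description.lower().split())]


def state_aware_filter(tools, state):
    """Keep tools whose required facts are all present in the current state."""
    return [t for t in tools if t.requires <= state]


# Toy tool library, invented for this example.
TOOLS = [
    Tool("lookup_order", "find an order by id"),
    Tool("issue_refund", "refund an order", frozenset({"order_found"}), risky=True),
    Tool("send_email", "send a message to a customer"),
]

menu_lexical = lexical_filter(TOOLS, "refund my order")
menu_state = state_aware_filter(TOOLS, state=frozenset())

print([t.name for t in menu_lexical])  # ['lookup_order', 'issue_refund']
print([t.name for t in menu_state])    # ['lookup_order', 'send_email']
```

In this toy case the lexical filter still shows the risky `issue_refund` tool before any order has been looked up, while the state-aware filter hides it until `order_found` is part of the state. That difference in when a tool becomes visible is the kind of effect the benchmark is designed to measure.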

Key Metrics Measured

The benchmark reports both filter-level and downstream agent metrics:

Visible-tool count – number of tools presented to the agent
Risky-tool exposure – whether unsafe or undesirable tools are visible
Task success – whether the agent completes the assigned task
Wrong-tool calls – number of times the agent selects an inappropriate tool
Premature actions – actions taken before sufficient information is gathered
Token usage – total tokens consumed during agent operation
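As a rough sketch of how these six metrics could be computed for a single episode, the code below uses a hypothetical trace format in which each step records the visible tools, the tool called, the tool a reference solution would call, and the tokens used. The `Step` schema and the working definition of a premature action are assumptions made for illustration; the article does not describe the benchmark's actual logging format or scoring rules.

```python
from dataclasses import dataclass


@dataclass
class Step:
    # One agent step in a hypothetical trace format (not the benchmark's schema).
    visible_tools: list  # tool names shown to the agent at this step
    called_tool: str     # tool the agent actually called
    expected_tool: str   # tool a reference solution would call here
    tokens: int          # prompt plus completion tokens for this step


def episode_metrics(steps, risky_tools, succeeded):
    """Aggregate the six metrics over one non-empty episode."""
    premature = 0
    for i, s in enumerate(steps):
        later_expected = {t.expected_tool for t in steps[i + 1:]}
        # Assumption: a call is "premature" if it is wrong now
        # but is the expected tool at some later step.
        if s.called_tool != s.expected_tool and s.called_tool in later_expected:
            premature += 1
    return {
        "avg_visible_tools": sum(len(s.visible_tools) for s in steps) / len(steps),
        "risky_exposure_steps": sum(1 for s in steps if risky_tools & set(s.visible_tools)),
        "task_success": succeeded,
        "wrong_tool_calls": sum(1 for s in steps if s.called_tool != s.expected_tool),
        "premature_actions": premature,
        "total_tokens": sum(s.tokens for s in steps),
    }


# Invented three-step episode with an unfiltered menu of three tools.
menu = ["lookup_order", "issue_refund", "send_email"]
steps = [
    Step(menu, "issue_refund", "lookup_order", 900),  # refund attempted too early
    Step(menu, "lookup_order", "lookup_order", 850),
    Step(menu, "issue_refund", "issue_refund", 870),
]
metrics = episode_metrics(steps, risky_tools={"issue_refund"}, succeeded=True)
print(metrics)
```

For this invented episode the sketch counts one wrong-tool call, which is also a premature action, three steps with a risky tool visible, and 2,620 total tokens.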

Evaluation Setup

ToolMenuBench was evaluated across multiple dimensions:

Parameter	Options
Model backends	7 (including various LLMs)
Tool-menu sizes	3 (small, medium, large)
Filtering methods	6 (e.g., unfiltered, lexical, state-aware, causal minimal)
Evaluation settings	7 (varying distractor type, state-dependent task structure, risk exposure)
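To give a sense of scale, the sketch below enumerates the cross-product of the four dimensions in the table. The counts are taken from the article, but the labels are placeholders, and the article does not say whether every combination was actually run, so the total is an upper bound under a full-factorial assumption.

```python
from itertools import product

# Counts come from the table above; labels are placeholders because the
# article does not name every backend, filtering method, or setting.
backends = [f"model_{i}" for i in range(7)]
menu_sizes = ["small", "medium", "large"]
methods = [f"filter_{i}" for i in range(6)]
settings = [f"setting_{i}" for i in range(7)]

# Full-factorial grid: one configuration per combination of the four dimensions.
grid = list(product(backends, menu_sizes, methods, settings))
print(len(grid))  # 7 * 3 * 6 * 7 = 882
```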

Dramatic Improvements with Causal Minimal Filtering

The paper reports that Causal Minimal Tool Filtering (CMTF) substantially outperformed other methods. Compared to exposing all tools, CMTF improved task success from 32.1% to 85.7%, a gain of 53.6 percentage points, while reducing average token usage by roughly 98%.
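The headline figures can be unpacked with a few lines of arithmetic. The inputs below are the numbers quoted in the article; the derived values are simple consequences of them, and the token figure is only approximate because the article gives the reduction as "roughly 98%".

```python
baseline_success = 32.1  # percent, unfiltered menu (from the article)
cmtf_success = 85.7      # percent, CMTF (from the article)
token_reduction = 0.98   # "roughly 98%" (from the article)

gain_points = cmtf_success - baseline_success    # absolute gain in percentage points
relative_gain = cmtf_success / baseline_success  # ratio of the two success rates
tokens_remaining = 1 - token_reduction           # fraction of tokens still used
shrink_factor = 1 / tokens_remaining             # how many times fewer tokens

print(f"{gain_points:.1f} points, {relative_gain:.2f}x success, ~{shrink_factor:.0f}x fewer tokens")
# prints: 53.6 points, 2.67x success, ~50x fewer tokens
```

In other words, a 98% reduction means the filtered agent uses about one fiftieth of the tokens of the unfiltered one, while succeeding roughly 2.7 times as often.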

Beyond these headline numbers, CMTF reduced:

Visible tools (fewer irrelevant options)
Wrong-tool calls (more accurate selections)
Premature actions (better sequencing of steps)
Risky-tool exposure (improved safety)

The benchmark also tested alternative approaches such as lexical filtering, state-aware filtering, and broader causal-path baselines. None matched the overall tradeoff achieved by CMTF.

Implications for Enterprise AI

For enterprise technology leaders evaluating LLM agents for automation — whether in customer support, code generation, or supply chain logistics — ToolMenuBench highlights the importance of intelligent tool-menu design. Simply making all tools available degrades performance and inflates costs. A focused, causally informed filtering strategy can dramatically improve both success rates and operational efficiency.

The authors, Babu, Rahul Suresh, and Laxmipriya Ganesh Iyer, have released ToolMenuBench as a reusable evaluation framework. Enterprises building custom agent systems can adopt similar metrics to benchmark their own tool-menu strategies, especially in domains with large, heterogeneous tool libraries.

As agent-based systems become more common in production, benchmarks like ToolMenuBench provide a systematic way to balance capability with cost and safety — a pressing need for any organisation deploying AI at scale.
