LLM Agents Look at Correct Tools but Still Pick Wrong, Research Reveals Readout as Failure Point

Research by Shiyang Chen finds that LLM agents mis-call tools not because they fail to see the right tool, but because the decision readout fails. On real BFCL failures, the model attends most to the correct tool 80% of the time, yet still picks the wrong one. Readout-side interventions recover 59-91% of failures, while input-side fixes recover at most 23%.

iGEN Editorial

June 16, 2026

LLM agents frequently mis-call tools, a problem often attributed to a "crowded harness": the agent is overwhelmed by too many options and loses track of the right one. New research published on arXiv by Shiyang Chen challenges that assumption. The study, titled "Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents," argues that the failure lies not in the agent's ability to see the right tool, but in the later step of actually picking it.

The paper measures how much attention the model pays to each labeled tool-definition segment of the prompt. On real BFCL (Berkeley Function Calling Leaderboard) failures, the correct tool is the most-attended segment 80% of the time, against a 21% chance baseline, and it is the under-attended segment on only 10% of failures. According to the paper, this directly refutes the intuitive "crowded-harness / lost-in-the-middle" explanation that has been a natural guess in the field.
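To make the diagnostic concrete, here is a minimal sketch of a per-segment attention check. It is an illustration only, not the paper's code: the function names, the toy attention weights, the tool names, and the choice of mean attention per token as the segment score are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical helper: spans are [start, end) token indices of each tool definition.
def segment_attention(attn: List[float],
                      segments: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Mean attention weight per labeled segment (assumed aggregation)."""
    return {name: sum(attn[s:e]) / (e - s) for name, (s, e) in segments.items()}

def top_attended(attn: List[float],
                 segments: Dict[str, Tuple[int, int]]) -> str:
    """Name of the segment with the highest mean attention."""
    scores = segment_attention(attn, segments)
    return max(scores, key=scores.get)

# Toy data (invented for illustration): attention from the decision position
# over ten prompt tokens covering three tool definitions.
attn = [0.02, 0.03, 0.05, 0.10, 0.30, 0.25, 0.05, 0.05, 0.10, 0.05]
segments = {"get_weather": (0, 3), "get_forecast": (3, 6), "send_email": (6, 10)}

gold, picked = "get_forecast", "get_weather"
looked = top_attended(attn, segments)

# "Looking is not picking": the gold tool is most attended, yet the call is wrong.
looked_but_missed = (looked == gold) and (picked != gold)
print(looked, looked_but_missed)
```

Counting how often `looked_but_missed` holds across a set of failed calls gives the kind of rate the article reports, under whatever aggregation one chooses.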

The Attention Paradox

The core finding is that the model looks at the right tool but still picks the wrong one. The bottleneck is localized to the decision readout, not the input harness. The paper supports this with three separate lines of evidence.

1. Input vs. Readout Interventions

Repairing the prompt by reordering or duplicating the gold tool recovers at most 23% of failures. Readout-side interventions, by contrast, recover 59-91% of failures, which indicates that the problem is not how the tools are presented but how the model decides among them.

2. Representation-Invariance

Two gold-pointed interventions that act on different internal representations, an additive attention-logit bias and a residual-stream steering vector, recover largely the same failures. According to the paper, the Jaccard similarity between the sets of tasks each one recovers is 0.865 pooled, with a range of 0.79-0.91 per model. This overlap supports the claim that the bottleneck sits in the readout, regardless of which representation is manipulated.
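For readers unfamiliar with the overlap measure, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to two sets of recovered task IDs; the IDs are invented for illustration, and returning 1.0 for two empty sets is a convention assumed here.

```python
def jaccard(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|; two empty sets are treated as identical (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical task IDs recovered by each intervention.
recovered_by_bias = {"t01", "t02", "t03", "t05", "t08"}
recovered_by_steering = {"t01", "t02", "t03", "t05", "t09"}

overlap = jaccard(recovered_by_bias, recovered_by_steering)
print(round(overlap, 3))  # 4 shared tasks out of 6 distinct ones
```

A value near 1 means the two interventions fix almost exactly the same failures, which is the pattern the reported 0.865 points to.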

3. A Training-Free, Gold-Free Selector

The paper also introduces a training-free, gold-free selector based on per-segment attention. It recovers most of the headroom available to an oracle: +11.9 points on pooled function-name selection on BFCL, against an oracle headroom of +17.9 points (about two thirds of it), and +14.9 points on the Seal-Tools benchmark. Every model tested improved, with exact McNemar p-values ≤ 8e-4.
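The article does not describe the selector's internals, so the following is only a guess at the general shape such a selector could take: use per-segment attention scores, with no gold label and no training, to override the model's pick when another tool is clearly more attended. The function name, the `margin` parameter, and the override rule are all hypothetical.

```python
from typing import Dict

def select_tool(model_pick: str,
                segment_scores: Dict[str, float],
                margin: float = 0.0) -> str:
    """Return the most-attended tool if it beats the model's own pick by more
    than `margin`; otherwise keep the model's pick. Uses no gold label."""
    best = max(segment_scores, key=segment_scores.get)
    pick_score = segment_scores.get(model_pick, 0.0)
    if best != model_pick and segment_scores[best] - pick_score > margin:
        return best
    return model_pick

# Toy per-segment attention scores (invented for illustration).
scores = {"get_weather": 0.20, "get_forecast": 0.50, "send_email": 0.30}

print(select_tool("get_weather", scores))              # overrides the pick
print(select_tool("get_weather", scores, margin=0.4))  # keeps the pick
```

A nonzero margin trades recovered failures against the risk of overriding calls that were already correct; where to set it would have to be tuned on held-out data.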

Intervention / Metric	Recovery / Improvement	Notes
Prompt reordering/duplication (input-side)	≤23%	Gold-prompt repair
Readout-side interventions	59-91%	Covers both attention-logit and steering vector
Attention-logit bias vs. steering vector overlap	Jaccard 0.865 pooled	0.79-0.91 per model
Gold-free selector on BFCL (function-name)	+11.9 pts vs. baseline	Oracle headroom: +17.9 pts
Gold-free selector on Seal-Tools	+14.9 pts	All models positive (p ≤ 8e-4)
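
The p-values in the last row are reported as exact McNemar tests. As a reminder of what that test computes, here is a minimal sketch: it compares two paired classifiers using only the discordant tasks (those where exactly one of them is correct) and evaluates a two-sided binomial tail at p = 0.5. The counts in the example are invented and are not taken from the paper.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: tasks the baseline got right and the selector got wrong
    c: tasks the baseline got wrong and the selector got right
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts: selector breaks 2 tasks, fixes 20.
p = mcnemar_exact(2, 20)
print(f"{p:.2e}")
```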

Scope and Limitations

The paper notes important differences in scope between its claims. The causal attention-bias dose-response effect is bidirectional and monotonic on 10 mask-honoring models ranging from 3B to 32B parameters. Across the full 0.5B-32B span, only the correlational diagnostic is available. The deployable selector is evaluated on 5 single-turn models and does not yet transfer to a multi-turn loop. The findings are therefore best supported for single-turn tool calls, and further work is needed for conversational agents.
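
To illustrate what an additive attention-logit bias does mechanically, the sketch below adds a constant to the pre-softmax logits of one segment and tracks the attention mass that segment receives. It shows only the effect on attention weights, which is monotonic in both directions by construction; the paper's dose-response result concerns downstream tool selection, which this toy does not model. The logits, the span, and the dose values are invented.

```python
import math
from typing import List, Tuple

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def segment_mass_with_bias(logits: List[float],
                           span: Tuple[int, int],
                           bias: float) -> float:
    """Attention mass on span [start, end) after adding `bias` to its logits."""
    start, end = span
    biased = [x + bias if start <= i < end else x for i, x in enumerate(logits)]
    return sum(softmax(biased)[start:end])

# Toy attention logits over six tokens; tokens 2-3 stand in for the gold tool.
logits = [0.0, 0.5, 1.0, 0.2, -0.3, 0.8]
gold_span = (2, 4)

doses = [-2.0, -1.0, 0.0, 1.0, 2.0]  # negative suppresses, positive boosts
masses = [segment_mass_with_bias(logits, gold_span, b) for b in doses]
for dose, mass in zip(doses, masses):
    print(f"bias={dose:+.1f}  gold-segment mass={mass:.3f}")
```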

Implications for Enterprise AI

For enterprise developers building LLM agents that rely on tool calling, in areas such as supply chain, logistics, or finance, this research offers a clear diagnosis of a common failure mode. Rather than simplifying the tool harness (for example, by reducing the number of tools or reordering them), engineers should focus on the decision readout. The training-free selector is a lightweight option that can improve accuracy without retraining, and in principle it applies to any LLM that exposes token-level attention. Practitioners should still validate the findings on their own multi-turn pipelines, since the current work is limited to single-turn settings.

As the paper concludes, the evidence "directly refutes the intuitive 'crowded-harness / lost-in-the-middle' explanation" and points the way to more effective interventions for tool-selection reliability in LLM agents.

