LLM agents frequently mis-call tools, a problem often attributed to the agent being overwhelmed by too many options—a "crowded harness." However, new research published on arXiv by Shiyang Chen turns this assumption on its head. The study, titled "Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents," demonstrates that the failure lies not in the agent's ability to see the right tool, but in its subsequent decision to pick it.
The paper measures how much attention the model pays to each labeled tool-definition segment. On real BFCL (Berkeley Function Calling Leaderboard) failures, the correct tool is the most-attended segment 80% of the time, against a 21% chance baseline, and it is the under-attended segment in only 10% of failures. According to the paper, this directly refutes the intuitive "crowded-harness / lost-in-the-middle" explanation that has been a natural guess in the field.
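The segment-level measurement can be illustrated with a minimal sketch. It assumes a single attention distribution over prompt tokens (for example, from the final position) and known token spans for each tool definition. The function names, the toy weights, and the reading of "under-attended" as "lowest attention mass" are all assumptions for illustration; the paper's actual aggregation over layers and heads is not described here.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def segment_attention(attn: List[float], segments: Dict[str, Span]) -> Dict[str, float]:
    """Sum the attention weights that fall inside each labeled segment."""
    return {name: sum(attn[start:end]) for name, (start, end) in segments.items()}


def diagnose(attn: List[float], segments: Dict[str, Span], gold: str) -> dict:
    """Report whether the gold tool is the most- or least-attended segment."""
    mass = segment_attention(attn, segments)
    ranked = sorted(mass, key=lambda name: mass[name], reverse=True)
    return {
        "top_attended": ranked[0],
        "gold_is_top": ranked[0] == gold,
        "gold_is_least": ranked[-1] == gold,  # assumed meaning of "under-attended"
        "chance": 1.0 / len(segments),        # uniform-pick baseline for this task
    }


# Hypothetical attention weights over 10 prompt tokens and three tool segments.
attn = [0.02, 0.03, 0.05, 0.20, 0.25, 0.15, 0.05, 0.10, 0.10, 0.05]
segments = {"get_weather": (0, 3), "send_email": (3, 6), "search_web": (6, 10)}

report = diagnose(attn, segments, gold="send_email")
print(report)
```

Under this reading, the chance baseline for a single task is one over the number of candidate tools. How the paper averages that baseline across tasks to reach 21% is not specified in this summary.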
The Attention Paradox
The core finding is that the model looks at the right tool but still picks the wrong one. The bottleneck is localized to the decision readout, not the input harness. The paper supports this with three separate lines of evidence.
1. Input vs. Readout Interventions
Repairing the prompt—by reordering or duplicating the gold tool—recovers at most 23% of failures. In stark contrast, readout-side interventions recover 59-91% of failures, highlighting that the problem is not with how the tools are presented but with how the model decides among them.
2. Representation-Invariance
Two gold-pointed interventions that act on different representations, an additive attention-logit bias and a residual-stream steering vector, recover largely the same failures. According to the paper, the per-task Jaccard similarity of the recovered failures is 0.865 pooled and ranges from 0.79 to 0.91 per model. This overlap supports the claim that the bottleneck sits at the readout, independent of which representation is manipulated.
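For readers unfamiliar with the overlap metric, the following sketch computes the Jaccard similarity between the sets of failures recovered by two interventions. The task IDs and model names are hypothetical, and treating "pooled" as a single computation over (model, task) pairs is an assumption, not a detail confirmed by the paper.

```python
from typing import Dict, Set, Tuple


def jaccard(a: Set, b: Set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def pooled_jaccard(per_model: Dict[str, Tuple[Set, Set]]) -> float:
    """Assumed pooling: tag each task with its model, then compare once."""
    left = {(model, t) for model, (a, _) in per_model.items() for t in a}
    right = {(model, t) for model, (_, b) in per_model.items() for t in b}
    return jaccard(left, right)


# Hypothetical recovered-failure sets: (attention-logit bias, steering vector).
per_model = {
    "model_a": ({1, 2, 3, 4}, {1, 2, 3, 5}),
    "model_b": ({1, 2}, {1, 2}),
}

per_model_scores = {m: jaccard(a, b) for m, (a, b) in per_model.items()}
pooled_score = pooled_jaccard(per_model)
print(per_model_scores, round(pooled_score, 3))
```

A score near 1 means the two interventions fix almost exactly the same tasks, which is the pattern the paper reports.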
3. A Training-Free, Gold-Free Selector
The paper introduces a training-free, gold-free selector based on per-segment attention. It recovers about two thirds of the oracle headroom: +11.9 points of pooled function-name selection accuracy on BFCL, against an oracle headroom of +17.9 points. On the Seal-Tools benchmark it adds +14.9 points. Every model tested improves, with exact McNemar p-values ≤ 8e-4.
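Two of these numbers can be made concrete with a short sketch: the share of oracle headroom that the selector recovers, and a two-sided exact McNemar test on paired outcomes. The discordant counts below are hypothetical and chosen only to illustrate the test. The paper's per-model counts are not given in this summary.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: tasks the baseline got wrong and the selector got right
    c: tasks the baseline got right and the selector got wrong
    Under the null hypothesis, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Share of the oracle headroom recovered on BFCL (figures from the article).
gap_closed = 11.9 / 17.9

# Hypothetical discordant counts for one model.
p_value = mcnemar_exact(b=15, c=1)

print(f"gap closed: {gap_closed:.1%}, exact McNemar p = {p_value:.2e}")
```

The test ignores tasks where both systems agree, so it asks only whether the selector's wins outnumber its losses by more than chance would allow.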
| Intervention / Metric | Recovery / Improvement | Notes |
|---|---|---|
| Prompt reordering/duplication (input-side) | ≤23% of failures recovered | Gold tool reordered or duplicated in the prompt |
| Readout-side interventions | 59-91% of failures recovered | Attention-logit bias and residual-stream steering vector |
| Attention-logit bias vs. steering vector overlap | Jaccard 0.865 pooled | 0.79-0.91 per model |
| Gold-free selector on BFCL (function-name) | +11.9 pts vs. baseline | Oracle headroom: +17.9 pts |
| Gold-free selector on Seal-Tools | +14.9 pts | All models positive (p ≤ 8e-4) |
Scope and Limitations
The paper is explicit about scope. The causal attention-bias dose-response effect is bidirectional and monotonic on 10 mask-honoring models ranging from 3B to 32B parameters. Across the full 0.5B to 32B parameter span, only the correlational diagnostic is available. The deployable selector is evaluated on 5 single-turn models and does not yet transfer to a multi-turn loop. The findings therefore hold for single-turn tool calls, and further work is needed for conversational agents.
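The idea of an additive attention-logit bias with a dose-response curve can be illustrated on a toy example. The sketch below adds a constant to the pre-softmax logits of one segment and tracks how that segment's attention mass changes as the constant is swept from negative to positive. The logits, span, and dose values are hypothetical. This shows only the softmax mechanics for a single distribution, not the paper's experiment, which measures the effect on tool selection in real models.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segment_mass_under_bias(logits: List[float], span: Tuple[int, int], bias: float) -> float:
    """Add `bias` to every logit inside `span`, then return that span's attention mass."""
    start, end = span
    biased = [x + bias if start <= i < end else x for i, x in enumerate(logits)]
    return sum(softmax(biased)[start:end])


# Hypothetical pre-softmax attention logits over six prompt tokens.
logits = [0.5, 1.0, 0.2, 1.5, 0.3, 0.8]
gold_span = (2, 4)  # tokens belonging to the gold tool definition (assumed)

doses = [-4.0, -2.0, 0.0, 2.0, 4.0]
masses = [segment_mass_under_bias(logits, gold_span, d) for d in doses]

for dose, mass in zip(doses, masses):
    print(f"bias {dose:+.1f} -> gold segment mass {mass:.3f}")
```

Negative doses push attention away from the segment and positive doses pull it in, which is the sense in which such an intervention is bidirectional and monotonic at the level of attention weights.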
Implications for Enterprise AI
For enterprise developers building LLM agents that rely on tool calling, for example in supply chain, logistics, or finance, this research offers a concrete diagnosis of a common failure mode. The results suggest that simplifying the tool harness (e.g., by reducing the number of tools or reordering them) is unlikely to recover most failures, and that engineering effort is better spent on the decision readout. The training-free selector is a lightweight option that can raise accuracy without retraining, provided the model exposes its attention weights. However, practitioners should validate these findings on their own multi-turn pipelines, as the selector is evaluated only in single-turn settings and does not yet transfer to a multi-turn loop.
As the paper concludes, the evidence "directly refutes the intuitive 'crowded-harness / lost-in-the-middle' explanation" and points the way to more effective interventions for tool-selection reliability in LLM agents.