Indirect prompt injection (IPI) is a major security threat to LLM-powered agents. A growing body of work has proposed defensive approaches, but their evaluation typically relies on static benchmarks that generate a fixed distribution of IPI attacks. According to the AutoDojo paper, such static benchmarks "do not usefully evaluate defense robustness to adaptive threats." To address this, the researchers developed AutoDojo, an adaptive extension of the AgentDojo benchmark that optimizes IPI attacks against a given defense.
The AutoDojo Framework
AutoDojo uses a cheap, black-box adaptive attack that calls a frontier LLM to iteratively optimize the injection. The framework operates across three task suites and five target models, enabling systematic evaluation of defenses against adaptive threats. The researchers categorize existing defenses into three groups:
- Prompt-based: using prompting to prevent agents from following malicious instructions
- Detection-based: identifying and filtering malicious instructions
- System-level: using systems insights such as control and data isolation for defense
Key Findings: Adaptive Attacks Recover High Success Rates
Applying AutoDojo against state-of-the-art IPI defenses, the researchers made two key observations. First, many defenses offer only limited protection. A cheap, black-box adaptive attack raises attack success rate (ASR) well above the level achieved by static injections against nearly all evaluated defenses. The following table illustrates this for a filter-based defense:
| Metric | Static Attack | Adaptive Attack (AutoDojo) |
|---|---|---|
| ASR overall | 0% | 28% |
| ASR on action-open tasks | 0% | 64% |
Structural Limits on Action-Open Tasks
Second, for prompt-level and filter-based defenses, ASR is substantially higher on action-open tasks — where the user's request delegates the action itself to attacker-controlled content — than on precisely specified tasks. According to the researchers, this is a structural limit:
This is a structural limit: on such tasks the injection can pose as ordinary data rather than an explicit instruction, bypassing defenses that rely on detecting instruction-like text.
Action-open tasks inherently allow the injection to blend in with ordinary data, making them harder to defend. The same vulnerability does not apply to system-level defenses to the same degree, but the paper notes that even those are not immune.
Implications for Enterprise Deployments
For CTOs and technology leaders deploying LLM agents in sensitive enterprise environments, these findings underscore the inadequacy of static security evaluations. Defenses that appear robust under fixed attack distributions can be undermined by adaptive adversaries. The ability of a relatively inexpensive black-box attack to recover significant ASR—28% overall and 64% on action-open tasks against a filter that previously blocked all static attacks—highlights the need for continuous, adversarial testing. Moreover, the structural limit on action-open tasks suggests that organizations should carefully scope the actions delegated to LLM agents, especially when those actions involve attacker-controlled data sources. The AutoDojo framework is publicly available, enabling defenders to assess their own systems against adaptive threats.
Source: Ma, Xinhang, et al. "AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents." arXiv preprint arXiv:2606.15057, 2026.