AutoDojo: Adaptive Attacks Expose Superficial Defenses and Structural Limits in LLM Agents

The AutoDojo framework adaptively optimizes indirect prompt injections against LLM agent defenses, revealing that many current defenses are superficial. Against a filter that reduces static attack success rate to 0%, AutoDojo recovers 28% overall and 64% on action-open tasks due to a structural limitation where injections can pose as ordinary data.

iGEN Editorial

June 16, 2026

AutoDojo: Adaptive Attacks Expose Superficial Defenses and Structural Limits in LLM Agents

Indirect prompt injection (IPI) is a major security threat to LLM-powered agents. A growing body of work has proposed defensive approaches, but their evaluation typically relies on static benchmarks that generate a fixed distribution of IPI attacks. According to the AutoDojo paper, such static benchmarks "do not usefully evaluate defense robustness to adaptive threats." To address this, the researchers developed AutoDojo, an adaptive extension of the AgentDojo benchmark that optimizes IPI attacks against a given defense.

The AutoDojo Framework

AutoDojo uses a cheap, black-box adaptive attack that calls a frontier LLM to iteratively optimize the injection. The framework operates across three task suites and five target models, enabling systematic evaluation of defenses against adaptive threats. The researchers categorize existing defenses into three groups:

Prompt-based: using prompting to prevent agents from following malicious instructions
Detection-based: identifying and filtering malicious instructions
System-level: using systems insights such as control and data isolation for defense

Key Findings: Adaptive Attacks Recover High Success Rates

Applying AutoDojo against state-of-the-art IPI defenses, the researchers made two key observations. First, many defenses offer only limited protection. A cheap, black-box adaptive attack raises attack success rate (ASR) well above the level achieved by static injections against nearly all evaluated defenses. The following table illustrates this for a filter-based defense:

Metric	Static Attack	Adaptive Attack (AutoDojo)
ASR overall	0%	28%
ASR on action-open tasks	0%	64%

Structural Limits on Action-Open Tasks

Second, for prompt-level and filter-based defenses, ASR is substantially higher on action-open tasks — where the user's request delegates the action itself to attacker-controlled content — than on precisely specified tasks. According to the researchers, this is a structural limit:

This is a structural limit: on such tasks the injection can pose as ordinary data rather than an explicit instruction, bypassing defenses that rely on detecting instruction-like text.

Action-open tasks inherently allow the injection to blend in with ordinary data, making them harder to defend. The same vulnerability does not apply to system-level defenses to the same degree, but the paper notes that even those are not immune.

Implications for Enterprise Deployments

For CTOs and technology leaders deploying LLM agents in sensitive enterprise environments, these findings underscore the inadequacy of static security evaluations. Defenses that appear robust under fixed attack distributions can be undermined by adaptive adversaries. The ability of a relatively inexpensive black-box attack to recover significant ASR—28% overall and 64% on action-open tasks against a filter that previously blocked all static attacks—highlights the need for continuous, adversarial testing. Moreover, the structural limit on action-open tasks suggests that organizations should carefully scope the actions delegated to LLM agents, especially when those actions involve attacker-controlled data sources. The AutoDojo framework is publicly available, enabling defenders to assess their own systems against adaptive threats.

Source: Ma, Xinhang, et al. "AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents." arXiv preprint arXiv:2606.15057, 2026.

Sources:

AutoDojo: Adaptive Attacks Expose Superficial Defenses and Structural Limits in LLM Agents

The AutoDojo Framework

Key Findings: Adaptive Attacks Recover High Success Rates

Structural Limits on Action-Open Tasks

Implications for Enterprise Deployments

Recommended Stories

Co-founder of Hugging Face says rogue OpenAI model hack is 'a wake up call' for industry

Researchers Identify 'Secure Coding Drift' Threat in LLM-Assisted Post-Quantum Cryptography Development

LedgerAgent: A New Method for Policy-Adherent Tool-Calling AI Agents in Customer Service

Beyond Static Leaderboards: Predictive Validity for Evaluating LLM Agents in Enterprise AI