iGEN
Visit IGEN World Explore IGEN Expo
EXPLORE UPGRADE PLANS
BREAKING
APEC Climate Center Upgrades El Niño to Strong; Indian Monsoon Faces Elevated Risk New Architecture GRIL Enables Gradient Descent-Like Learning in Linear Recurrent Networks ToolSelf AI Agents Achieve 28.8 Point Gain Through Runtime Self-Reconfiguration ArtNet: JEPA-Like Articulatory Framework Achieves 20.56% Error Reduction in Zero-Shot Phoneme Recognition LLM-Assisted Stance Detection in Scientific Discourse Reaches 0.76 Combined Reliability Score New Drift-RAE Method Distills Transformers Efficiently Using Representation Autoencoders Cough Regression Benchmark Reveals Trade-Offs in Respiratory Acoustic Foundation Models Spacex Acquires AI Coding Startup Cursor For $60bn Days After Bumper IPO Metacognitive Myopia in LLMs: New Framework Reveals Hidden Biases with High-Stakes Implications Lightweight Hardware-Aware Neural Architecture Search Enables CNNs on Ultra-Low-Power Microcontrollers APEC Climate Center Upgrades El Niño to Strong; Indian Monsoon Faces Elevated Risk New Architecture GRIL Enables Gradient Descent-Like Learning in Linear Recurrent Networks ToolSelf AI Agents Achieve 28.8 Point Gain Through Runtime Self-Reconfiguration ArtNet: JEPA-Like Articulatory Framework Achieves 20.56% Error Reduction in Zero-Shot Phoneme Recognition LLM-Assisted Stance Detection in Scientific Discourse Reaches 0.76 Combined Reliability Score New Drift-RAE Method Distills Transformers Efficiently Using Representation Autoencoders Cough Regression Benchmark Reveals Trade-Offs in Respiratory Acoustic Foundation Models Spacex Acquires AI Coding Startup Cursor For $60bn Days After Bumper IPO Metacognitive Myopia in LLMs: New Framework Reveals Hidden Biases with High-Stakes Implications Lightweight Hardware-Aware Neural Architecture Search Enables CNNs on Ultra-Low-Power Microcontrollers
Home ›› Technology ›› Ai ›› Llms ›› AutoDojo: Adaptive Attacks Expose Superficial Defenses and Structural Limits in LLM Agents

AutoDojo: Adaptive Attacks Expose Superficial Defenses and Structural Limits in LLM Agents

The AutoDojo framework adaptively optimizes indirect prompt injections against LLM agent defenses, revealing that many current defenses are superficial. Against a filter that reduces static attack success rate to 0%, AutoDojo recovers 28% overall and 64% on action-open tasks due to a structural limitation where injections can pose as ordinary data.

iG
iGEN Editorial
June 16, 2026
AutoDojo: Adaptive Attacks Expose Superficial Defenses and Structural Limits in LLM Agents

Indirect prompt injection (IPI) is a major security threat to LLM-powered agents. A growing body of work has proposed defensive approaches, but their evaluation typically relies on static benchmarks that generate a fixed distribution of IPI attacks. According to the AutoDojo paper, such static benchmarks "do not usefully evaluate defense robustness to adaptive threats." To address this, the researchers developed AutoDojo, an adaptive extension of the AgentDojo benchmark that optimizes IPI attacks against a given defense.

The AutoDojo Framework

AutoDojo uses a cheap, black-box adaptive attack that calls a frontier LLM to iteratively optimize the injection. The framework operates across three task suites and five target models, enabling systematic evaluation of defenses against adaptive threats. The researchers categorize existing defenses into three groups:

  • Prompt-based: using prompting to prevent agents from following malicious instructions
  • Detection-based: identifying and filtering malicious instructions
  • System-level: using systems insights such as control and data isolation for defense

Key Findings: Adaptive Attacks Recover High Success Rates

Applying AutoDojo against state-of-the-art IPI defenses, the researchers made two key observations. First, many defenses offer only limited protection. A cheap, black-box adaptive attack raises attack success rate (ASR) well above the level achieved by static injections against nearly all evaluated defenses. The following table illustrates this for a filter-based defense:

Metric Static Attack Adaptive Attack (AutoDojo)
ASR overall 0% 28%
ASR on action-open tasks 0% 64%

Structural Limits on Action-Open Tasks

Second, for prompt-level and filter-based defenses, ASR is substantially higher on action-open tasks — where the user's request delegates the action itself to attacker-controlled content — than on precisely specified tasks. According to the researchers, this is a structural limit:

This is a structural limit: on such tasks the injection can pose as ordinary data rather than an explicit instruction, bypassing defenses that rely on detecting instruction-like text.

Action-open tasks inherently allow the injection to blend in with ordinary data, making them harder to defend. The same vulnerability does not apply to system-level defenses to the same degree, but the paper notes that even those are not immune.

Implications for Enterprise Deployments

For CTOs and technology leaders deploying LLM agents in sensitive enterprise environments, these findings underscore the inadequacy of static security evaluations. Defenses that appear robust under fixed attack distributions can be undermined by adaptive adversaries. The ability of a relatively inexpensive black-box attack to recover significant ASR—28% overall and 64% on action-open tasks against a filter that previously blocked all static attacks—highlights the need for continuous, adversarial testing. Moreover, the structural limit on action-open tasks suggests that organizations should carefully scope the actions delegated to LLM agents, especially when those actions involve attacker-controlled data sources. The AutoDojo framework is publicly available, enabling defenders to assess their own systems against adaptive threats.

Source: Ma, Xinhang, et al. "AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents." arXiv preprint arXiv:2606.15057, 2026.


Sources:

Keep Reading

Recommended Stories

SkillVetBench Uses LLM-as-Judge to Evaluate Security Risks in Open-Source Agent Skills Technology

SkillVetBench Uses LLM-as-Judge to Evaluate Security Risks in Open-Source Agent Skills

SkillVetBench, a live Hugging Face leaderboard, uses an LLM-as-Judge approach to vet open-source LLM agent skills for security risks. It introduces the Skill Agentic Risk Score (SARS) and integrates CVSS v4.0, achieving zero false negatives across 78 malicious skills and zero false positives on 22 benign controls, outperforming static baselines like SKILLSIEVE.

June 16, 2026
CoffeeBench: New Benchmark Evaluates LLM Agents in Multi-Agent Economic Simulations Technology

CoffeeBench: New Benchmark Evaluates LLM Agents in Multi-Agent Economic Simulations

Researchers introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy. The 90-day simulation features farmers, roasters, and retailers, with models controlling one roaster. All models outperformed a passive baseline, but Claude Haiku 4.5 showed an idle-drift failure mode.

June 16, 2026
New Attack Forces Costly Model Usage in Multimodal LLM Cascades Technology

New Attack Forces Costly Model Usage in Multimodal LLM Cascades

A research paper introduces the Forced Deferral Attack (FDA), which manipulates confidence thresholds in multimodal large language model cascades, causing queries to be routed to more expensive strong models. The attack raises security concerns for enterprises deploying cost-optimized AI systems.

June 16, 2026
New Framework Automates Skill Construction for Agentic Large Language Models Technology

New Framework Automates Skill Construction for Agentic Large Language Models

A new framework called Collective Skill Tree Search (CSTS) automatically constructs reusable skills for large language model (LLM) agents. It uses two iterative phases—collective generation and collective assessment—to build a diverse, generalizable tree of skills that enhances agentic capabilities in planning, tool use, and environment interaction.

June 16, 2026