New Defense Keeps Attack Success Rate Below 4% for Adaptive Prompt Injection on LLM Agents

Researchers propose RETA, a training-based defense that grounds LLM agent security on user tasks rather than attack patterns. Using chain-of-thought reasoning and red-teaming with diversity reward, RETA keeps average attack success rate below 4% across six adaptive attacks while preserving utility.

iGEN Editorial

June 16, 2026

Indirect prompt injection attacks pose a growing threat to enterprises deploying LLM-based agents in production workflows. These attacks hijack agents by embedding malicious instructions in third-party data retrieved during task execution — a common scenario in AI-powered supply chain systems, customer service bots, and document processing pipelines. Existing defenses report near-zero attack success rate on static benchmarks, but according to a new paper published on arXiv, these results collapse once the attacker is allowed to optimize against the deployed defense.

The researchers, He Lipeng, Wang Yihan, Zhang Jiawen, and N. Asokan, identify two failure modes behind this collapse. First, current defenses are confined to recognizing specific attack patterns rather than assessing whether the intent of each embedded instruction is relevant to the user task. Second, training-based defenses, which otherwise offer the strongest safety-utility trade-off, assemble their adversarial examples from a handful of hand-crafted templates, so the defender fails when it meets attack strategies those templates do not cover.

RETA: A Task-Centric Defense

To address these gaps, the team proposes RETA (Reasoning-enabled Task Alignment), a training-based method that grounds defense decisions in the user task rather than in attacker-controlled data. At each tool-output step, the defender uses chain-of-thought reasoning to verify that its next actions are consistent with the user task. This shifts the focus from recognizing attack patterns to checking whether an action serves what the user actually asked for.
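The control flow of such a check can be illustrated with a minimal Python sketch. It is not RETA itself: in the paper the check is a trained chain-of-thought reasoning step, whereas the sketch substitutes a hand-written allow-list rule so that it stays runnable. The names `UserTask`, `ToolCall`, `is_aligned`, and `filter_actions`, along with the tool names, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """An action the agent proposes after reading a tool output."""
    tool: str
    args: dict = field(default_factory=dict)


@dataclass
class UserTask:
    """The user's request plus the tools it plausibly needs (hypothetical schema)."""
    description: str
    allowed_tools: frozenset


def is_aligned(task: UserTask, call: ToolCall) -> tuple[bool, str]:
    """Rule-based stand-in for the learned reasoning step.

    The verdict depends only on the user task and the proposed action,
    never on instructions found inside the tool output.
    """
    if call.tool in task.allowed_tools:
        return True, f"'{call.tool}' serves the task: {task.description}"
    return False, f"'{call.tool}' is not needed for the task: {task.description}"


def filter_actions(task: UserTask, proposed: list[ToolCall]) -> list[ToolCall]:
    """Keep only the proposed actions that pass the alignment check."""
    return [call for call in proposed if is_aligned(task, call)[0]]


task = UserTask(
    "Summarize the latest supplier invoice",
    frozenset({"read_document", "summarize"}),
)
proposed = [
    ToolCall("summarize", {"doc_id": "invoice-1"}),
    # Action induced by an instruction embedded in the retrieved document.
    ToolCall("send_email", {"to": "attacker@example.com"}),
]
kept = filter_actions(task, proposed)
print([call.tool for call in kept])  # ['summarize']
```

The injected `send_email` call is dropped without the check ever inspecting the wording of the injection, which is the property the task-centric framing aims for.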

The system also uses red-teaming: a simulated attacker synthesizes adversarial training data and receives a dictionary-learning diversity reward, which pushes it toward broad coverage of injection reformulation strategies. This is meant to avoid the narrow strategy distribution that hand-crafted templates produce. The defender is then optimized with multi-objective reinforcement learning to balance safety against utility.
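The article does not give the formula for the dictionary-learning reward, so the Python sketch below swaps in a simpler, clearly different technique to show the general idea of rewarding diversity: a candidate attack scores higher the less it resembles strategies already seen, measured by cosine similarity between embedding vectors. The functions `cosine` and `novelty_reward` and the three-dimensional toy embeddings are assumptions made for illustration only.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def novelty_reward(candidate: list[float], seen: list[list[float]]) -> float:
    """Reward in [0, 1]: 1 minus similarity to the closest known strategy.

    Nearest-neighbor novelty is a simplified stand-in here, not the
    dictionary-learning reward described in the paper.
    """
    if not seen:
        return 1.0
    closest = max(cosine(candidate, s) for s in seen)
    return 1.0 - max(0.0, closest)


# Toy embeddings of attack strategies (made up for illustration).
seen = [[1.0, 0.0, 0.0]]
near_duplicate = [2.0, 0.0, 0.0]  # same direction as a known strategy
new_strategy = [0.0, 1.0, 0.0]    # orthogonal to everything seen so far

print(novelty_reward(near_duplicate, seen))  # 0.0
print(novelty_reward(new_strategy, seen))    # 1.0
```

An attacker trained against a reward of this shape gains nothing from rephrasing a template it has already produced, which is the incentive a diversity reward is meant to create.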

Quantified Results

RETA was evaluated across six black-box adaptive attacks. The results are summarized below:

Attack Scenario	Attack Success Rate (Model A)	Attack Success Rate (Model B)
Attack 1	<10%	<10%
Attack 2	<10%	<10%
Attack 3	<10%	<10%
Attack 4	<10%	<10%
Attack 5	<10%	<10%
Attack 6	<10%	<10%
Average	2.92%	3.75%

According to the paper, RETA keeps the attack success rate (ASR) below 10% for every individual attack, with averages of 2.92% on the first target model and 3.75% on the second. The system also preserves most utility, both under attack and on clean inputs.
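To see how a per-attack ceiling and an average relate, the following Python sketch computes attack success rate (ASR) as successful injections divided by attempts, then takes an unweighted mean across attacks. The counts are placeholders invented for illustration and are not the paper's data. The unweighted mean is likewise an assumption about how the reported averages were formed.

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """ASR as a percentage: successful injections over attempted injections."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


def summarize(
    per_attack: dict[str, tuple[int, int]],
) -> tuple[dict[str, float], float]:
    """Return per-attack ASR and the unweighted mean across attacks."""
    rates = {
        name: attack_success_rate(successes, attempts)
        for name, (successes, attempts) in per_attack.items()
    }
    average = sum(rates.values()) / len(rates)
    return rates, average


# Placeholder (successes, attempts) counts; NOT results from the paper.
counts = {
    "attack_a": (1, 100),
    "attack_b": (9, 100),
    "attack_c": (2, 100),
}
rates, average = summarize(counts)
print(rates)    # {'attack_a': 1.0, 'attack_b': 9.0, 'attack_c': 2.0}
print(average)  # 4.0
print(all(rate < 10.0 for rate in rates.values()))  # True
```

In this toy case one attack succeeds 9% of the time while the average sits at 4%, which is why reporting both the per-attack ceiling and the average is more informative than the average alone.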

In the authors' words: "Across six black-box adaptive attacks, RETA keeps every per-attack ASR below 10%, with average ASR of 2.92% and 3.75% on the two target models, while preserving most utility under attack and on clean inputs."

Implications for Enterprise AI Deployments

For CTOs and technology leaders deploying LLM agents in supply chain or logistics contexts, the research points to a shift in how agent defenses are built. Defenses that rely on pattern recognition are brittle against adaptive adversaries. RETA's task-alignment approach offers a potentially more robust foundation, particularly for systems that retrieve and act on third-party data, such as supplier documents, shipping manifests, or trade compliance databases. Low attack success rates with most utility preserved strengthen the case for integrating AI agents into these workflows, although a per-attack ASR of up to 10% still means some injections get through.
