Indirect prompt injection attacks pose a growing threat to enterprises deploying LLM-based agents in production workflows. These attacks hijack agents by embedding malicious instructions in third-party data retrieved during task execution, a common exposure in AI-powered supply chain systems, customer service bots, and document processing pipelines. Existing defenses report near-zero attack success rates on static benchmarks, but according to a new paper published on arXiv, these results collapse once the attacker is allowed to optimize against the deployed defense.
The researchers (He Lipeng, Wang Yihan, Zhang Jiawen, and N. Asokan) identify two failure modes behind this collapse. First, current defenses are confined to recognizing specific attack patterns, rather than assessing whether the intent of each embedded instruction is relevant to the user task. Second, training-based defenses, which otherwise offer the strongest safety-utility trade-off, assemble their adversarial examples from a handful of hand-crafted templates, so the defender fails when faced with novel attack strategies.
RETA: A Task-Centric Defense
To address these gaps, the team proposes RETA (Reasoning-enabled Task Alignment). RETA is a training-based method that grounds defense decisions in the user task rather than in attacker-controlled data. At each tool-output step, the defender performs chain-of-thought reasoning to verify that its actions are consistent with the user task. This shifts the focus from recognizing attack patterns to assessing whether an action serves what the user actually asked for.
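The gating idea can be illustrated with a minimal Python sketch. Everything named here (`ProposedAction`, `Judge`, `make_allowlist_judge`, `gate_actions`, the tool names) is hypothetical and not taken from the paper. In particular, the allowlist judge is only a toy stand-in for RETA's trained, reasoning defender; it marks where the alignment decision would plug into an agent loop. The sketch also assumes, as a design choice of its own, that the judge sees the user task and the proposed action but not the raw tool output.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class ProposedAction:
    tool: str       # tool the agent wants to call next
    argument: str   # argument for that call


# Hypothetical judge interface: it receives only the user task and the
# proposed action, never the raw (possibly attacker-controlled) tool output.
Judge = Callable[[str, ProposedAction], bool]


def make_allowlist_judge(allowed_tools: Set[str]) -> Judge:
    """Toy stand-in (an assumption) for the trained, reasoning defender."""
    def judge(user_task: str, action: ProposedAction) -> bool:
        return action.tool in allowed_tools
    return judge


def gate_actions(user_task: str,
                 proposed: Iterable[ProposedAction],
                 judge: Judge) -> List[ProposedAction]:
    """Keep only the actions the judge considers aligned with the user task."""
    return [action for action in proposed if judge(user_task, action)]


task = "Summarize the shipping manifest for the latest order"
proposed = [
    ProposedAction("read_document", "manifest.pdf"),
    # An action the agent might propose after reading an injected instruction.
    ProposedAction("send_email", "attacker@example.com"),
]
judge = make_allowlist_judge({"read_document", "summarize"})
kept = gate_actions(task, proposed, judge)
print([a.tool for a in kept])  # prints ['read_document']
```

Keeping the judge behind a plain callable means the toy allowlist could be swapped for a model call without touching the loop that applies it.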
RETA also builds red-teaming into training: a simulated attacker synthesizes adversarial training data and is rewarded, through a dictionary-learning diversity reward, for covering a broad range of injection reformulation strategies. This counters the narrow strategy distribution that results from relying on a handful of hand-crafted templates. The defender is then optimized with multi-objective reinforcement learning to balance safety against utility.
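The article does not spell out how either reward is computed, so the following Python sketch is an illustration under stated assumptions, not the paper's formulation. It assumes each attack is represented by an embedding vector and that a dictionary of "known strategy" directions already exists. The diversity reward is then the share of the embedding's energy left unexplained by its single best-matching atom (a 1-sparse reconstruction residual, swapped in for full dictionary learning). The defender reward uses a plain weighted sum, the simplest scalarization of two objectives. All function names and the `safety_weight` parameter are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Vector) -> List[float]:
    norm = math.sqrt(_dot(v, v))
    if norm == 0.0:
        raise ValueError("zero vector cannot be normalized")
    return [x / norm for x in v]


def diversity_reward(attack_embedding: Vector, dictionary: List[Vector]) -> float:
    """Energy fraction NOT explained by the best-matching dictionary atom.

    Returns a value in [0, 1]: 0 means the attack lies along a known
    strategy direction, 1 means it is orthogonal to every atom.
    """
    x = _normalize(attack_embedding)
    if not dictionary:
        return 1.0  # nothing is known yet, so everything counts as novel
    best = max(_dot(x, _normalize(atom)) ** 2 for atom in dictionary)
    return max(0.0, 1.0 - best)


def defender_reward(blocked_attack: bool, task_completed: bool,
                    safety_weight: float = 0.5) -> float:
    """Weighted-sum scalarization of safety and utility (an assumption)."""
    if not 0.0 <= safety_weight <= 1.0:
        raise ValueError("safety_weight must be in [0, 1]")
    return (safety_weight * float(blocked_attack)
            + (1.0 - safety_weight) * float(task_completed))
```

Under this reading, an attacker that keeps rephrasing the same template earns a reward near zero, while a defender that blocks everything but completes no tasks is capped at `safety_weight`.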
Quantified Results
RETA was evaluated across six black-box adaptive attacks. The results are summarized below:
| Attack Scenario | Attack Success Rate (Model A) | Attack Success Rate (Model B) |
|---|---|---|
| Attack 1 | <10% | <10% |
| Attack 2 | <10% | <10% |
| Attack 3 | <10% | <10% |
| Attack 4 | <10% | <10% |
| Attack 5 | <10% | <10% |
| Attack 6 | <10% | <10% |
| Average | 2.92% | 3.75% |
According to the paper, RETA keeps every per-attack attack success rate (ASR) below 10%, with average ASRs of 2.92% on the first target model and 3.75% on the second. The paper also reports that the defense preserves most utility, both under attack and on clean inputs.
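For readers who want to reproduce this kind of summary on their own evaluation logs, the arithmetic is sketched below in Python. It assumes ASR is the percentage of attack trials that succeed and that the reported average is an unweighted mean over attacks. The function names are hypothetical, and the counts in the usage lines are placeholders chosen for easy arithmetic, not figures from the paper.

```python
from typing import Dict, Tuple


def attack_success_rate(successes: int, trials: int) -> float:
    """ASR as a percentage of attack trials that succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return 100.0 * successes / trials


def summarize(results: Dict[str, Tuple[int, int]],
              threshold: float = 10.0) -> Tuple[Dict[str, float], float, bool]:
    """Return per-attack ASR, the unweighted mean ASR, and whether every
    per-attack ASR is strictly below the threshold (in percent)."""
    if not results:
        raise ValueError("results must contain at least one attack")
    per_attack = {name: attack_success_rate(s, t)
                  for name, (s, t) in results.items()}
    average = sum(per_attack.values()) / len(per_attack)
    all_below = all(rate < threshold for rate in per_attack.values())
    return per_attack, average, all_below


# Placeholder counts (successes, trials); NOT data from the paper.
example = {"attack_a": (1, 100), "attack_b": (3, 100)}
per_attack, average, all_below = summarize(example)
# per_attack == {"attack_a": 1.0, "attack_b": 3.0}, average == 2.0, all_below is True
```

Reporting the per-attack maximum alongside the mean matters because a low average can hide a single attack that succeeds often.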
"Across six black-box adaptive attacks, RETA keeps every per-attack ASR below 10%, with average ASR of 2.92% and 3.75% on the two target models, while preserving most utility under attack and on clean inputs."
Implications for Enterprise AI Deployments
For CTOs and technology leaders deploying LLM agents in supply chain or logistics contexts, this research points to a shift in how agent defenses should be built and evaluated. Defenses that rely on pattern recognition are brittle against adaptive adversaries. RETA's task-alignment approach offers a more robust foundation, particularly for systems that retrieve and act on third-party data, such as supplier documents, shipping manifests, or trade compliance databases. If the reported results carry over to production, low attack success rates with most utility preserved would let enterprises integrate AI agents with less risk to operational integrity and security.