Topic
adversarial attacks
New Research Defends LLMs from Extraction Attacks Using 'Knowledge Trap' Honeypot
A research paper by Dai and Dong introduces Knowledge Trap, a defense against extraction attacks on large language models. It uses a Honeypot Knowledge Graph to redirect attackers' queries toward low-value knowledge, reducing surrogate agreement by 6.2% on average while preserving performance for legitimate users.
AutoDojo: Adaptive Attacks Expose Superficial Defenses and Structural Limits in LLM Agents
The AutoDojo framework adaptively optimizes indirect prompt injections against LLM agent defenses, revealing that many current defenses are superficial. Against a filter that reduces the success rate of static attacks to 0%, AutoDojo recovers a 28% success rate overall and 64% on action-open tasks. The recovery stems from a structural limitation: injections can pose as ordinary data.
New Defense Keeps Attack Success Rate Below 4% for Adaptive Prompt Injection on LLM Agents
Researchers propose RETA, a training-based defense that grounds LLM agent security in user tasks rather than attack patterns. Using chain-of-thought reasoning and red-teaming with a diversity reward, RETA keeps the average attack success rate below 4% across six adaptive attacks while preserving utility.