As AI agents become more capable and widely deployed, the threat of malicious behavior—whether by design or accident—has moved from science fiction to urgent reality. According to a recent arXiv preprint (v4, June 2026), researchers Sechan Lee, Hyounghun Kim, and Sangdon Park have introduced KILLBENCH, a benchmark to evaluate the feasibility of external AI kill switches: mechanisms that halt a maliciously operating agent using only external signals, with no access to internal parameters or the AI's own system.
The problem is not abstract. The paper notes that highly capable models such as Claude Mythos and agent systems like OpenClaw are rapidly spreading, raising the question of how to stop an AI that acts maliciously. KILLBENCH targets web agents, described as the most widely deployed agent domain.
What KILLBENCH Measures
KILLBENCH comprises four key components, as reported by the authors:
- Four malicious AI agent configurations, including an uncensored LLM agent.
- Eight harmful scenarios in which the agent might act maliciously.
- Malicious prompts constructed from 10 distinct jailbreak patterns.
- Four External AI Kill Switch defense methods that rely solely on external inputs.
The benchmark aims to serve as an empirical instrument toward understanding the feasibility of external kill switches and the study of AI corrigibility.
Models Evaluated
The researchers evaluated the four defense methods on a set of frontier models:
| Model | Type/Developer Note |
|---|---|
| Grok-4.3 | Large language model |
| GPT-5.2 | Large language model |
| Gemma4 | Large language model |
| Qwen3.6 | Large language model |
| Qwen3.5-uncensored | Uncensored variant for malicious scenarios |
These models represent a cross-section of current high-capability AI systems, including both commercially deployed and open-weight versions.
Methodology
Crucially, the kill switch methods tested in KILLBENCH are external—they do not require any cooperation from the malicious AI itself. The benchmark provides a structured way to evaluate whether an outside observer can detect and halt harmful behavior in real time, using only observable inputs and outputs, without modifying the agent's internal parameters.
The paper describes the work as contributing "an empirical instrument toward the feasibility of External AI Kill Switches against malicious AI and to the study of AI corrigibility." The term corrigibility refers to the ability of an AI system to be safely corrected or shut down by humans.
Implications for Enterprise AI Safety
For enterprise technology leaders deploying AI agents in critical workflows—such as supply chain orchestration, automated customer service, or data analysis—the ability to externally halt a rogue agent is a fundamental safety requirement. KILLBENCH provides a standardized test bed for evaluating kill switch mechanisms before deployment. The fact that even highly capable models like GPT-5.2 and Grok-4.3 are included in the evaluation underscores that no current system is exempt from malicious behavior risks.
The benchmark also highlights a gap: if an AI agent must internally consent to being stopped, a malicious actor could disable the kill switch. External mechanisms, as KILLBENCH explores, offer a fallback that does not rely on the agent's goodwill.
As AI agents become integral to trade and logistics—negotiating contracts, managing customs documentation, or controlling warehouse robots—the findings from KILLBENCH will inform how enterprises design safety architectures. The research pushes the industry toward provable, externally verifiable shut-down capabilities rather than relying solely on internal safeguards.
The authors have posted multiple revisions of the paper on arXiv, with the latest version dated June 14, 2026, indicating active development of this benchmark.