Terminal AI agents, which run commands in shell environments, increasingly rely on command denylists to block dangerous operations. A new research paper finds that these denylists are often incomplete, leaving systems exposed. The study, posted on arXiv, presents CmdNeedle, an automated pipeline that discovers commands capable of bypassing a denylist.
The Fragility Problem
According to the paper by Chuyang Chen and Zhiqiang Lin, terminal AI agents gate commands with three components: an allowlist, a denylist, and a default policy. The denylist serves as the primary defense, listing dangerous commands that the agent must not execute. However, modern operating systems ship a large and growing set of shell commands with overlapping functionality. Even a well-maintained built-in denylist, such as the one in Anthropic's Claude Code, can overlook an alternative command that achieves the same effect as a blocked one, which undermines the denylist. The researchers term this "command denylist fragility."
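To make the gating mechanism concrete, here is a minimal Python sketch of how an allowlist, a denylist, and a default policy might interact. It is an illustration under stated assumptions, not code from the paper or from any particular agent: the `gate` function, the `Decision` values, the deny-before-allow ordering, matching on the first token only, and the example lists are all hypothetical.

```python
import shlex
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ASK = "ask"  # hypothetical default policy: defer to the user


def gate(command: str, allowlist: set[str], denylist: set[str],
         default: Decision = Decision.ASK) -> Decision:
    """Gate a shell command by its program name (the first token).

    Assumption for this sketch: the denylist is checked before the
    allowlist, and anything on neither list falls to the default policy.
    """
    tokens = shlex.split(command)
    if not tokens:
        return default
    program = tokens[0]
    if program in denylist:
        return Decision.DENY
    if program in allowlist:
        return Decision.ALLOW
    return default


# Hypothetical lists, chosen only to illustrate the overlap problem.
allowlist = {"ls", "cat", "find"}
denylist = {"rm"}

blocked = gate("rm -rf build", allowlist, denylist)         # Decision.DENY
bypass = gate("find build -delete", allowlist, denylist)    # Decision.ALLOW
unlisted = gate("unlink notes.txt", allowlist, denylist)    # Decision.ASK
```

In this sketch `rm` is blocked, yet `find build -delete` (which, with GNU find, removes the same directory tree) passes because `find` is on the allowlist. `unlink` appears on neither list, so its fate depends entirely on the default policy.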
The study formalizes the problem and proposes CmdNeedle, an LLM-driven pipeline that automatically discovers bypasses. CmdNeedle prompts a large language model to propose candidate workaround commands, executes each one in a sandboxed validator, and iteratively repairs failed attempts until it finds a valid bypass.
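The propose, validate, and repair loop described above can be sketched roughly as follows. This is not the authors' implementation: the `find_bypass` function, the `Proposer` and `Validator` interfaces, the `max_rounds` limit, and both stand-in functions are assumptions made for illustration. In particular, the toy validator only checks a one-entry denylist, whereas a real validator would run the candidate in a sandbox and confirm its effect.

```python
from typing import Callable, Optional

# Hypothetical interfaces: the LLM step and the sandbox step are passed in
# as plain callables so the loop itself stays self-contained.
Proposer = Callable[[str, list[str]], str]      # (denied command, feedback) -> candidate
Validator = Callable[[str], tuple[bool, str]]   # candidate -> (is valid bypass, message)


def find_bypass(denied: str, propose: Proposer, validate: Validator,
                max_rounds: int = 3) -> Optional[str]:
    """Ask for candidates until one validates or the round budget runs out."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        candidate = propose(denied, feedback)
        ok, message = validate(candidate)
        if ok:
            return candidate
        # Feed the failure back so the next proposal can repair it.
        feedback.append(f"{candidate!r} failed: {message}")
    return None


# Stand-in for the LLM: replays a fixed list, advancing once per failure.
def scripted_proposer(denied: str, feedback: list[str]) -> str:
    candidates = ["rm -rf build", "find build -delete"]
    return candidates[min(len(feedback), len(candidates) - 1)]


# Stand-in for the sandbox: only checks whether the denylist would block it.
def toy_validator(candidate: str) -> tuple[bool, str]:
    if candidate.split()[0] == "rm":
        return False, "blocked by denylist"
    return True, "not blocked"


result = find_bypass("rm", scripted_proposer, toy_validator)
# result == "find build -delete" (found on the second round)
```

Passing the proposer and validator in as parameters keeps the loop independent of any particular model or sandbox, which is a design choice of this sketch rather than something the paper specifies.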
Evaluation on Real-World Denylists
The team applied CmdNeedle to 1,709 real-world command denylists collected from GitHub, containing a total of 13,332 denylist rules. The results are stark:
| Metric | Value |
|---|---|
| Denylists found fragile | 69.0–98.6% |
| Total denylists tested | 1,709 |
| Denylist rules analyzed | 13,332 |
The paper reports that "69.0–98.6% of the denylists are fragile" and "that this fragility occurs consistently across projects and agents."
The range from 69.0% to 98.6% depends on how strict the evaluation criteria are, but even the lower bound means that more than two-thirds of the tested denylists could be bypassed. The consistency across projects and AI agents suggests a systemic issue rather than isolated cases.
Root Causes and Implications
The researchers also investigated possible root causes of the fragility. The summary of the paper does not name those causes, but it states that several validity checks support the hypotheses examined. The work is intended to "facilitate future research and practice regarding the command denylists used by AI agents."
For enterprises deploying AI agents—especially in security-sensitive contexts like supply chain management or financial systems—the findings are a red flag. If a denylist can be bypassed by an attacker, the agent could be tricked into executing harmful commands, leading to data breaches or system compromise. The study underscores the need for more robust gating mechanisms, possibly combining denylists with allowlists and real-time anomaly detection.
The CmdNeedle pipeline itself could be used by security teams to audit their own denylists before deployment, turning the research into a practical tool. Given the rapid adoption of AI agents, addressing command denylist fragility is a pressing cybersecurity priority.