As enterprises deploy LLM-powered agents to automate workflows, the security of agent skill ecosystems has emerged as a critical concern. Skills—the capability layer through which agents turn plans into actions—introduce risks such as data leakage, unauthorized operations, and tool misuse. According to a new paper on arXiv, traditional security vetting evaluates each skill in isolation, but real-world agent tasks often invoke multiple skills in a shared execution context. This creates a previously underexplored vulnerability called Skill Composition Risk (SCR): a skill that appears benign alone can become harmful when its outputs, trust signals, authorization cues, or side effects influence later invocations along an activated path.
The SCR-Bench Framework
To systematically evaluate SCR, the researchers developed SCR-Bench, a benchmark operating in controlled, sandboxed skill environments. Rather than relying solely on textual intent or surface behavior, SCR-Bench records downstream state changes and path-level outcomes across composed skill executions. The benchmark comprises three sub-benchmarks designed to capture different composition mechanisms:
- SCR-CapFlow: Tests capability-flow composition, where a skill's output capabilities are passed to subsequent skills.
- SCR-TrustLift: Examines trust-transfer composition, where trust signals from one skill elevate the trust of later skills.
- SCR-AuthBlur: Assesses authorization-confusion composition, where authorization cues become blurred across skill boundaries.
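The capability-flow case can be illustrated with a toy sandbox. The sketch below is not SCR-Bench's code or API; the `Sandbox` class, the two skills, and the dummy secret are all hypothetical names made up for illustration. It shows only the general idea described above: judge a path by the state it leaves behind, not by how each skill looks alone.

```python
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """Toy sandbox (hypothetical): private data, shared context, external outbox."""
    private_store: dict = field(default_factory=lambda: {"api_key": "SECRET-123"})
    context: dict = field(default_factory=dict)
    outbox: list = field(default_factory=list)


def read_config(sb: Sandbox) -> None:
    """Hypothetical skill: copies private config into the shared context."""
    sb.context["config"] = dict(sb.private_store)


def post_status(sb: Sandbox) -> None:
    """Hypothetical skill: sends the current shared context to an external channel."""
    sb.outbox.append(dict(sb.context))


def leaked(sb: Sandbox) -> bool:
    """Downstream state check: did the dummy secret reach the outbox?"""
    secret = sb.private_store["api_key"]
    return any(secret in str(message) for message in sb.outbox)


def run_path(skills) -> bool:
    """Run a sequence of skills in one fresh sandbox and report whether a leak occurred."""
    sb = Sandbox()
    for skill in skills:
        skill(sb)
    return leaked(sb)


# Each skill alone leaves no leak; the composed path does.
isolated_results = [run_path([read_config]), run_path([post_status])]
composed_result = run_path([read_config, post_status])
```

In this toy setup neither skill leaks anything when run alone, but running `read_config` before `post_status` in the same context moves the secret into the outbox. Reversing the order does not, which is why the check is made per path and not per skill.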
Key Findings: Attack Success Rates Under Composition
The paper reports stark contrasts between isolated and composed evaluations. The table below summarizes the reported results for each sub-benchmark: attack success rate (ASR) for SCR-CapFlow and SCR-TrustLift, and risky-approval rate for SCR-AuthBlur.
| Sub-benchmark | Isolated baseline | Composed path | Change |
|---|---|---|---|
| SCR-CapFlow | ASR ~0% | ASR 33.6% | From near zero to 33.6% (ratio undefined for a near-zero baseline) |
| SCR-TrustLift (4 of 5 backends) | ASR ~0% | ASR >96.5% | From near zero to >96.5% (ratio undefined for a near-zero baseline) |
| SCR-AuthBlur (L1 context) | Risky-approval rate at L0 (isolated) | Risky-approval rate above L0 | +71.8% relative to L0 |
According to the paper, composed paths expose risks largely absent under isolated evaluation. In SCR-CapFlow, attack success rate reaches 33.6% under composition, compared with near-zero isolated baselines. For SCR-TrustLift, the attack success rate exceeds 96.5% on four of five backends. In SCR-AuthBlur, the risky-approval rate increases by 71.8% relative to the L0 isolated baseline under the L1 context setting.
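For readers who want to reproduce this kind of comparison on their own trials, here is a minimal sketch of the arithmetic, assuming ASR is simply the fraction of trials in which the attack succeeded. The function names and the trial outcomes are invented for illustration and are not the paper's data. When the isolated baseline is zero, a multiplicative factor is undefined, so the sketch reports the absolute gap and leaves the ratio empty.

```python
def attack_success_rate(outcomes) -> float:
    """Fraction of trials in which the attack succeeded (True = success)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("at least one trial is required")
    return sum(1 for succeeded in outcomes if succeeded) / len(outcomes)


def compare(isolated, composed) -> dict:
    """Compare isolated and composed ASR; the ratio is None when the baseline is zero."""
    iso = attack_success_rate(isolated)
    comp = attack_success_rate(composed)
    return {
        "isolated": iso,
        "composed": comp,
        "absolute_gap": comp - iso,
        "ratio": comp / iso if iso > 0 else None,
    }


# Illustrative outcomes only (NOT results from the paper).
isolated_trials = [False] * 10
composed_trials = [True] * 3 + [False] * 7
report = compare(isolated_trials, composed_trials)
```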
Implications for Enterprise Security
For CTOs and technology leaders integrating agent ecosystems, the findings underscore that agent skill security must be assessed at the level of activated paths rather than isolated artifacts. A skill that passes all individual checks could, when combined with others, enable unauthorized operations, data exfiltration, or privilege escalation. The paper positions SCR and SCR-Bench as a foundation for path-aware risk evaluation and defense in LLM agent skill ecosystems. Enterprises relying on agent workflows—such as automated supply-chain decisions or trade documentation processing—should incorporate path-level security testing before deployment.
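One way to approximate path-level testing before deployment is to enumerate short skill sequences and flag any in which sensitive data can reach an external sink. The sketch below is an assumption-laden illustration, not the paper's method: `SkillSpec`, the label names, and the three example skills are hypothetical, and real skills would need their data flows observed in a sandbox instead of declared by hand.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class SkillSpec:
    """Hypothetical declaration of what a skill does to the shared context."""
    name: str
    adds_labels: frozenset = frozenset()  # data labels the skill places in context
    external_sink: bool = False           # True if the skill sends context outside


SENSITIVE = frozenset({"customer_pii"})  # assumed policy: these labels must not leave


def path_violates(path) -> bool:
    """True if an external sink runs after sensitive labels entered the context."""
    labels = set()
    for skill in path:
        if skill.external_sink and labels & SENSITIVE:
            return True
        labels |= skill.adds_labels
    return False


def risky_paths(skills, max_len=3):
    """Enumerate ordered skill sequences up to max_len and return the flagged ones."""
    flagged = []
    for length in range(1, max_len + 1):
        for path in permutations(skills, length):
            if path_violates(path):
                flagged.append(tuple(skill.name for skill in path))
    return flagged


# Hypothetical skill catalog for illustration.
catalog = [
    SkillSpec("crm_lookup", adds_labels=frozenset({"customer_pii"})),
    SkillSpec("format_report"),
    SkillSpec("email_sender", external_sink=True),
]
flagged_pairs = risky_paths(catalog, max_len=2)
flagged_all = risky_paths(catalog, max_len=3)
```

No single skill in this catalog is flagged on its own; only sequences in which `crm_lookup` precedes `email_sender` are. Exhaustive enumeration grows quickly with catalog size, so a practical check would likely restrict itself to paths the agent can actually activate.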
The preprint, authored by researchers Xie, Du, Jiawei, Cheng, Yu, Zhou, Jiuan, Yin, and Zhaoxia, is available on arXiv and includes a public benchmark repository for further study.