The rapid growth of open-source LLM agent ecosystems has introduced new security challenges, particularly from community-contributed skills that extend agent capabilities. These modular tool definitions often go unvetted, leaving systems vulnerable to attacks at the instruction layer that traditional code scanners cannot detect. To address this gap, researchers have developed SkillVetBench, a live public leaderboard on Hugging Face that employs an LLM-as-Judge framework to evaluate agent skills across multiple security dimensions.
The Problem: Code-Layer Blindness
Existing security scanners operate at the code layer and are structurally blind to instruction-layer and multi-agent risks. These include natural-language directives that can hijack an agent, exfiltrate data through encoded side channels, or chain harm across processing pipelines. According to the SkillVetBench paper, conventional tools miss between 89% and 100% of instruction-layer threats such as Prompt Injection and Memory Poisoning. For example, the code analysis tool CODEBERT detected none of nine memory-poisoning skills.
SkillVetBench and the SARS Metric
SkillVetBench introduces the Skill Agentic Risk Score (SARS), a five-dimensional agentic-risk metric with a principled weighted formula designed for instruction-following systems. The platform integrates full CVSS v4.0 vector decomposition and features a ClawHub dual-view, which places the LLM-generated review alongside the official marketplace verdict. This allows users to compare automated assessments with human moderation directly.
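The text above does not list the five SARS dimensions or their weights, so the following is only a minimal sketch of what a weighted multi-dimensional risk score could look like. The dimension names, the weights, and the `weighted_risk_score` function are all hypothetical placeholders, not the paper's formula.

```python
from typing import Mapping

# Hypothetical dimension names and weights, for illustration only.
# The source does not specify SARS's actual dimensions or weights.
ILLUSTRATIVE_WEIGHTS: dict[str, float] = {
    "instruction_hijack": 0.30,
    "data_exfiltration": 0.25,
    "persistence": 0.20,
    "pipeline_propagation": 0.15,
    "privilege_scope": 0.10,
}


def weighted_risk_score(
    scores: Mapping[str, float],
    weights: Mapping[str, float] = ILLUSTRATIVE_WEIGHTS,
) -> float:
    """Return the weight-normalised average of per-dimension scores in [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1]")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total weight keeps the result in [0, 1]
    # even if the weights do not sum exactly to 1.
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Example per-dimension scores for one imaginary skill.
example_scores = {
    "instruction_hijack": 0.9,
    "data_exfiltration": 0.4,
    "persistence": 0.0,
    "pipeline_propagation": 0.2,
    "privilege_scope": 0.1,
}
example_total = weighted_risk_score(example_scores)
```

Normalising by the total weight is a design choice of this sketch: it lets the weights be tuned without the overall score leaving the [0, 1] range.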
Zero False Negatives, Zero False Positives
The LLM-as-Judge stage achieved zero false negatives across 78 confirmed-malicious skills and zero false positives across 22 benign controls in the companion benchmark study. In contrast, the best static baseline, SKILLSIEVE, still missed 15% of malicious skills. On this 100-skill benchmark, semantic LLM-based evaluation therefore outperformed traditional signature-based methods.
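To make the two headline rates concrete, here is a small sketch of how false-negative and false-positive rates follow from raw counts. The counts (78 malicious, 22 benign) come from the text above; the `DetectionCounts` class is a hypothetical helper written for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    """Raw counts from a labelled benchmark run (hypothetical helper)."""

    malicious_total: int
    malicious_flagged: int
    benign_total: int
    benign_flagged: int

    def __post_init__(self) -> None:
        if self.malicious_total <= 0 or self.benign_total <= 0:
            raise ValueError("both classes need at least one sample")
        if not 0 <= self.malicious_flagged <= self.malicious_total:
            raise ValueError("malicious_flagged out of range")
        if not 0 <= self.benign_flagged <= self.benign_total:
            raise ValueError("benign_flagged out of range")

    @property
    def false_negative_rate(self) -> float:
        # Share of malicious skills that were NOT flagged.
        missed = self.malicious_total - self.malicious_flagged
        return missed / self.malicious_total

    @property
    def false_positive_rate(self) -> float:
        # Share of benign skills that were wrongly flagged.
        return self.benign_flagged / self.benign_total


# Figures reported for the LLM-as-Judge stage.
llm_judge = DetectionCounts(
    malicious_total=78, malicious_flagged=78, benign_total=22, benign_flagged=0
)
```

If the 15% miss rate reported for SKILLSIEVE refers to the same 78 malicious skills (an assumption; the text does not say), it would correspond to roughly 12 missed skills.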
Instruction-Layer Threats: A Critical Blind Spot
| Threat Category | Conventional Tool Detection Rate | SkillVetBench Performance |
|---|---|---|
| Prompt Injection | 0–11% | Zero false negatives overall |
| Memory Poisoning | 0% (CODEBERT) | Zero false negatives overall |
Conventional code scanners fail to catch instruction-layer attacks because they lack semantic understanding. The SkillVetBench approach, by using an LLM as judge, can interpret natural-language commands and identify malicious intent that would otherwise slip through.
Variability Across LLM Evaluators
Detection rates varied from 35% to 95% across four LLM evaluators tested in the paper. This variability motivates the use of ensemble scoring in production deployments, where multiple judges vote on risk severity. The paper notes that no single LLM judge is sufficient for reliable security vetting.
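One simple way to combine several judges is a plurality vote over severity labels. The sketch below is an illustration under stated assumptions, not the paper's ensemble method: the five severity labels and the `ensemble_severity` function are hypothetical, and ties are broken toward the more severe label as a deliberately conservative choice.

```python
from collections import Counter
from typing import Sequence

# Hypothetical ordinal severity scale, least to most severe.
SEVERITY_ORDER = ("none", "low", "medium", "high", "critical")


def ensemble_severity(votes: Sequence[str]) -> str:
    """Return the most common severity label; ties go to the more severe label."""
    if not votes:
        raise ValueError("at least one vote is required")
    unknown = set(votes) - set(SEVERITY_ORDER)
    if unknown:
        raise ValueError(f"unknown severity labels: {sorted(unknown)}")
    counts = Counter(votes)
    top_count = max(counts.values())
    # Collect every label that shares the highest vote count.
    tied = [label for label, n in counts.items() if n == top_count]
    # Among tied labels, pick the one latest in SEVERITY_ORDER.
    return max(tied, key=SEVERITY_ORDER.index)


# Four imaginary judges voting on one skill.
verdict = ensemble_severity(["high", "high", "low", "medium"])
```

A plurality vote can still under-flag when most judges miss a threat, so a stricter deployment might instead take the maximum severity reported by any judge; which rule fits best depends on how costly false positives are.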
The researchers (Ismail Hossain, Sai Puppala, Md Jahangir Alam, Tanzim Ahad, and Sajedul Talukder) have made SkillVetBench publicly available on Hugging Face to help the open-source community vet agent skills before deployment. As LLM agents become more common in enterprise workflows, tools like SkillVetBench provide a critical layer of security that code-level scanners cannot offer. For technology procurement leaders and enterprise software buyers, the leaderboard represents an important step toward safe adoption of open-source AI components.