SkillVetBench Uses LLM-as-Judge to Evaluate Security Risks in Open-Source Agent Skills

SkillVetBench, a live Hugging Face leaderboard, uses an LLM-as-Judge approach to vet open-source LLM agent skills for security risks. It introduces the Skill Agentic Risk Score (SARS) and integrates CVSS v4.0, achieving zero false negatives across 78 malicious skills and zero false positives on 22 benign controls, outperforming static baselines like SKILLSIEVE.

iGEN Editorial

June 16, 2026

SkillVetBench Uses LLM-as-Judge to Evaluate Security Risks in Open-Source Agent Skills

The rapid growth of open-source LLM agent ecosystems has introduced new security challenges, particularly from community-contributed skills that extend agent capabilities. These modular tool definitions often go unvetted, leaving systems vulnerable to attacks at the instruction layer that traditional code scanners cannot detect. To address this gap, researchers have developed SkillVetBench, a live public leaderboard on Hugging Face that employs an LLM-as-Judge framework to evaluate agent skills across multiple security dimensions.

The Problem: Code-Layer Blindness

Existing security scanners operate at the code layer and are structurally blind to instruction-layer and multi-agent risks. These include natural-language directives that can hijack an agent, exfiltrate data through encoded side channels, or chain harm across processing pipelines. According to the SkillVetBench paper, conventional tools miss between 89% and 100% of instruction-layer threats such as Prompt Injection and Memory Poisoning. For example, the code analysis tool CODEBERT detected none of nine memory-poisoning skills.

SkillVetBench and the SARS Metric

SkillVetBench introduces the Skill Agentic Risk Score (SARS), a five-dimensional agentic-risk metric with a principled weighted formula designed for instruction-following systems. The platform integrates full CVSS v4.0 vector decomposition and features a ClawHub dual-view, which places the LLM-generated review alongside the official marketplace verdict. This allows users to compare automated assessments with human moderation directly.

Zero False Negatives, Zero False Positives

The LLM-as-Judge stage achieved zero false negatives across 78 confirmed-malicious skills and zero false positives across 22 benign controls in the companion benchmark study. In contrast, the best static baseline, SKILLSIEVE, still missed 15% of malicious skills. This demonstrates the effectiveness of semantic, LLM-based evaluation over traditional signature-based methods.

Instruction-Layer Threats: A Critical Blind Spot

Threat Category	Conventional Tool Detection Rate	SkillVetBench Performance
Prompt Injection	0–11%	Zero false negatives overall
Memory Poisoning	0% (CODEBERT)	Zero false negatives overall

Conventional code scanners fail to catch instruction-layer attacks because they lack semantic understanding. The SkillVetBench approach, by using an LLM as judge, can interpret natural-language commands and identify malicious intent that would otherwise slip through.

Variability Across LLM Evaluators

Detection rates varied from 35% to 95% across four LLM evaluators tested in the paper. This variability motivates the use of ensemble scoring in production deployments, where multiple judges vote on risk severity. The paper notes that no single LLM judge is sufficient for reliable security vetting.

The researchers—Hossain, Ismail, Puppala, Sai, Alam, Md Jahangir, Ahad, Tanzim, and Talukder, Sajedul—have made SkillVetBench publicly available on Hugging Face to help the open-source community vet agent skills before deployment. As LLM agents become more common in enterprise workflows, tools like SkillVetBench provide a critical layer of security that code-level scanners cannot offer. For technology procurement leaders and enterprise software buyers, this represents an important step toward safe adoption of open-source AI components.

Sources:

SkillVetBench Uses LLM-as-Judge to Evaluate Security Risks in Open-Source Agent Skills

The Problem: Code-Layer Blindness

SkillVetBench and the SARS Metric

Zero False Negatives, Zero False Positives

Instruction-Layer Threats: A Critical Blind Spot

Variability Across LLM Evaluators

Recommended Stories

Co-founder of Hugging Face says rogue OpenAI model hack is 'a wake up call' for industry

Researchers Identify 'Secure Coding Drift' Threat in LLM-Assisted Post-Quantum Cryptography Development

SAMark Watermarking Breaks Paraphrase Robustness Barrier for AI-Generated Text

AutoDojo: Adaptive Attacks Expose Superficial Defenses and Structural Limits in LLM Agents