Organizations deploying large language model (LLM) agents face a persistent question: do the structured skill packages added to these agents actually improve performance? Until now, there was no standard way to measure that. A new benchmark called SkillsBench provides the first systematic answer, according to a paper from a team of researchers led by Xiangyi Li.
SkillsBench is a benchmark designed to evaluate how well agent skills work across diverse tasks. Agent skills are structured packages of procedural knowledge that augment LLM agents at inference time. SkillsBench currently contains 87 tasks across 8 domains, each paired with curated skills and a deterministic verifier. The researchers ran all 87 tasks under matched no-skills and curated-skills conditions for 18 model-harness configurations.
Key Findings
The results show a clear benefit from curated skills. According to the paper, the average pass rate rose from 33.9% without skills to 50.5% with skills, a gain of +16.6 percentage points, or a 25.5% normalized gain. Configuration-level gains ranged from +4.1 to +25.7 percentage points. Focused skills with at most three modules outperformed larger or exhaustive bundles. The researchers also found that smaller models with skills can match larger models without skills, suggesting that skills can narrow the gap between model sizes.
| Condition | Average Pass Rate | Gain (pp) |
|---|---|---|
| No Skills | 33.9% | — |
| Curated Skills | 50.5% | +16.6 |
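
The arithmetic behind the table can be checked with a short Python sketch. The function names are illustrative, and the definition of normalized gain used here (the gain divided by the headroom left above the baseline) is an assumption, since this article does not state the paper's formula. Applied to the two aggregate averages, that definition gives roughly 25.1% rather than the reported 25.5%, so the paper may aggregate differently, for example per configuration or per task.

```python
def percentage_point_gain(without_skills: float, with_skills: float) -> float:
    """Absolute difference between two pass rates, in percentage points."""
    return with_skills - without_skills


def normalized_gain(without_skills: float, with_skills: float) -> float:
    """Gain as a percentage of the headroom above the baseline.

    Assumed definition (not confirmed by the source):
        100 * (with - without) / (100 - without)
    """
    headroom = 100.0 - without_skills
    if headroom <= 0:
        raise ValueError("baseline pass rate leaves no headroom")
    return 100.0 * (with_skills - without_skills) / headroom


# Aggregate pass rates reported in the table above.
baseline, curated = 33.9, 50.5

print(round(percentage_point_gain(baseline, curated), 1))  # 16.6
print(round(normalized_gain(baseline, curated), 1))        # 25.1
```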
Benchmark Design
SkillsBench establishes paired evaluation as the foundation for rigorous measurement of skill efficacy on agentic, expertise-heavy work: every task is run under matched no-skills and curated-skills conditions, so a difference in pass rate can be attributed to the skills. The benchmark includes tasks across 8 domains, though the paper does not specify which domains. Each task has a deterministic verifier to ensure objective scoring. The researchers tested 18 model-harness configurations, each pairing an LLM with an agent harness. The paper is available on arXiv under a Creative Commons license.
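
To make the paired design concrete, here is a minimal Python sketch of how such an evaluation could be structured. It is not the SkillsBench code: `Task`, `pass_rate`, `paired_evaluation`, and `toy_agent` are hypothetical names, and the agent is modeled as a plain function that receives a prompt and, optionally, a list of skill texts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: a prompt, its curated skills, and a verifier."""
    name: str
    prompt: str
    skills: Sequence[str]
    verify: Callable[[str], bool]  # deterministic check on the agent's output


# Hypothetical agent interface: (prompt, skills or None) -> output text.
Agent = Callable[[str, Optional[Sequence[str]]], str]


def pass_rate(agent: Agent, tasks: Sequence[Task], use_skills: bool) -> float:
    """Percentage of tasks whose output passes the task's verifier."""
    if not tasks:
        raise ValueError("at least one task is required")
    passed = 0
    for task in tasks:
        output = agent(task.prompt, task.skills if use_skills else None)
        passed += bool(task.verify(output))
    return 100.0 * passed / len(tasks)


def paired_evaluation(agent: Agent, tasks: Sequence[Task]) -> Dict[str, float]:
    """Run the same agent on the same tasks with and without skills."""
    without = pass_rate(agent, tasks, use_skills=False)
    with_skills = pass_rate(agent, tasks, use_skills=True)
    return {
        "no_skills": without,
        "curated_skills": with_skills,
        "gain_pp": with_skills - without,
    }


def toy_agent(prompt: str, skills: Optional[Sequence[str]]) -> str:
    """Stand-in agent: echoes the first skill if given one, else gives up."""
    return skills[0] if skills else "unknown"


toy_tasks = [
    Task("needs-skill", "What is the code?", ["42"], lambda out: out == "42"),
    Task("easy", "Say anything.", ["hello"], lambda out: out != ""),
]

result = paired_evaluation(toy_agent, toy_tasks)
print(result)
```

Because the agent, the tasks, and the verifiers are held fixed across both runs, the only variable that changes is whether skills are supplied, which is what lets the gain be read as the effect of the skills.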
Implications for Enterprise AI
For enterprise technology decision-makers, SkillsBench offers a method to quantify the value of agent skills before deployment. The finding that focused skills outperform larger bundles suggests that organizations should prioritize targeted, concise skill packages over exhaustive ones. Additionally, the ability of smaller models augmented with skills to match larger models without skills could reduce computational costs while maintaining performance. This is particularly relevant for applications in supply chain, logistics, and other domains where AI agents handle complex, multi-step tasks. The benchmark code and data are available for download, enabling organizations to evaluate their own agent configurations.