Enterprise software supply chains rely on automated vulnerability scanning to catch flaws before they reach production. A new benchmark from Snyk, titled Snyk VulnBench JS 1.0, raises a critical question: Can large language models (LLMs) find the same bugs consistently when re-run on identical code?
According to the paper, the answer is largely no. The researchers ran 300 repeated vulnerability-finding scans to measure how repeatable agentic LLM security review is on the same JavaScript code, prompt, and benchmark harness.
The Repeatability Problem
The headline finding, as stated in the abstract, is that "LLM security findings were unevenly repeatable: reference-matched findings were stable, but extra model reports varied heavily from run to run." Across 250 model runs, 80 of 161 unique unmatched findings appeared in only one of five identical repetitions, while only 22 appeared in all five. By contrast, when the LLM Claude matched a Snyk Code reference finding, the behavior was much more stable: 134 of 158 unique reference-matched findings appeared in all five repetitions.
| Finding Category | Unique Findings | Appeared in All 5 Runs | Appeared in Only 1 Run |
|---|---|---|---|
| Unmatched (extra) | 161 | 22 | 80 |
| Reference-matched | 158 | 134 | — |
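To put the table's raw counts on a common scale, the short sketch below converts them to percentages. The counts are the ones reported above. The "two to four runs" figure is derived by subtraction and is not stated separately in the source, and the helper name `share` is purely illustrative.

```python
# Counts taken from the table above.
UNMATCHED_UNIQUE = 161
UNMATCHED_ALL_FIVE = 22
UNMATCHED_ONLY_ONE = 80
MATCHED_UNIQUE = 158
MATCHED_ALL_FIVE = 134


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Derived by subtraction (an inference, not a figure quoted in the article):
# unmatched findings that appeared in two, three, or four of the five runs.
unmatched_two_to_four = UNMATCHED_UNIQUE - UNMATCHED_ALL_FIVE - UNMATCHED_ONLY_ONE

unmatched_only_one_pct = share(UNMATCHED_ONLY_ONE, UNMATCHED_UNIQUE)
unmatched_all_five_pct = share(UNMATCHED_ALL_FIVE, UNMATCHED_UNIQUE)
matched_all_five_pct = share(MATCHED_ALL_FIVE, MATCHED_UNIQUE)

print(f"Unmatched, only one run:  {unmatched_only_one_pct}%")
print(f"Unmatched, all five runs: {unmatched_all_five_pct}%")
print(f"Matched, all five runs:   {matched_all_five_pct}%")
```

Under these numbers, roughly half of the unmatched findings were one-offs, while about 85 percent of the reference-matched findings recurred in every repetition.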
Benchmark Methodology
The benchmark, Snyk VulnBench JS 1.0, used agentic LLM scans on JavaScript code. Each scan was repeated exactly five times under identical conditions. The researchers compared the findings against a reference set from Snyk Code, Snyk's static application security testing (SAST) engine, which is deterministic and better at systematically enumerating repeated data-flow sinks.
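The repeatability measure described here can be sketched as a simple count of how many repetitions each unique finding shows up in. The following is a minimal illustration, not the paper's harness: the finding identifiers are hypothetical strings, and how the researchers actually deduplicate and match findings is not covered in this article.

```python
from collections import Counter
from typing import Iterable


def appearance_histogram(runs: Iterable[Iterable[str]]) -> Counter:
    """Map 'number of runs a finding appeared in' to 'number of unique findings'."""
    runs_per_finding: Counter = Counter()
    for run in runs:
        # Convert to a set so a finding reported twice in one run counts once.
        runs_per_finding.update(set(run))
    return Counter(runs_per_finding.values())


# Hypothetical identifiers for five repetitions of the same scan.
example_runs = [
    ["sqli:app.js:42", "xss:view.js:10"],
    ["sqli:app.js:42"],
    ["sqli:app.js:42", "path-traversal:fs.js:7"],
    ["sqli:app.js:42"],
    ["sqli:app.js:42", "xss:view.js:10"],
]

histogram = appearance_histogram(example_runs)
for run_count in sorted(histogram):
    print(f"appeared in {run_count} run(s): {histogram[run_count]} finding(s)")
```

In this toy input, one finding appears in all five runs, one in two runs, and one in a single run, which mirrors the stable-versus-one-off split the benchmark reports.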
Complementarity with SAST
Despite the poor repeatability of unmatched findings, the benchmark also shows that the two techniques complement each other. Models consistently found familiar, high-signal exploit shapes, and in one case surfaced a likely Snyk Code product gap: a vulnerability that the SAST engine missed. This suggests that LLMs can identify certain bug patterns that deterministic tools overlook.
However, the paper does not treat that result as a reason to drop SAST. "The results support combining agentic LLM review with deterministic SAST rather than treating either technique as a replacement for the other," the authors conclude.
Implications for Supply Chain Security
For enterprise technology decision-makers, these findings are directly relevant to cybersecurity in the software supply chain. Many vendors are incorporating LLM-based code review into their CI/CD pipelines, hoping to catch zero-days or complex logic flaws. The Snyk benchmark indicates that relying solely on LLMs for vulnerability detection introduces inconsistency: a finding may be reported on one scan and absent from the next, even with identical code, prompt, and harness.
In contrast, deterministic SAST tools like Snyk Code provide stable, repeatable coverage, especially for data-flow vulnerabilities. The strategy the paper supports is a hybrid one: use SAST for systematic coverage and LLMs for exploratory analysis that may uncover gaps in the SAST rule set.
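One way a pipeline might act on this hybrid recommendation is sketched below. This is an illustration under stated assumptions, not a workflow taken from the paper: the `triage` function, the bucket names, the finding identifiers, and the `min_repeats` threshold of three are all hypothetical choices.

```python
from collections import Counter
from typing import Dict, Iterable, List


def triage(
    sast_findings: Iterable[str],
    llm_runs: Iterable[Iterable[str]],
    min_repeats: int = 3,  # assumed threshold, not a value from the paper
) -> Dict[str, List[str]]:
    """Split findings into SAST results, repeated LLM extras, and unstable LLM extras."""
    sast = set(sast_findings)

    # Count how many repeated LLM runs reported each finding.
    seen: Counter = Counter()
    for run in llm_runs:
        seen.update(set(run))

    # Keep only the LLM findings that the deterministic scan did not report.
    extras = {key: count for key, count in seen.items() if key not in sast}

    return {
        "sast": sorted(sast),
        "llm_repeated": sorted(k for k, n in extras.items() if n >= min_repeats),
        "llm_unstable": sorted(k for k, n in extras.items() if n < min_repeats),
    }


# Hypothetical identifiers: one deterministic scan plus five LLM repetitions.
result = triage(
    sast_findings=["sqli:app.js:42"],
    llm_runs=[
        ["sqli:app.js:42", "idor:orders.js:88"],
        ["sqli:app.js:42", "idor:orders.js:88"],
        ["sqli:app.js:42", "idor:orders.js:88", "xss:view.js:10"],
        ["sqli:app.js:42"],
        ["sqli:app.js:42"],
    ],
)
print(result)
```

The design choice here is that the deterministic scan defines the baseline, while LLM-only findings are ranked by how often they recur. Repeated extras are candidates for closer review as possible rule-set gaps, and one-off extras are kept for inspection rather than silently discarded.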
The research was conducted by Liran Tal, Johannes Kloos, Arsenii Rudich, Stephen Thoemmes, and Nair at Snyk. The full paper is available on arXiv under a Creative Commons license.