Snyk VulnBench JS 1.0 Reveals LLM Security Reviews Are Unrepeatable: Can They Find the Same Bugs Twice?

A new benchmark from Snyk finds that agentic LLM security reviews are unevenly repeatable: 80 of 161 unique unmatched findings appeared in only one of five identical runs. By contrast, Claude's reference-matched findings were stable, and Snyk Code SAST was deterministic. The study argues for combining LLM and SAST approaches rather than treating either as a replacement for the other.

iGEN Editorial

June 16, 2026


Enterprise software supply chains rely on automated vulnerability scanning to catch flaws before they reach production. A new benchmark from Snyk, titled Snyk VulnBench JS 1.0, raises a critical question: Can large language models (LLMs) find the same bugs consistently when re-run on identical code?

According to the paper, the answer is largely no. The researchers ran 300 repeated vulnerability-finding scans to measure how repeatable agentic LLM security review is on the same JavaScript code, prompt, and benchmark harness.

The Repeatability Problem

The headline finding, as stated in the abstract, is that "LLM security findings were unevenly repeatable: reference-matched findings were stable, but extra model reports varied heavily from run to run." Across 250 model runs, 80 of 161 unique unmatched findings appeared in only one of five identical repetitions, while only 22 appeared in all five. By contrast, results were much more stable when Claude matched a Snyk Code reference finding: 134 of 158 unique reference-matched findings appeared in all five repetitions.

Finding Category	Unique Findings	Appeared in All 5 Runs	Appeared in Only 1 Run
Unmatched (extra)	161	22	80
Reference-matched	158	134	—
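
To put the table's counts on a common scale, the short Python sketch below converts them to percentages. It uses only the figures reported above; the variable and function names are illustrative, and the "2 to 4 runs" bucket is simply the remainder of the unmatched total, not a number taken from the paper.

```python
# Counts as reported in the table above; nothing is re-derived from raw data.
UNMATCHED_TOTAL = 161
UNMATCHED_ALL_FIVE = 22
UNMATCHED_ONLY_ONE = 80
MATCHED_TOTAL = 158
MATCHED_ALL_FIVE = 134


def share(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


# Remainder: unmatched findings seen in 2, 3, or 4 of the five runs.
unmatched_two_to_four = UNMATCHED_TOTAL - UNMATCHED_ALL_FIVE - UNMATCHED_ONLY_ONE

unmatched_once_pct = share(UNMATCHED_ONLY_ONE, UNMATCHED_TOTAL)
unmatched_stable_pct = share(UNMATCHED_ALL_FIVE, UNMATCHED_TOTAL)
matched_stable_pct = share(MATCHED_ALL_FIVE, MATCHED_TOTAL)

print(f"Unmatched, only 1 run:   {unmatched_once_pct}%")
print(f"Unmatched, all 5 runs:   {unmatched_stable_pct}%")
print(f"Unmatched, 2 to 4 runs:  {unmatched_two_to_four} findings")
print(f"Reference-matched, all 5 runs: {matched_stable_pct}%")
```

Roughly half of the extra findings surfaced once and never again, while about 85 percent of reference-matched findings appeared in every run.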

Benchmark Methodology

The benchmark, Snyk VulnBench JS 1.0, used agentic LLM scans on JavaScript code. Each scan was repeated exactly five times under identical conditions. The researchers compared the findings against a reference set from Snyk Code, Snyk's static application security testing (SAST) engine, which was deterministic and better than the models at systematically enumerating repeated data-flow sinks.
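
The repeatability measurement described here can be approximated with a simple count of how many runs each unique finding appears in. The sketch below is not the paper's harness: the finding key (file, line, CWE identifier) and the toy data are assumptions made for illustration, and a real benchmark would need a more careful rule for deciding when two reports are the same finding.

```python
from collections import Counter
from typing import Hashable, Iterable


def run_frequency(runs: Iterable[Iterable[Hashable]]) -> Counter:
    """Count, for each unique finding, the number of runs in which it appears."""
    counts: Counter = Counter()
    for findings in runs:
        # De-duplicate within a run so a finding counts at most once per run.
        counts.update(set(findings))
    return counts


def frequency_histogram(counts: Counter) -> dict[int, int]:
    """Map 'appeared in k runs' to the number of unique findings with that k."""
    return dict(sorted(Counter(counts.values()).items()))


# Hypothetical data: five repetitions of the same scan on the same code.
example_runs = [
    {("app.js", 10, "CWE-79"), ("db.js", 42, "CWE-89")},
    {("app.js", 10, "CWE-79")},
    {("app.js", 10, "CWE-79"), ("util.js", 7, "CWE-22")},
    {("app.js", 10, "CWE-79")},
    {("app.js", 10, "CWE-79"), ("db.js", 42, "CWE-89")},
]

example_counts = run_frequency(example_runs)
example_histogram = frequency_histogram(example_counts)
print(example_histogram)  # {1: 1, 2: 1, 5: 1}
```

In this toy example one finding is fully repeatable (five of five runs), one appears twice, and one appears only once, which is the same kind of breakdown the article's table reports.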

Complementarity with SAST

Despite the unrepeatability of unmatched findings, the benchmark also shows complementarity. Models consistently found familiar, high-signal exploit shapes, and in one case surfaced a likely Snyk Code product gap — a vulnerability that the SAST engine missed. This suggests that LLMs can identify certain bug patterns that deterministic tools overlook.

However, the paper warns against treating either technique as a replacement for the other. "The results support combining agentic LLM review with deterministic SAST rather than treating either technique as a replacement for the other," the authors conclude.

Implications for Supply Chain Security

For enterprise technology decision-makers, these findings are directly relevant to cybersecurity in the software supply chain. Many vendors are incorporating LLM-based code review into their CI/CD pipelines, hoping to catch zero-days or complex logic flaws. The Snyk benchmark demonstrates that relying solely on LLMs for vulnerability detection introduces inconsistency: a bug may be flagged on one scan and missed on the next, even with identical inputs.

In contrast, deterministic SAST tools like Snyk Code provide stable, repeatable coverage, especially for data-flow vulnerabilities. The strategy the paper's results support is a hybrid approach: use SAST for systematic coverage and LLMs for exploratory analysis that may uncover gaps in the SAST rule set.
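
One way a team might act on the hybrid recommendation is to treat SAST output as the baseline and sort the extra LLM findings by how often they recur across repeated runs. The sketch below is an illustration only, not something the paper prescribes: the Finding fields, the bucket names, and the min_runs threshold of 3 are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical normalized finding key; real tools report richer data."""
    file: str
    line: int
    rule: str


def triage(
    sast: set[Finding],
    llm_runs: list[set[Finding]],
    min_runs: int = 3,  # assumed threshold, not taken from the paper
) -> dict[str, set[Finding]]:
    """Bucket findings from one SAST scan and several repeated LLM scans."""
    # Each run is a set, so a finding counts at most once per run.
    seen = Counter(f for run in llm_runs for f in run)
    extra = set(seen) - sast
    return {
        "sast_baseline": set(sast),
        "confirmed_by_llm": sast & set(seen),
        "repeatable_extra": {f for f in extra if seen[f] >= min_runs},
        "exploratory_extra": {f for f in extra if seen[f] < min_runs},
    }


# Hypothetical example data.
sqli = Finding("db.js", 42, "sql-injection")
xss = Finding("app.js", 10, "xss")
proto = Finding("merge.js", 5, "prototype-pollution")
redos = Finding("re.js", 3, "redos")

buckets = triage(
    sast={sqli, xss},
    llm_runs=[{sqli, proto}, {sqli, proto, redos}, {sqli, proto}, {sqli}, {sqli, proto}],
)
for name, items in buckets.items():
    print(name, sorted(f.rule for f in items))
```

Under these assumptions, an extra finding that recurs in most runs is a candidate SAST gap worth investigating, while a one-off report is queued for lower-priority manual review rather than discarded or trusted outright.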

The research was conducted by Liran Tal, Johannes Kloos, Arsenii Rudich, Stephen Thoemmes, and Nair at Snyk. The full paper is available on arXiv under a Creative Commons license.

