Snyk VulnBench JS 1.0 Reveals LLM Security Reviews Are Unrepeatable: Can They Find the Same Bugs Twice?

A new benchmark from Snyk finds that agentic LLM security reviews are highly unrepeatable: 80 of 161 unique unmatched findings appeared in only one of five identical runs. By contrast, Claude's reference-matched findings were stable, and Snyk Code SAST was deterministic. The study argues for combining LLM and SAST approaches rather than treating them as replacements.

iGEN Editorial
June 16, 2026
Enterprise software supply chains rely on automated vulnerability scanning to catch flaws before they reach production. A new benchmark from Snyk, titled Snyk VulnBench JS 1.0, raises a critical question: Can large language models (LLMs) find the same bugs consistently when re-run on identical code?

According to the paper, the answer is largely no. The researchers ran 300 repeated vulnerability-finding scans to measure how repeatable agentic LLM security review is on the same JavaScript code, prompt, and benchmark harness.

The Repeatability Problem

The headline finding, as stated in the abstract, is that "LLM security findings were unevenly repeatable: reference-matched findings were stable, but extra model reports varied heavily from run to run." Across 250 model runs, 80 of 161 unique unmatched findings appeared in only one of five identical repetitions, while only 22 appeared in all five. By contrast, when Claude reported a finding that matched a Snyk Code reference finding, the result was much more stable: 134 of 158 unique reference-matched findings appeared in all five repetitions.

Finding category | Unique findings | Appeared in all 5 runs | Appeared in only 1 run
Unmatched (extra) | 161 | 22 | 80
Reference-matched | 158 | 134 | not stated
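
Counts like those in the table can be produced from raw scan output with a simple tally. The sketch below is a minimal illustration, assuming each run has already been reduced to a set of finding identifiers. The identifiers and the `appearance_histogram` helper are hypothetical and are not taken from the paper or its harness.

```python
from collections import Counter
from typing import Hashable, Iterable


def appearance_histogram(runs: Iterable[set[Hashable]]) -> Counter:
    """Map 'number of runs a finding appeared in' to 'number of unique findings'."""
    per_finding = Counter()
    for findings in runs:
        # Each run is a set, so a finding counts at most once per run.
        per_finding.update(findings)
    return Counter(per_finding.values())


# Five hypothetical repetitions of the same scan (made-up identifiers).
runs = [
    {"sqli:app.js:42", "xss:view.js:10", "redos:util.js:7"},
    {"sqli:app.js:42", "xss:view.js:10"},
    {"sqli:app.js:42", "path:fs.js:3"},
    {"sqli:app.js:42", "xss:view.js:10"},
    {"sqli:app.js:42"},
]

hist = appearance_histogram(runs)
print(hist[5], "finding(s) appeared in all five runs")
print(hist[1], "finding(s) appeared in exactly one run")
```

Read against the reported figures, 80 of 161 is roughly half of the unique unmatched findings appearing only once, while 134 of 158 is about 85% of the reference-matched findings appearing every time.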

Benchmark Methodology

The benchmark, Snyk VulnBench JS 1.0, used agentic LLM scans on JavaScript code. Each scan was repeated exactly five times under identical conditions. The researchers compared the findings against a reference set from Snyk Code, Snyk's static application security testing (SAST) engine, which is deterministic and better at systematically enumerating repeated data-flow sinks.
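
The article does not describe how a model finding is judged to match a reference finding. As a rough sketch only, the split into reference-matched and unmatched findings could look like the following, assuming a match means identical file, line, and CWE. That matching rule, the `Finding` type, and `split_by_reference` are all assumptions made for illustration.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    """Hypothetical normalized finding; the real benchmark schema is not given."""
    file: str
    line: int
    cwe: str


def split_by_reference(model_findings, reference):
    """Partition model findings into (reference-matched, unmatched) sets."""
    model, ref = set(model_findings), set(reference)
    return model & ref, model - ref


# Made-up reference set standing in for deterministic SAST output.
reference = {Finding("app.js", 42, "CWE-89"), Finding("view.js", 10, "CWE-79")}
# Made-up output of a single model run.
model_run = {Finding("app.js", 42, "CWE-89"), Finding("util.js", 7, "CWE-1333")}

matched, unmatched = split_by_reference(model_run, reference)
print("matched:", sorted(matched))
print("unmatched:", sorted(unmatched))
```

Exact matching is the strictest possible rule. A real harness would likely need some tolerance, for example for line offsets, which is outside what this sketch covers.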

Complementarity with SAST

Despite the unrepeatability of unmatched findings, the benchmark also shows complementarity. Models consistently found familiar, high-signal exploit shapes, and in one case surfaced a likely Snyk Code product gap — a vulnerability that the SAST engine missed. This suggests that LLMs can identify certain bug patterns that deterministic tools overlook.

However, the paper warns against treating either technique as a replacement for the other. "The results support combining agentic LLM review with deterministic SAST rather than treating either technique as a replacement for the other," the authors conclude.

Implications for Supply Chain Security

For enterprise technology decision-makers, these findings are directly relevant to cybersecurity in the software supply chain. Many vendors are incorporating LLM-based code review into their CI/CD pipelines, hoping to catch zero-days or complex logic flaws. The Snyk benchmark demonstrates that relying solely on LLMs for vulnerability detection introduces inconsistency: a bug may be flagged on one scan and missed on the next, even with identical inputs.

In contrast, deterministic SAST tools like Snyk Code provide stable, repeatable coverage, especially for data-flow vulnerabilities. The optimal strategy, per the paper, is a hybrid approach: use SAST for systematic coverage and LLMs for exploratory analysis that may uncover gaps in the SAST rule set.
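
One way to picture the hybrid approach is as a triage step in a pipeline: SAST output forms the baseline, and LLM-only findings are ranked by how often they recur across repeated runs. The sketch below is an assumption-laden illustration, not the paper's method. The `triage` function, the `min_runs` threshold of 3, and the finding labels are all hypothetical.

```python
from collections import Counter


def triage(sast_findings, llm_runs, min_runs=3):
    """Combine deterministic SAST output with repeated LLM runs.

    Returns (baseline, review, explore):
      baseline - SAST findings, treated as the systematic coverage
      review   - LLM-only findings seen in at least `min_runs` runs
      explore  - LLM-only findings seen less often (noise or a rare insight)
    """
    baseline = set(sast_findings)
    seen = Counter()
    for run in llm_runs:
        # Count only what the LLM adds beyond the SAST baseline.
        seen.update(set(run) - baseline)
    review = {f for f, n in seen.items() if n >= min_runs}
    explore = set(seen) - review
    return baseline, review, explore


# Made-up finding labels: A and B come from SAST, C, D, and E only from the LLM.
sast = {"A", "B"}
llm_runs = [{"A", "C", "D"}, {"A", "C"}, {"C", "B"}, {"A"}, {"C", "E"}]

baseline, review, explore = triage(sast, llm_runs)
print("baseline:", sorted(baseline))
print("review:", sorted(review))
print("explore:", sorted(explore))
```

The threshold is a policy choice rather than a result from the benchmark. A low-recurrence finding is not necessarily a false positive, so the `explore` bucket is kept rather than discarded.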

The research was conducted by Liran Tal, Johannes Kloos, Arsenii Rudich, Stephen Thoemmes, and Nair at Snyk. The full paper is available on arXiv under a Creative Commons license.

