
DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents

Researchers present DualGauge, an automated framework for jointly evaluating correctness and security of code generated by LLMs from natural-language specifications. A benchmark of 307 tasks across three languages shows that even the strongest models achieve under 15% joint security-functionality success, while factors like scale and instruction tuning do not reliably improve outcomes. Three leading agentic coding systems also show no advantage over direct generation.

iGEN Editorial
June 16, 2026

Enterprises increasingly rely on large language models (LLMs) and LLM-based coding agents to generate code from natural-language specifications. However, ensuring that such code is both functionally correct and secure remains a critical challenge. A new research paper introduces DualGauge, the first fully automated framework for jointly evaluating the correctness and security of specification-only code generation, according to the study.

The DualGauge Framework

DualGauge is supported by DualGauge-Bench, a language-agnostic benchmark containing 307 coding tasks, each paired with functional and security tests derived from the same specification. The researchers evaluated 10 representative LLMs across Python, C++, and JavaScript, covering a range of model sizes and architectures. The framework automates the entire evaluation pipeline, from task generation to test execution, eliminating manual effort.
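
As a rough illustration of what "functional and security tests derived from the same specification" could look like, the sketch below pairs one specification with two small test suites and checks a candidate against both. The `Task` class, the `evaluate` helper, and the percentage-parsing task are hypothetical names invented for this example; they are not taken from DualGauge-Bench.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A check takes a candidate implementation and reports pass/fail.
Check = Callable[[Callable[[str], int]], bool]


@dataclass
class Task:
    """Hypothetical task record: one spec, two test suites."""
    spec: str
    functional_tests: List[Check]
    security_tests: List[Check]


def raises_value_error(fn: Callable[[str], int], arg: str) -> bool:
    """Return True only if fn(arg) raises ValueError."""
    try:
        fn(arg)
    except ValueError:
        return True
    return False


def evaluate(task: Task, candidate: Callable[[str], int]) -> Tuple[bool, bool]:
    """Run both suites and return (functional_ok, secure_ok)."""
    functional_ok = all(check(candidate) for check in task.functional_tests)
    secure_ok = all(check(candidate) for check in task.security_tests)
    return functional_ok, secure_ok


# Illustrative task: both suites are derived from the same sentence.
task = Task(
    spec="Parse a percentage from a string; raise ValueError outside 0-100.",
    functional_tests=[lambda fn: fn("42") == 42],
    security_tests=[lambda fn: raises_value_error(fn, "1000")],
)


def naive_parse(text: str) -> int:
    # Handles the happy path but ignores the range requirement.
    return int(text)


def careful_parse(text: str) -> int:
    value = int(text)
    if not 0 <= value <= 100:
        raise ValueError("percentage out of range")
    return value


print(evaluate(task, naive_parse))    # functional only
print(evaluate(task, careful_parse))  # functional and secure
```

In this toy case the naive candidate would look fine under functional testing alone, which is the kind of gap a paired security suite is meant to expose.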

Key Findings from the Benchmark

The results show that functional-correctness metrics alone substantially overestimate how often models produce code that is both correct and secure. Even the strongest model stays below 15% joint security-functionality success in every language tested. Common model-side factors, including increased scale, extended thinking, quantization, instruction tuning, and code specialization, do not reliably improve joint performance, which suggests that secure-and-correct code generation does not emerge automatically from stronger coding capability.
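
To see why a functional pass rate can overstate joint success, consider the minimal sketch below. It assumes a task counts as a joint success only when the same generated solution passes both its functional and its security tests; the article does not spell out the paper's exact metric definition, so treat that as an assumption. The `pass_rates` function and the four sample outcomes are invented for illustration and are not results from the study.

```python
from typing import Dict, List, Tuple


def pass_rates(results: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute functional, security, and joint pass rates.

    Each entry is (functional_ok, secure_ok) for one task.
    Assumption: joint success requires both flags on the same task.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to aggregate")
    functional = sum(1 for f, _ in results if f) / total
    secure = sum(1 for _, s in results if s) / total
    joint = sum(1 for f, s in results if f and s) / total
    return {"functional": functional, "secure": secure, "joint": joint}


# Made-up outcomes for four tasks (not data from the paper).
sample = [(True, False), (True, True), (False, True), (True, False)]
rates = pass_rates(sample)
print(rates)
```

Because the joint rate counts only tasks where both flags hold, it can never exceed the smaller of the two individual rates, and it can sit well below either one when the failures fall on different tasks.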

| Feature | Detail |
| --- | --- |
| Framework | DualGauge, first fully automated joint evaluation |
| Benchmark | DualGauge-Bench, 307 coding tasks |
| Languages | Python, C++, JavaScript |
| LLMs evaluated | 10 representative models |
| Agentic systems | Codex, OpenHands, Claude Code |
| Key result | Top model below 15% joint success in every language |
| Model factors tested | Scale, extended thinking, quantization, instruction tuning, code specialization |
| Impact of model factors | No reliable improvement on joint performance |

Agentic Coding Systems Under Scrutiny

The evaluation also included three leading agentic coding systems: Codex, OpenHands, and Claude Code. The researchers found that iterative scaffolding, where agents break tasks into subtasks and refine code, provides no advantage over direct LLM generation on specification-only tasks. This challenges the assumption that more complex agentic workflows inherently produce better code for simple specification-based tasks.

A qualitative audit of failures revealed two concentrated patterns: output contract boundary issues (where generated code fails to meet input/output specifications) and insufficient guards (where security checks exist but are inadequate). The researchers note that these patterns are only reliably exposed through joint benchmarking.
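
The "insufficient guards" pattern can be pictured with a small hypothetical example: a path-resolution helper whose security check exists but misses common inputs. This is not a case from the paper's audit. The function names, the `BASE` directory, and the sample inputs are all assumptions, and `posixpath` is used so the behavior is the same on any platform.

```python
import posixpath

BASE = "/srv/data"  # hypothetical directory the code must stay inside


def resolve_naive(base: str, name: str) -> str:
    """Guard exists but is inadequate: it only rejects a leading '../'."""
    if name.startswith("../"):
        raise ValueError("path escapes base directory")
    return posixpath.join(base, name)


def resolve_strict(base: str, name: str) -> str:
    """Normalize first, then check the result is still under base."""
    target = posixpath.normpath(posixpath.join(base, name))
    if posixpath.commonpath([base, target]) != base:
        raise ValueError("path escapes base directory")
    return target


def escapes_base(path: str) -> bool:
    """True if the normalized path falls outside BASE."""
    normalized = posixpath.normpath(path)
    return posixpath.commonpath([BASE, normalized]) != BASE


# Functional view: both helpers agree on an ordinary file name.
print(resolve_naive(BASE, "reports/q1.txt"))
print(resolve_strict(BASE, "reports/q1.txt"))

# Security view: the naive guard lets an embedded '..' through.
leaked = resolve_naive(BASE, "a/../../etc/passwd")
print(leaked, escapes_base(leaked))
```

A functional test on ordinary file names would pass for both helpers; only a security test with an embedded `..` or an absolute path separates them, which is consistent with the researchers' point that such failures surface under joint testing.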

Implications for Enterprise Software Development

For CTOs and technology leaders evaluating LLM-based code generation, these findings indicate that functional testing alone is insufficient. Enterprises adopting AI-assisted coding must implement combined security-functionality benchmarks to avoid deploying vulnerable code. The fact that model improvements do not automatically translate to better joint performance suggests that specialized approaches—such as security-constrained training or verification layers—may be necessary. The DualGauge framework provides a template for enterprises to create their own joint benchmarks tailored to their specific coding tasks and security requirements.

