DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents

Researchers present DualGauge, an automated framework for jointly evaluating correctness and security of code generated by LLMs from natural-language specifications. A benchmark of 307 tasks across three languages shows that even the strongest models achieve under 15% joint security-functionality success, while factors like scale and instruction tuning do not reliably improve outcomes. Three leading agentic coding systems also show no advantage over direct generation.

iGEN Editorial

June 16, 2026

Enterprises increasingly rely on large language models (LLMs) and LLM-based coding agents to generate code from natural-language specifications. Ensuring that such code is both functionally correct and secure remains a critical challenge. A new research paper introduces DualGauge, which its authors describe as the first fully automated framework for jointly evaluating the correctness and security of specification-only code generation.

The DualGauge Framework

DualGauge is supported by DualGauge-Bench, a language-agnostic benchmark containing 307 coding tasks, each paired with functional and security tests derived from the same specification. The researchers evaluated 10 representative LLMs across Python, C++, and JavaScript, covering a range of model sizes and architectures. The framework automates the entire evaluation pipeline, from task generation to test execution, eliminating manual effort.
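To make the pairing concrete, the sketch below shows one way a task could carry functional and security tests derived from the same specification. It is a minimal illustration, not DualGauge's actual schema: the names Task, functional_tests, security_tests, and evaluate are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A check takes the candidate implementation and reports pass/fail.
Check = Callable[[Callable], bool]


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative only."""
    spec: str
    functional_tests: List[Check] = field(default_factory=list)
    security_tests: List[Check] = field(default_factory=list)


def _passes(check: Check, candidate: Callable) -> bool:
    # A check that raises is counted as a failure, not as a crash of the harness.
    try:
        return bool(check(candidate))
    except Exception:
        return False


def evaluate(task: Task, candidate: Callable) -> Dict[str, bool]:
    """Run both test suites against one candidate and report the joint verdict."""
    functional = all(_passes(t, candidate) for t in task.functional_tests)
    secure = all(_passes(t, candidate) for t in task.security_tests)
    return {"functional": functional, "secure": secure, "joint": functional and secure}
```

Because both suites come from the same specification, a candidate counts as a joint success only when it passes every check in both.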

Key Findings from the Benchmark

The results indicate that functional-correctness metrics on their own substantially overestimate how reliably models generate code. Even the strongest model remains below 15% joint security-functionality success in every language tested. Common model-side factors, including increased scale, extended thinking, quantization, instruction tuning, and code specialization, do not reliably improve joint performance, suggesting that secure-and-correct code generation does not automatically emerge from stronger coding capability.
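The gap between functional and joint success follows from how the rates are counted: a task counts toward the joint rate only if it passes both suites, so the joint rate can never exceed either individual rate. The sketch below shows the arithmetic on made-up outcomes; the function name success_rates and the sample data are assumptions for illustration and are not results from the paper.

```python
from typing import Dict, Iterable, Tuple


def success_rates(results: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute functional, security, and joint pass rates.

    Each item is (passed_functional, passed_security) for one task.
    """
    rows = list(results)
    if not rows:
        raise ValueError("no results to aggregate")
    n = len(rows)
    return {
        "functional": sum(f for f, _ in rows) / n,
        "secure": sum(s for _, s in rows) / n,
        # A task is a joint success only when both suites pass.
        "joint": sum(f and s for f, s in rows) / n,
    }


# Made-up outcomes for four tasks (not data from the study).
sample = [(True, False), (True, True), (True, False), (False, True)]
rates = success_rates(sample)
print(rates)  # {'functional': 0.75, 'secure': 0.5, 'joint': 0.25}
```

In this toy sample, reporting only the functional rate would suggest 75% success, while a quarter of the tasks are both correct and secure.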

Feature	Detail
Framework	DualGauge, described as the first fully automated joint evaluation framework
Benchmark	DualGauge-Bench, 307 coding tasks
Languages	Python, C++, JavaScript
LLMs evaluated	10 representative models
Agentic systems	Codex, OpenHands, Claude Code
Key result	Strongest model below 15% joint success in every language
Model factors tested	Scale, extended thinking, quantization, instruction tuning, code specialization
Impact of model factors	No reliable improvement on joint performance

Agentic Coding Systems Under Scrutiny

The evaluation also included three leading agentic coding systems: Codex, OpenHands, and Claude Code. The researchers found that iterative scaffolding, in which agents break tasks into subtasks and refine code, provides no advantage over direct LLM generation on specification-only tasks. This challenges the assumption that more elaborate agentic workflows inherently produce better code when the only input is a specification.

A qualitative audit of failures revealed two concentrated patterns: output contract boundary issues (where generated code fails to meet input/output specifications) and insufficient guards (where security checks exist but are inadequate). The researchers note that these patterns are only reliably exposed through joint benchmarking.
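As an illustration of the "insufficient guards" pattern, consider a hypothetical task, not taken from the paper, that asks for a function resolving a file name inside a base directory while rejecting names that escape it. The first version below has a guard, but the guard only inspects the start of the name; the second compares normalized paths. Both function names and the example paths are assumptions, and POSIX-style absolute base paths are assumed throughout.

```python
import posixpath


def weak_resolve(base: str, name: str) -> str:
    """Guard present but inadequate: only a leading '..' is rejected."""
    if name.startswith(".."):
        raise ValueError("path escapes base directory")
    return posixpath.normpath(posixpath.join(base, name))


def strict_resolve(base: str, name: str) -> str:
    """Guard on the normalized result; assumes base is an absolute POSIX path."""
    base_norm = posixpath.normpath(base)
    target = posixpath.normpath(posixpath.join(base_norm, name))
    # The resolved path must still sit under the base directory.
    if posixpath.commonpath([base_norm, target]) != base_norm:
        raise ValueError("path escapes base directory")
    return target


# Functional check: both versions resolve an ordinary name identically.
print(weak_resolve("/srv/data", "a.txt"))    # /srv/data/a.txt
print(strict_resolve("/srv/data", "a.txt"))  # /srv/data/a.txt

# Security check: a traversal hidden mid-path slips past the weak guard.
print(weak_resolve("/srv/data", "docs/../../etc/passwd"))  # /srv/etc/passwd
```

A functional test suite alone would pass both versions; only a security test with a mid-path traversal separates them, which is the kind of failure the joint setup is meant to surface.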

Implications for Enterprise Software Development

For CTOs and technology leaders evaluating LLM-based code generation, these findings indicate that functional testing alone is insufficient. Enterprises adopting AI-assisted coding should pair functional tests with security tests derived from the same requirements to reduce the risk of deploying vulnerable code. Because model improvements do not automatically translate into better joint performance, specialized approaches such as security-constrained training or verification layers may be necessary. The DualGauge framework offers a template that enterprises could adapt to build joint benchmarks for their own coding tasks and security requirements.

Sources:

DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents
