Enterprises increasingly rely on large language models (LLMs) and LLM-based coding agents to generate code from natural-language specifications. Ensuring that this code is both functionally correct and secure remains a critical challenge. A new research paper introduces DualGauge, which its authors describe as the first fully automated framework for jointly evaluating the correctness and security of specification-only code generation.
The DualGauge Framework
DualGauge is supported by DualGauge-Bench, a language-agnostic benchmark containing 307 coding tasks, each paired with functional and security tests derived from the same specification. The researchers evaluated 10 representative LLMs across Python, C++, and JavaScript, covering a range of model sizes and architectures. The framework automates the entire evaluation pipeline, from task generation to test execution, eliminating manual effort.
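To make the idea of pairing functional and security tests with a single specification concrete, here is a minimal Python sketch. It is not the DualGauge-Bench format: the `Task` record, its field names, the `evaluate` helper, and the discount-parsing task are all hypothetical, chosen only to show how one specification can drive both kinds of test.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A test receives the candidate implementation and returns True on pass.
Test = Callable[[Callable], bool]


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative assumptions."""
    task_id: str
    specification: str
    functional_tests: List[Test] = field(default_factory=list)
    security_tests: List[Test] = field(default_factory=list)


def run_suite(tests: List[Test], candidate: Callable) -> bool:
    """A suite passes only if every test returns True without raising."""
    for test in tests:
        try:
            if not test(candidate):
                return False
        except Exception:
            return False
    return True


def evaluate(task: Task, candidate: Callable) -> Dict[str, bool]:
    """Run both suites against one candidate and report the joint outcome."""
    functional = run_suite(task.functional_tests, candidate)
    secure = run_suite(task.security_tests, candidate)
    return {"functional": functional, "secure": secure, "joint": functional and secure}


def rejects_out_of_range(parse: Callable) -> bool:
    """Security-style check: input outside 0..100 must raise ValueError."""
    try:
        parse("250")
    except ValueError:
        return True
    return False


# Made-up example task; both tests are derived from the same specification.
task = Task(
    task_id="discount-001",
    specification="Parse a discount percentage from 0 to 100; raise ValueError otherwise.",
    functional_tests=[lambda parse: parse("25") == 25],
    security_tests=[rejects_out_of_range],
)


def strict_parse(text: str) -> int:
    """Candidate that enforces the range stated in the specification."""
    value = int(text)
    if not 0 <= value <= 100:
        raise ValueError("discount out of range")
    return value


print(evaluate(task, int))           # naive candidate: the built-in int()
print(evaluate(task, strict_parse))  # candidate that enforces the range
```

In this toy case the built-in `int` satisfies the functional test but not the security test, so a functional-only evaluation would count it as a success while a joint evaluation would not.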
Key Findings from the Benchmark
The results indicate that functional correctness metrics alone substantially overestimate how reliably models generate usable code. Even the strongest model remains below 15% joint security-functionality success in every language tested. Common model-side factors (increased scale, extended thinking, quantization, instruction tuning, and code specialization) do not reliably improve joint performance, suggesting that secure-and-correct code generation does not automatically emerge from stronger coding capability.
| Feature | Detail |
|---|---|
| Framework | DualGauge, described by its authors as the first fully automated joint evaluation framework |
| Benchmark | DualGauge-Bench, 307 coding tasks |
| Languages | Python, C++, JavaScript |
| LLMs evaluated | 10 representative models |
| Agentic systems | Codex, OpenHands, Claude Code |
| Key result | Top model below 15% joint success in each language tested |
| Model factors tested | Scale, extended thinking, quantization, instruction tuning, code specialization |
| Impact of model factors | No reliable improvement on joint performance |
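The gap between functional and joint success can be illustrated with a short Python sketch. It assumes that a task counts as a joint success only when the generated code passes both its functional and its security tests; the paper's exact metric definition may differ. The `success_rates` function and the sample outcomes are hypothetical and are not results from the study.

```python
from typing import Dict, Iterable, Tuple


def success_rates(results: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute per-suite and joint success rates.

    Each item is (functional_pass, security_pass) for one task.
    Assumption: joint success means both suites pass on the same task.
    """
    outcomes = list(results)
    total = len(outcomes)
    if total == 0:
        raise ValueError("at least one task result is required")
    functional = sum(1 for f, _ in outcomes if f)
    secure = sum(1 for _, s in outcomes if s)
    joint = sum(1 for f, s in outcomes if f and s)
    return {
        "functional": functional / total,
        "secure": secure / total,
        "joint": joint / total,
    }


# Made-up outcomes for ten tasks, for illustration only.
sample = (
    [(True, False)] * 6    # functionally correct but insecure
    + [(True, True)] * 1   # correct and secure
    + [(False, True)] * 2  # secure but functionally wrong
    + [(False, False)] * 1 # fails both
)

rates = success_rates(sample)
print(rates)
```

With these invented numbers, seven of ten tasks pass their functional tests but only one passes both suites, which is the kind of overestimate a functional-only metric can hide.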
Agentic Coding Systems Under Scrutiny
The evaluation also included three leading agentic coding systems: Codex, OpenHands, and Claude Code. The researchers found that iterative scaffolding, where agents break tasks into subtasks and refine code, provides no advantage over direct generation by the underlying LLM on specification-only tasks. This challenges the assumption that more complex agentic workflows inherently produce better code when the only input is a specification.
A qualitative audit of failures revealed two concentrated patterns: output contract boundary issues (where generated code fails to meet input/output specifications) and insufficient guards (where security checks exist but are inadequate). The researchers note that these patterns are reliably exposed only through joint benchmarking.
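As an illustration of what an insufficient guard can look like, consider the following Python sketch. It is not taken from the paper: the file-path task, `BASE_DIR`, and both function names are hypothetical. The first function contains a security check that looks plausible but misses a case, and the second shows one stricter alternative.

```python
import os

BASE_DIR = "/srv/uploads"  # hypothetical storage root, assumed for this example


def resolve_naive(user_path: str) -> str:
    """Guard exists but is insufficient: it only rejects '..' segments."""
    if ".." in user_path:
        raise ValueError("path traversal rejected")
    # os.path.join discards BASE_DIR when user_path is absolute,
    # so an absolute path slips past the check above.
    return os.path.join(BASE_DIR, user_path)


def resolve_strict(user_path: str) -> str:
    """One stricter option: resolve the path, then verify containment."""
    base = os.path.realpath(BASE_DIR)
    candidate = os.path.realpath(os.path.join(BASE_DIR, user_path))
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("path escapes base directory")
    return candidate


# A functional test passes for the naive version...
print(resolve_naive("report.txt"))

# ...but a security test exposes the gap: the result is outside BASE_DIR.
print(resolve_naive("/etc/passwd"))

try:
    resolve_strict("/etc/passwd")
except ValueError as exc:
    print("strict version rejected input:", exc)
```

A functional test that only requests ordinary file names would pass for both versions. Only a security test that supplies an absolute path separates them, which matches the article's point that such failures surface when both kinds of test are run against the same code.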
Implications for Enterprise Software Development
For CTOs and technology leaders evaluating LLM-based code generation, these findings indicate that functional testing alone is insufficient. Enterprises adopting AI-assisted coding should evaluate security and functionality together to avoid deploying vulnerable code. Because model improvements do not automatically translate into better joint performance, specialized approaches such as security-constrained training or verification layers may be necessary. The DualGauge framework offers a template that enterprises can adapt to build joint benchmarks tailored to their own coding tasks and security requirements.