Norm-Agnostic Residual Networks Offer Path to Scaling Adaptive Depth in Deep Learning

Researchers introduce NAG, a norm-agnostic residual architecture that prevents later layers from being suppressed by norm growth. This enables training of much deeper models and introduces an interpretable Mixture-of-Depths mechanism that can serve as a pretraining scaling strategy, with 20-25% sparsity matching the full-depth baseline under equal compute.

iGEN Editorial

June 17, 2026

Deep learning models have grown increasingly deep, but a subtle structural limitation in residual architectures has hindered scaling: the norm of the residual stream grows rapidly with depth, reducing the relative impact of later layers. In a new paper on arXiv, researchers Tomás Figliolia and Beren Millidge introduce NAG, a norm-agnostic residual architecture that separates magnitude from directional information in the residual stream. This preserves meaningful layer contributions throughout depth and prevents later updates from being systematically suppressed by residual-norm growth.

The Problem with Deeper Networks

Residual architectures are ubiquitous in deep learning, but they suffer from a fundamental issue: as depth increases, the norm of the residual stream grows, causing updates from later layers to become small relative to the accumulated state. This reduces their impact on the representation and limits the benefits of scaling models in depth. The new research addresses this by decoupling magnitude and direction, allowing each layer to contribute effectively regardless of depth.
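The effect can be illustrated with a toy simulation. The sketch below is not the paper's model or the NAG formulation, which this article does not spell out. It assumes each layer writes a random unit-norm update into the stream, and the `renormalize` flag is a hypothetical stand-in for carrying only the direction forward. The function name and parameters are illustrative.

```python
import math
import random


def _norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def _unit(rng, dim):
    """Random unit vector, standing in for one layer's output."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = _norm(v)
    return [a / n for a in v]


def relative_update_sizes(depth=64, dim=256, renormalize=False, seed=0):
    """Return ||u_l|| / ||x_l|| per layer for a toy stream x_{l+1} = x_l + u_l.

    With renormalize=True the state is rescaled to unit norm after every
    update, so only its direction is carried forward. This is an assumed
    illustration of magnitude/direction decoupling, not the paper's method.
    """
    rng = random.Random(seed)
    x = _unit(rng, dim)
    ratios = []
    for _ in range(depth):
        u = _unit(rng, dim)
        ratios.append(1.0 / _norm(x))  # ||u|| is 1 by construction
        x = [a + b for a, b in zip(x, u)]
        if renormalize:
            n = _norm(x)
            x = [a / n for a in x]
    return ratios


plain = relative_update_sizes()
decoupled = relative_update_sizes(renormalize=True)
print(f"plain:     first {plain[0]:.3f}, last {plain[-1]:.3f}")
print(f"decoupled: first {decoupled[0]:.3f}, last {decoupled[-1]:.3f}")
```

In this toy setting the plain stream's relative update shrinks steadily with depth, because the accumulated norm keeps growing while each update stays the same size. The renormalized variant keeps the ratio constant. Trained networks will behave differently in detail; the sketch only shows why a fixed-size update matters less on top of a larger accumulated state.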

Introducing NAG: Norm-Agnostic Residual Networks

According to the paper, NAG introduces only a negligible number of additional parameters and relies on simple operations that are easily kernel-fusible, preserving training efficiency in practice. The architecture outperforms baseline Transformers, with gains that increase substantially as depth grows. This enables effective training of much deeper models than previously possible with standard residual networks.

Mixture-of-Depths: A New Scaling Axis

The norm-agnostic formulation also leads to an interpretable Mixture-of-Depths (MoD) mechanism that adaptively skips both attention and MLP layers. Beyond serving as a post-training accuracy-compute tradeoff, this mechanism can be used as a pretraining-time scaling strategy. Under iso-FLOP training, compute saved by reducing per-token forward-pass cost can be reinvested into training on more tokens while keeping the total parameter count and KV-cache budget fixed.
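The iso-FLOP reinvestment can be made concrete with a little arithmetic. The sketch below makes simplifying assumptions that are not taken from the paper: every layer costs the same, per-token cost scales linearly with the fraction of layers executed, and routing, embedding, and output-head costs are ignored. The function name and the 100B-token baseline are hypothetical.

```python
def iso_flop_tokens(baseline_tokens, skip_rate):
    """Tokens affordable at a fixed training budget when layers are skipped.

    Assumes budget = cost_per_token * tokens, and that skipping a fraction
    `skip_rate` of layers cuts cost_per_token by that same fraction.
    """
    if not 0.0 <= skip_rate < 1.0:
        raise ValueError("skip_rate must be in [0, 1)")
    relative_cost_per_token = 1.0 - skip_rate
    return baseline_tokens / relative_cost_per_token


baseline = 100e9  # hypothetical full-depth token budget
for rate in (0.20, 0.25):
    tokens = iso_flop_tokens(baseline, rate)
    print(f"skip {rate:.0%}: {tokens / baseline:.2f}x the baseline tokens")
```

Under these assumptions, skipping a fifth of the layers frees enough compute for a quarter more training tokens, and skipping a quarter frees enough for a third more, while the parameter count stays the same.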

Experimental Validation

According to the paper, moderate Mixture-of-Depths rates of approximately 20-25% matched full-depth baseline performance under equal training compute while substantially reducing the number of executed layer parameters and forward-pass FLOPs. The authors present these results as identifying sparsity in depth as a new scaling axis for fixed-compute training, enabling very deep yet FLOP-efficient models.

Implications for AI Infrastructure

The ability to train deeper models more efficiently has direct implications for enterprise AI infrastructure. By reinvesting compute savings from sparsity into additional training tokens, organizations can potentially achieve better model performance without increasing total parameter count or inference memory budgets. The NAG architecture, with its simple operations and minimal overhead, represents a practical approach to scaling deep learning in resource-constrained environments.
