Neural networks keep getting deeper, but a subtle structural limitation in residual architectures has hindered scaling: the norm of the residual stream grows rapidly with depth, reducing the relative impact of later layers. According to a new paper on arXiv, researchers Tomás Figliolia and Beren Millidge have introduced a norm-agnostic residual architecture called NAG that separates magnitude from directional information in the residual stream. The goal is to preserve meaningful layer contributions throughout depth and to prevent later updates from being systematically suppressed by residual-norm growth.
The Problem with Deeper Networks
Residual architectures are ubiquitous in deep learning, but they share a structural weakness: every layer adds its output to a running sum, so the norm of the residual stream grows with depth and updates from later layers become small relative to the accumulated state. This reduces their impact on the representation and limits the benefits of scaling models in depth. The new research addresses this by decoupling the magnitude of the residual stream from its direction, allowing each layer to contribute effectively regardless of depth.
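A toy simulation makes the norm-growth effect concrete. The sketch below is an illustration only and is not the paper's experimental setup: it assumes every layer adds a random unit-norm update to the stream, whereas real layers produce learned, correlated updates. The function name and parameters are hypothetical.

```python
import math
import random


def simulate_residual_stream(depth=64, dim=256, seed=0):
    """Toy model: each layer adds a random unit-norm update to the stream.

    Returns, for every layer, the ratio of the update norm to the norm of
    the residual stream right after that update has been added.
    """
    rng = random.Random(seed)
    stream = [0.0] * dim
    ratios = []
    for _ in range(depth):
        # Assumption: the layer output is a random direction with norm 1.
        update = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        update_norm = math.sqrt(sum(u * u for u in update))
        update = [u / update_norm for u in update]

        # Standard residual connection: x <- x + f(x).
        stream = [s + u for s, u in zip(stream, update)]
        stream_norm = math.sqrt(sum(s * s for s in stream))

        # The update has norm 1, so its relative size is 1 / ||stream||.
        ratios.append(1.0 / stream_norm)
    return ratios


ratios = simulate_residual_stream()
for layer in (1, 4, 16, 64):
    print(f"layer {layer:2d}: relative update size = {ratios[layer - 1]:.3f}")
```

Because random directions in high dimension are nearly orthogonal, the stream norm in this toy model grows roughly like the square root of the layer index, so an update of fixed size shrinks in relative terms as depth increases.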
Introducing NAG: Norm-Agnostic Residual Networks
According to the paper, NAG adds only a negligible number of parameters and relies on simple operations that are easy to fuse into kernels, so training efficiency is preserved in practice. The architecture is reported to outperform baseline Transformers, with gains that increase substantially as depth grows. This enables effective training of much deeper models than standard residual networks allow.
Mixture-of-Depths: A New Scaling Axis
The norm-agnostic formulation also leads to an interpretable Mixture-of-Depths (MoD) mechanism that adaptively skips both attention and MLP layers. Beyond serving as a post-training accuracy-compute tradeoff, this mechanism can be used as a pretraining-time scaling strategy. Under iso-FLOP training, compute saved by reducing per-token forward-pass cost can be reinvested into training on more tokens while keeping the total parameter count and KV-cache budget fixed.
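The iso-FLOP argument can be sketched as back-of-the-envelope arithmetic. The example assumes that training compute is proportional to per-token forward cost times the number of training tokens, and that skipping a fraction of layers cuts per-token cost by the same fraction, ignoring routing overhead and any compute that cannot be skipped. These simplifications and the function name are illustrative and do not come from the paper.

```python
def iso_flop_token_multiplier(skip_rate):
    """Extra training tokens affordable at fixed total compute.

    Assumption: total FLOPs ~ per_token_cost * tokens, and a skip rate r
    scales per_token_cost by (1 - r). Holding total FLOPs constant then
    allows tokens to grow by a factor of 1 / (1 - r).
    """
    if not 0.0 <= skip_rate < 1.0:
        raise ValueError("skip_rate must be in [0, 1)")
    return 1.0 / (1.0 - skip_rate)


for rate in (0.20, 0.25):
    multiplier = iso_flop_token_multiplier(rate)
    print(f"skip rate {rate:.0%}: train on {multiplier:.2f}x the tokens")
```

Under these assumptions, a 20% skip rate frees enough compute for about 1.25 times as many training tokens, and a 25% rate for about 1.33 times, with parameter count unchanged.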
Experimental Validation
In the paper's experiments, moderate Mixture-of-Depths rates of approximately 20% to 25% matched full-depth baseline performance under equal training compute while substantially reducing the number of layer parameters executed per token and the forward-pass FLOPs. The authors present these results as evidence that sparsity in depth is a new scaling axis for fixed-compute training, enabling very deep yet FLOP-efficient models.
Implications for AI Infrastructure
The ability to train deeper models more efficiently has direct implications for enterprise AI infrastructure. By reinvesting compute savings from sparsity into additional training tokens, organizations can potentially achieve better model performance without increasing total parameter count or inference memory budgets. The NAG architecture, with its simple operations and minimal overhead, represents a practical approach to scaling deep learning in resource-constrained environments.