
Norm-Agnostic Residual Networks Offer Path to Scaling Adaptive Depth in Deep Learning

Researchers introduce NAG, a norm-agnostic residual architecture that prevents later layers from being suppressed by residual-norm growth. This enables training of much deeper models and yields an interpretable Mixture-of-Depths mechanism that can serve as a pretraining scaling strategy, with skip rates of 20-25% matching the full-depth baseline under equal training compute.

iGEN Editorial
June 17, 2026
Deep learning models have grown increasingly deep, but a structural limitation in residual architectures has hindered further scaling: the norm of the residual stream grows rapidly with depth, reducing the relative impact of later layers. According to a new paper on arXiv, researchers Tomás Figliolia and Beren Millidge have introduced NAG, a norm-agnostic residual architecture that separates magnitude from directional information in the residual stream. The separation preserves meaningful layer contributions throughout depth and prevents later updates from being systematically suppressed by residual-norm growth.

The Problem with Deeper Networks

Residual architectures are ubiquitous in deep learning, but they suffer from a fundamental issue: as depth increases, the norm of the residual stream grows, causing updates from later layers to become small relative to the accumulated state. This reduces their impact on the representation and limits the benefits of scaling models in depth. The new research addresses this by decoupling magnitude and direction, allowing each layer to contribute effectively regardless of depth.
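The effect described above can be illustrated with a toy simulation. The sketch below is not from the paper: it assumes, purely for illustration, that every layer adds an independent random update of unit norm to the stream. The function name and parameters are hypothetical. Under that assumption the stream norm grows roughly with the square root of depth, so a fixed-size update shrinks relative to the accumulated state; trained networks may behave differently.

```python
import math
import random


def simulate_residual_norms(depth, dim=256, seed=0):
    """Return a list of (stream_norm, relative_update) pairs, one per layer.

    Toy assumption (not the paper's setup): each layer adds an independent
    random update that has been rescaled to unit norm.
    """
    rng = random.Random(seed)
    stream = [0.0] * dim
    history = []
    for _ in range(depth):
        update = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scale = math.sqrt(sum(u * u for u in update))
        update = [u / scale for u in update]  # rescale to unit norm
        stream = [s + u for s, u in zip(stream, update)]
        stream_norm = math.sqrt(sum(s * s for s in stream))
        # The update has norm 1, so its size relative to the stream is 1 / norm.
        history.append((stream_norm, 1.0 / stream_norm))
    return history


history = simulate_residual_norms(depth=64)
for layer in (1, 4, 16, 64):
    norm, rel = history[layer - 1]
    print(f"layer {layer:3d}: stream norm = {norm:6.2f}, relative update = {rel:.3f}")
```

In this toy setting the last layer's update is several times smaller, relative to the stream, than the first layer's, which is the suppression the article describes.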

Introducing NAG: Norm-Agnostic Residual Networks

NAG introduces only a negligible number of additional parameters and relies on simple operations that are easily kernel-fusible, preserving training efficiency in practice, the paper reports. The architecture outperforms baseline Transformers, with gains that increase substantially as depth grows. This enables effective training of much deeper models than previously possible with standard residual networks.

Mixture-of-Depths: A New Scaling Axis

The norm-agnostic formulation also leads to an interpretable Mixture-of-Depths (MoD) mechanism that adaptively skips both attention and MLP layers. Beyond serving as a post-training accuracy-compute tradeoff, this mechanism can be used as a pretraining-time scaling strategy. Under iso-FLOP training, compute saved by reducing per-token forward-pass cost can be reinvested into training on more tokens while keeping the total parameter count and KV-cache budget fixed.

Experimental Validation

In their experiments, moderate Mixture-of-Depths rates of approximately 20%-25% matched full-depth baseline performance under equal training compute while substantially reducing the number of executed layer parameters and forward-pass FLOPs, according to the paper. These results identify sparsity in depth as a new scaling axis for fixed-compute training, enabling very deep yet FLOP-efficient models.
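The iso-FLOP trade-off can be made concrete with a short back-of-the-envelope calculation. The sketch below rests on a simplifying assumption that is not taken from the paper: per-token training cost is proportional to the fraction of layers actually executed. The function name is hypothetical.

```python
def token_budget_multiplier(skip_rate):
    """Tokens affordable at fixed training FLOPs, relative to full depth.

    Simplifying assumption: per-token cost is proportional to the fraction
    of layers executed, so skipping a fraction `skip_rate` of layers leaves
    a per-token cost of (1 - skip_rate).
    """
    if not 0.0 <= skip_rate < 1.0:
        raise ValueError("skip_rate must be in [0, 1)")
    return 1.0 / (1.0 - skip_rate)


for rate in (0.0, 0.20, 0.25):
    print(f"skip rate {rate:.0%}: {token_budget_multiplier(rate):.2f}x tokens")
# prints:
# skip rate 0%: 1.00x tokens
# skip rate 20%: 1.25x tokens
# skip rate 25%: 1.33x tokens
```

Under this assumption, the 20%-25% skip rates reported above would free enough compute to train on roughly a quarter to a third more tokens at the same total budget. Real savings depend on how cost is split between attention, MLP, and other components.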

Implications for AI Infrastructure

The ability to train deeper models more efficiently has direct implications for enterprise AI infrastructure. By reinvesting compute savings from sparsity into additional training tokens, organizations can potentially achieve better model performance without increasing total parameter count or inference memory budgets. The NAG architecture, with its simple operations and minimal overhead, represents a practical approach to scaling deep learning in resource-constrained environments.

