
Expert Tying Reduces Memory Footprint of Mixture-of-Experts LLMs by Nearly Half

A new arXiv paper from Jaggi proposes Expert Tying, an architectural modification for Mixture-of-Experts LLMs that shares expert parameters across consecutive transformer layers. Pretraining experiments show memory footprint reduction by almost 2x with virtually no degradation in perplexity or downstream quality, evaluated on OLMoE, Qwen3, and DeepSeek-style architectures.

iGEN Editorial
June 16, 2026

Large Language Models (LLMs) based on Mixture-of-Experts (MoE) architectures scale efficiently by activating only a small fraction of their experts per token. However, as a research paper published on arXiv by Jaggi explains, the full parameter count, which is dominated by the expert parameters, must still be held in memory during both training and inference. This memory bottleneck limits the practical deployment of ever-larger models.

The Memory Challenge in MoE

MoE models achieve compute efficiency by routing each input token to a subset of experts, reducing the computational cost per step. Yet the entire set of expert parameters resides in memory, creating a heavy footprint. The paper notes that this constraint applies across state-of-the-art architectures including OLMoE, Qwen3, and DeepSeek-style MoEs. To address this, the authors introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention mechanisms.

How Expert Tying Works

Expert Tying leaves the routing logic unchanged: each layer still selects its own experts for each token. What changes is that a group of consecutive layers reuses the same expert weights. According to the paper, this exploits the parameter redundancy inherent in MoE pathways. The approach is evaluated in pretraining experiments on the OLMoE, Qwen3, and DeepSeek-style architecture families.
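The idea of per-layer routing over shared expert weights can be illustrated with a toy sketch. This is not the paper's implementation: the class name `TiedMoEStack`, the `tie_group` parameter, the square-matrix "experts", and the unweighted top-k sum are all assumptions made for illustration, and attention is omitted entirely.

```python
import random


def make_matrix(rows, cols, rng):
    """Small random matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TiedMoEStack:
    """Toy MoE stack (hypothetical): one router per layer, but one
    expert bank shared by each group of `tie_group` consecutive layers."""

    def __init__(self, n_layers, n_experts, d_model, tie_group=2, top_k=1, seed=0):
        rng = random.Random(seed)
        self.tie_group = tie_group
        self.top_k = top_k
        n_banks = (n_layers + tie_group - 1) // tie_group
        # Expert weights: allocated once per group of layers, not once per layer.
        self.banks = [
            [make_matrix(d_model, d_model, rng) for _ in range(n_experts)]
            for _ in range(n_banks)
        ]
        # Routers: still independent for every layer.
        self.routers = [make_matrix(n_experts, d_model, rng) for _ in range(n_layers)]

    def bank_for(self, layer):
        """Return the shared expert bank used by the given layer."""
        return self.banks[layer // self.tie_group]

    def forward(self, x):
        for layer, router in enumerate(self.routers):
            scores = matvec(router, x)
            # Each layer picks its own top-k experts from its own router scores.
            chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
            chosen = chosen[: self.top_k]
            bank = self.bank_for(layer)
            update = [0.0] * len(x)
            for e in chosen:
                out = matvec(bank[e], x)
                update = [u + o for u, o in zip(update, out)]
            # Residual connection around the expert block.
            x = [xi + ui for xi, ui in zip(x, update)]
        return x


stack = TiedMoEStack(n_layers=4, n_experts=8, d_model=16, tie_group=2)
output = stack.forward([1.0] * 16)
```

In this sketch, layers 0 and 1 read from the same expert bank while each keeps its own router, so four layers hold only two sets of expert weights.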

Pretraining Results

The pretraining experiments show that tying experts can reduce memory footprint by almost 2x with virtually no degradation in perplexity or downstream quality. The key findings can be summarised as follows:

Metric | Without Expert Tying | With Expert Tying
Memory footprint (relative) | 1.0x | ~0.5x (almost 2x reduction)
Perplexity (quality measure) | Baseline | Virtually no degradation
Downstream task performance | Baseline | No significant loss

The paper reports that this favorable compute-to-memory trade-off holds across all tested architectures, including OLMoE, Qwen3, and DeepSeek-style MoEs.
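One plausible reading of why the reduction is "almost" rather than exactly 2x is that only the expert parameters are shared, while routers, attention, and other weights remain per-layer. The arithmetic below is a hedged sketch of that reasoning; the function name `footprint_reduction`, the group size of two, and the 95/5 parameter split are illustrative assumptions, not figures from the paper.

```python
def footprint_reduction(expert_params, other_params, tie_group):
    """Ratio of untied to tied parameter memory (hypothetical helper).

    Assumes expert parameters are shared across `tie_group` consecutive
    layers and that all other parameters are left untouched.
    """
    baseline = expert_params + other_params
    tied = expert_params / tie_group + other_params
    return baseline / tied


# Illustrative split only: 95% of parameters in experts, 5% elsewhere.
ratio = footprint_reduction(expert_params=95.0, other_params=5.0, tie_group=2)
```

With this made-up split the reduction comes out to roughly 1.9x, and it approaches 2x only as the expert share of the total approaches 100%.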

Implications for Scaling LLMs

The method, as described in the paper, provides a highly favorable compute-to-memory trade-off for training and scaling next-generation LLMs. By reducing memory requirements without sacrificing quality, Expert Tying could let larger models run on existing hardware, lowering the barrier to deploying state-of-the-art language models in resource-constrained environments. The research underscores the value of architectural innovations that tackle the memory wall in deep learning.

