Expert Tying Reduces Memory Footprint of Mixture-of-Experts LLMs by Nearly Half

A new arXiv paper from Jaggi proposes Expert Tying, an architectural modification for Mixture-of-Experts LLMs that shares expert parameters across consecutive transformer layers. In pretraining experiments on OLMoE, Qwen3, and DeepSeek-style architectures, the method reduces memory footprint by almost 2x with virtually no degradation in perplexity or downstream quality.

iGEN Editorial

June 16, 2026

Large Language Models (LLMs) based on Mixture-of-Experts (MoE) architectures offer efficient scaling by activating only a small fraction of their experts per token. However, as a research paper published on arXiv by Jaggi explains, the full parameter count—dominated by the expert parameters—must still be held in memory during both training and inference. This memory bottleneck limits the practical deployment of ever-larger models.

The Memory Challenge in MoE

MoE models achieve compute efficiency by routing each input token to a subset of experts, reducing the computational cost per step. Yet the entire set of expert parameters resides in memory, creating a heavy footprint. The paper notes that this constraint applies across state-of-the-art architectures including OLMoE, Qwen3, and DeepSeek-style MoEs. To address this, the authors introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention mechanisms.
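The gap between what is computed per token and what must be stored can be made concrete with some rough arithmetic. The sketch below is not taken from the paper: the function name `moe_param_counts`, the layer shapes (a three-matrix gated MLP per expert, four square attention matrices, a linear router), the configuration numbers, and the group size of two are all illustrative assumptions.

```python
def moe_param_counts(n_layers, n_experts, top_k, d_model, d_ff, tie_group=1):
    """Rough parameter counts for a hypothetical MoE transformer stack.

    Assumed shapes (not from the paper): each expert is a gated MLP with
    three d_model x d_ff matrices, attention uses four d_model x d_model
    matrices, and the router is a single d_model x n_experts matrix.
    tie_group=1 means every layer owns its experts; tie_group=g means one
    bank of experts is shared by g consecutive layers.
    """
    expert = 3 * d_model * d_ff
    attention = 4 * d_model * d_model
    router = d_model * n_experts

    # Number of distinct expert banks that must be stored (ceiling division).
    n_banks = -(-n_layers // tie_group)

    # Stored: attention and routers stay per layer, experts are per bank.
    stored = n_layers * (attention + router) + n_banks * n_experts * expert
    # Active per token: every layer still runs top_k experts.
    active = n_layers * (attention + router + top_k * expert)
    return stored, active


# Made-up configuration, used only to illustrate the arithmetic.
config = dict(n_layers=16, n_experts=64, top_k=8, d_model=2048, d_ff=1024)

stored_untied, active_untied = moe_param_counts(**config, tie_group=1)
stored_tied, active_tied = moe_param_counts(**config, tie_group=2)

print(f"untied: stored={stored_untied:,} active={active_untied:,}")
print(f"tied:   stored={stored_tied:,} active={active_tied:,}")
print(f"stored reduction: {stored_untied / stored_tied:.2f}x")
```

With these made-up numbers the stored parameter count shrinks by roughly 1.9x while the per-token active count is unchanged. Under these assumptions the reduction stays just short of 2x because attention and routing remain separate for every layer. This is back-of-the-envelope arithmetic, not a reproduction of the paper's measurements.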

How Expert Tying Works

Expert Tying leaves the routing logic unchanged: each layer still selects its own experts for each token, but all layers in a group reuse the same expert weights. According to the paper, this exploits the parameter redundancy inherent in MoE pathways. The approach is evaluated in pretraining experiments on the three architecture families named above.
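The idea of per-layer routing over shared expert weights can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: attention is omitted entirely, the experts are tiny ReLU MLPs, the top-k softmax gating is one common choice rather than the paper's confirmed one, and every name here (`build_stack`, `forward`, `tie_group`) is hypothetical.

```python
import math
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_expert(d_model, d_ff, rng):
    """One toy expert: a two-matrix ReLU MLP stored as nested lists."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    return w_in, w_out


def run_expert(expert, x):
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


def build_stack(n_layers, n_experts, d_model, d_ff, tie_group, seed=0):
    """Create one expert bank per group of layers and one router per layer."""
    rng = random.Random(seed)
    n_banks = math.ceil(n_layers / tie_group)
    banks = [[make_expert(d_model, d_ff, rng) for _ in range(n_experts)]
             for _ in range(n_banks)]
    layers = []
    for i in range(n_layers):
        # The router is created fresh for every layer (never shared).
        router = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                  for _ in range(n_experts)]
        # The expert bank is a shared reference within each group.
        layers.append({"router": router, "experts": banks[i // tie_group]})
    return layers, banks


def forward(layers, x, top_k):
    """Route one token vector through the stack (attention omitted)."""
    for layer in layers:
        logits = matvec(layer["router"], x)
        chosen = sorted(range(len(logits)), key=lambda e: logits[e],
                        reverse=True)[:top_k]
        # Softmax over the selected experts only (an assumed gating rule).
        peak = max(logits[e] for e in chosen)
        gates = [math.exp(logits[e] - peak) for e in chosen]
        total = sum(gates)
        mixed = [0.0] * len(x)
        for e, g in zip(chosen, gates):
            out = run_expert(layer["experts"][e], x)
            mixed = [m + (g / total) * o for m, o in zip(mixed, out)]
        x = [xi + mi for xi, mi in zip(x, mixed)]  # residual connection
    return x


layers, banks = build_stack(n_layers=4, n_experts=4, d_model=8, d_ff=16,
                            tie_group=2)
y = forward(layers, [1.0] * 8, top_k=2)
print(len(banks), "expert banks stored for", len(layers), "layers")
```

Here layers 0 and 1 point at the same expert bank and layers 2 and 3 at another, so only two banks are held in memory for four layers, while each layer's router can still pick different experts from the shared bank.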

Pretraining Results

The pretraining experiments show that tying experts can reduce memory footprint by almost 2x with virtually no degradation in perplexity or downstream quality. The key findings are summarised below:

Metric	Without Expert Tying	With Expert Tying
Memory footprint (relative)	1.0x	~0.5x (almost 2x reduction)
Perplexity (quality measure)	Baseline	Virtually no degradation
Downstream task performance	Baseline	No significant loss

The paper reports that this favorable compute-to-memory trade-off holds across all tested architectures, including OLMoE, Qwen3, and DeepSeek-style MoEs.

Implications for Scaling LLMs

The method, as described in the paper, provides a highly favorable compute-to-memory trade-off, advancing efficient training and scaling of next-generation LLMs. By reducing memory requirements without sacrificing quality, Expert Tying could enable larger models to run on existing hardware, lowering the barrier for deploying state-of-the-art language models in resource-constrained environments. The research underscores the value of architectural innovations that tackle the memory wall in deep learning.
