Large Language Models (LLMs) based on Mixture-of-Experts (MoE) architectures scale efficiently by activating only a small fraction of their experts per token. However, as an arXiv paper by Jaggi explains, the full parameter count, which is dominated by the expert parameters, must still be held in memory during both training and inference. This memory bottleneck limits the practical deployment of ever-larger models.
The Memory Challenge in MoE
MoE models achieve compute efficiency by routing each input token to a subset of experts, reducing the computational cost per step. Yet the entire set of expert parameters resides in memory, creating a heavy footprint. The paper notes that this constraint applies across state-of-the-art architectures including OLMoE, Qwen3, and DeepSeek-style MoEs. To address this, the authors introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention mechanisms.
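To make the gap between compute and memory concrete, the following minimal sketch counts the parameters that must stay resident against those that are active for a single token. The function name and all sizes are hypothetical illustration values, not figures from the paper, and the model is simplified to identical layers with equally sized experts.

```python
def moe_param_counts(num_layers, num_experts, top_k,
                     expert_params, other_params_per_layer):
    """Return (resident, active) parameter counts for a simplified MoE stack.

    All arguments are hypothetical illustration values, not numbers
    taken from the paper.
    """
    # Every expert of every layer has to be held in memory.
    resident = num_layers * (num_experts * expert_params + other_params_per_layer)
    # Only the top_k routed experts per layer do work for a given token.
    active = num_layers * (top_k * expert_params + other_params_per_layer)
    return resident, active


resident, active = moe_param_counts(
    num_layers=16, num_experts=64, top_k=8,
    expert_params=1_000_000, other_params_per_layer=4_000_000,
)
print(f"resident: {resident:,}  active per token: {active:,}")
print(f"fraction active: {active / resident:.3f}")
```

With these made-up sizes, fewer than a fifth of the resident parameters take part in any one token's forward pass, which is the imbalance that motivates sharing expert weights.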
How Expert Tying Works
Expert Tying does not alter the routing logic: each layer still selects its own experts for each token. What changes is that the same expert weights are reused across a group of consecutive layers. According to the paper, this exploits the parameter redundancy inherent in MoE pathways. The approach is evaluated in pretraining experiments on the three architecture families mentioned above.
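The idea can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the authors' implementation: attention is omitted, each expert is a single linear map, gating is a softmax over the top-k router logits, and the class name `TiedMoEStack` and the `group_size` parameter are hypothetical. What it does show is the structure described above: one expert bank per group of consecutive layers, and one independent router per layer.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random matrix stored as a list of rows."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TiedMoEStack:
    """Toy MoE stack with expert weights shared across consecutive layers.

    Hypothetical sketch: attention is left out and experts are plain
    linear maps, so only the weight-sharing pattern is illustrated.
    """

    def __init__(self, num_layers, group_size, num_experts, dim, top_k, seed=0):
        rng = random.Random(seed)
        self.group_size = group_size
        self.top_k = top_k
        num_groups = math.ceil(num_layers / group_size)
        # One expert bank per GROUP of consecutive layers (shared weights).
        self.expert_banks = [
            [make_matrix(dim, dim, rng) for _ in range(num_experts)]
            for _ in range(num_groups)
        ]
        # One router per LAYER (never shared).
        self.routers = [make_matrix(num_experts, dim, rng) for _ in range(num_layers)]

    def bank_for_layer(self, layer):
        return self.expert_banks[layer // self.group_size]

    def route(self, layer, x):
        """Pick the top_k experts for this layer and softmax their logits."""
        logits = matvec(self.routers[layer], x)
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[: self.top_k]
        peak = max(logits[i] for i in top)
        exps = [math.exp(logits[i] - peak) for i in top]
        total = sum(exps)
        return [(i, e / total) for i, e in zip(top, exps)]

    def forward(self, x):
        for layer in range(len(self.routers)):
            bank = self.bank_for_layer(layer)
            mixed = [0.0] * len(x)
            for expert_idx, gate in self.route(layer, x):
                y = matvec(bank[expert_idx], x)
                mixed = [m + gate * yi for m, yi in zip(mixed, y)]
            # Residual connection around the MoE block.
            x = [xi + mi for xi, mi in zip(x, mixed)]
        return x


stack = TiedMoEStack(num_layers=4, group_size=2, num_experts=4, dim=8, top_k=2)
y = stack.forward([1.0] * 8)
print(len(stack.expert_banks), "expert banks for", len(stack.routers), "layers")
```

Layers 0 and 1 read from the same expert bank object, as do layers 2 and 3, while all four routers hold their own weights, so two layers in a group can still send the same token to different experts.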
Pretraining Results
The pretraining experiments demonstrate that tying experts can reduce the memory footprint by almost 2x with virtually no degradation in perplexity or downstream quality. The key findings are summarized below:
| Metric | Without Expert Tying | With Expert Tying |
|---|---|---|
| Memory footprint (relative) | 1.0x | ~0.5x (almost 2x reduction) |
| Perplexity (quality measure) | Baseline | Virtually no degradation |
| Downstream task performance | Baseline | No significant loss |
The paper reports that this favorable compute-to-memory trade-off holds across all three tested architecture families.
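One plausible reading of "almost 2x" rather than exactly 2x is that only expert parameters are shared, while attention and router weights stay per-layer. The back-of-the-envelope sketch below illustrates that reading. It assumes a group size of two tied layers and hypothetical expert fractions; neither value is stated in this summary, and the function `tied_memory_ratio` is an illustration covering parameter memory only.

```python
def tied_memory_ratio(expert_fraction, group_size):
    """Parameter memory with tying, relative to the untied model.

    expert_fraction: share of the untied model's parameters that sit in
        experts (hypothetical value, not from the paper).
    group_size: number of consecutive layers sharing one expert bank
        (assumed; the layer count is taken to be divisible by it).
    """
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must be in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Expert memory shrinks by group_size; everything else is unchanged.
    return expert_fraction / group_size + (1.0 - expert_fraction)


for frac in (0.90, 0.95, 0.99):
    ratio = tied_memory_ratio(frac, group_size=2)
    print(f"expert fraction {frac:.2f}: {ratio:.3f}x memory, "
          f"{1 / ratio:.2f}x reduction")
```

Under these assumptions the reduction approaches 2x only as the expert share of parameters approaches 100%, which is consistent with the observation that expert parameters dominate the total count.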
Implications for Scaling LLMs
According to the paper, this trade-off advances efficient training and scaling of next-generation LLMs. By reducing memory requirements without sacrificing quality, Expert Tying could enable larger models to run on existing hardware, lowering the barrier for deploying state-of-the-art language models in resource-constrained environments. The research underscores the value of architectural innovations that tackle the memory wall in deep learning.