NVIDIA Open-Sources Nemotron 3 Ultra: 550B-Parameter Hybrid Mamba-Transformer Model for Agentic AI

NVIDIA has introduced Nemotron 3 Ultra, a Mixture-of-Experts language model with 550 billion total parameters and a hybrid Mamba-Attention architecture. Only 55 billion parameters are active for any given token. Pre-trained on 20 trillion tokens and supporting a context length of 1 million tokens, the model achieves up to 6x higher inference throughput than state-of-the-art public LLMs while matching their accuracy. All checkpoints, training data, and recipes are open-sourced on HuggingFace.

iGEN Editorial

June 16, 2026


Enterprises deploying large language models for long-running autonomous agent tasks face a dual challenge: maintaining high accuracy while controlling inference cost and latency. NVIDIA's latest release aims to address this directly. According to a paper published on arXiv, the company has introduced Nemotron 3 Ultra, a Mixture-of-Experts (MoE) model with 550 billion total parameters and a hybrid Mamba-Attention architecture. Only 55 billion of those parameters are active for any given token, so per-token compute is far lower than in a dense model of the same size.

Model Architecture and Scale

Nemotron 3 Ultra combines two neural network paradigms: the Mamba state-space model, efficient for long sequences, and the Transformer attention mechanism, strong on recall and reasoning. This hybrid design is paired with a Mixture-of-Experts layer using LatentMoE, which routes each input token to a subset of expert sub-networks. The model's total parameter count is 550 billion, but only 55 billion are activated for any given token, enabling high capacity without proportional compute cost.
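The routing idea can be illustrated with a minimal sketch of generic top-k expert routing. This is not NVIDIA's LatentMoE, whose internals the article does not describe; the function names, the four example router scores, and the choice of k=2 are all assumptions made for illustration. Only the 550 billion and 55 billion parameter counts come from the article.

```python
import math

def softmax(scores):
    """Convert raw router scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Generic top-k routing, shown as an assumption; the article does not
    specify how LatentMoE selects experts.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Parameter counts reported in the article, in billions.
TOTAL_PARAMS_B = 550
ACTIVE_PARAMS_B = 55
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Hypothetical router scores for one token over four experts.
routing = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)

print(f"active fraction per token: {active_fraction:.0%}")
print("selected experts:", [index for index, _ in routing])
```

With the reported counts, roughly one tenth of the model's parameters take part in processing each token, which is where the compute savings over a dense 550B model come from.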

The model was pre-trained on 20 trillion text tokens and later had its context length extended to 1 million tokens, making it suitable for tasks that require processing very long documents, histories, or multi-turn agent conversations.

Training and Post-Training Innovations

NVIDIA applied a multi-stage post-training pipeline. After initial pre-training, the model underwent Supervised Fine-Tuning (SFT), followed by Reinforcement Learning (RL) and a novel technique called Multi-teacher On-Policy Distillation (MOPD). Additional innovations include:

Multi Token Prediction (MTP): predicts multiple future tokens simultaneously to improve training efficiency.
NVFP4 pre-training: uses NVIDIA's 4-bit floating-point format for memory-efficient training.
Multi-environment RLVR: reinforcement learning with verifiable rewards across diverse environments.
Reasoning budget control: dynamically adjusts the compute allocated to reasoning steps.

Technology	Purpose
LatentMoE	Efficient parameter activation via mixture of experts
Multi Token Prediction	Improves training throughput and convergence
NVFP4 pre-training	Low-precision computation for memory savings
MOPD	Distills knowledge from multiple teachers during RL
Reasoning budget control	Allocates compute per reasoning step adaptively
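
To make the Multi Token Prediction entry more concrete, here is a minimal sketch of how training targets could be built when a model predicts several future tokens at each position. The helper name `mtp_targets`, the token IDs, and the choice of two future tokens are hypothetical; the article does not say how many tokens Nemotron 3 Ultra predicts or how its prediction heads are arranged.

```python
def mtp_targets(tokens, num_future):
    """Pair each position with the next `num_future` tokens as targets.

    Standard next-token prediction is the special case num_future=1.
    Positions without enough following tokens are skipped.
    """
    examples = []
    for position in range(len(tokens) - num_future):
        targets = tokens[position + 1 : position + 1 + num_future]
        examples.append((position, targets))
    return examples

# Hypothetical token IDs for a five-token sequence.
sequence = [10, 11, 12, 13, 14]
examples = mtp_targets(sequence, num_future=2)

for position, targets in examples:
    print(f"position {position}: predict {targets}")
```

Each position supplies more than one supervised target per forward pass, which is the intuition behind the efficiency claim above.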

Performance and Throughput

According to the paper, Nemotron 3 Ultra achieves up to 6x higher inference throughput than state-of-the-art publicly available LLMs while attaining comparable accuracy. This throughput advantage, combined with the 1 million token context window, makes the model well suited to long-running autonomous agentic tasks such as complex multi-step planning, document analysis, and code generation.

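The results are attributed to the efficient MoE architecture and the hybrid Mamba-Transformer design. Because Mamba layers scale linearly with sequence length, the hybrid design reduces the share of computation that is subject to attention's quadratic cost on very long sequences.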
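
A rough back-of-the-envelope sketch shows why this matters at long context. It compares how the work of a full self-attention layer (which scores every token against every other token) and of a state-space layer (which makes one pass over the sequence) grows as the context lengthens. The 8,000-token baseline is an arbitrary assumption, and the counts ignore constant factors, hidden dimensions, and the actual mix of layers in the model, none of which the article specifies.

```python
def attention_pair_count(seq_len):
    """Token pairs scored by full self-attention: grows as n squared."""
    return seq_len * seq_len

def recurrent_step_count(seq_len):
    """Sequential state updates in a state-space layer: grows as n."""
    return seq_len

short_ctx = 8_000       # assumed baseline context, for illustration only
long_ctx = 1_000_000    # context length reported in the article

length_growth = long_ctx / short_ctx
attention_growth = attention_pair_count(long_ctx) / attention_pair_count(short_ctx)
ssm_growth = recurrent_step_count(long_ctx) / recurrent_step_count(short_ctx)

print(f"context grows {length_growth:.0f}x")
print(f"attention work grows {attention_growth:.0f}x")
print(f"state-space work grows {ssm_growth:.0f}x")
```

Under these simplifying assumptions, attention work grows with the square of the increase in context length while state-space work grows only in proportion to it.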

Open-Source Availability for Enterprise Adoption

NVIDIA is releasing the model fully open-source. The base, post-trained, and quantized checkpoints are available on HuggingFace, alongside the training data and recipe. This transparency allows enterprise teams to inspect, fine-tune, and deploy the model on their own infrastructure. The open release lowers the barrier for organizations that need a high-capacity, efficient LLM for agentic workflows without vendor lock-in.

For CTOs and technology procurement leaders evaluating AI models for production, Nemotron 3 Ultra offers a rare combination of extreme scale, high throughput, and open access. The model's design choices, especially the hybrid architecture and aggressive post-training, are likely to inform future enterprise AI deployments.

