
NVIDIA Open-Sources Nemotron 3 Ultra: 550B-Parameter Hybrid Mamba-Transformer Model for Agentic AI

NVIDIA has introduced Nemotron 3 Ultra, a Mixture-of-Experts language model with 550 billion total parameters and a hybrid Mamba-Attention architecture. Only 55 billion parameters are active for any given token. Pre-trained on 20 trillion tokens and supporting a context length of 1 million tokens, the model is reported to achieve up to 6x higher inference throughput than state-of-the-art public LLMs at comparable accuracy. All checkpoints, training data, and recipes are open-sourced on Hugging Face.

iGEN Editorial
June 16, 2026

Enterprises deploying large language models for long-running autonomous agent tasks face a dual challenge: maintaining high accuracy while controlling inference cost and latency. NVIDIA's latest release aims to address this directly. According to a paper published on arXiv, the company has introduced Nemotron 3 Ultra, a Mixture-of-Experts (MoE) model with 550 billion total parameters and a hybrid Mamba-Attention architecture. Only 55 billion parameters are active for any given token, which reduces compute per token relative to a dense model of the same total size.

Model Architecture and Scale

Nemotron 3 Ultra combines two neural network paradigms: the Mamba state-space model, efficient for long sequences, and the Transformer attention mechanism, strong on recall and reasoning. This hybrid design is paired with a Mixture-of-Experts layer using LatentMoE, which routes each input token to a subset of expert sub-networks. The model's total parameter count is 550 billion, but only 55 billion are activated for any given token, enabling high capacity without proportional compute cost.
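The routing idea described above can be illustrated with a minimal sketch of generic top-k expert routing. This is not NVIDIA's LatentMoE, whose details the article does not describe; the function names, the scalar toy experts, and the choice of a softmax over the selected experts are all assumptions made for illustration. Only the 550 billion and 55 billion parameter figures come from the article.

```python
import math

# Figures from the article: total vs. active parameters per token.
TOTAL_PARAMS = 550e9
ACTIVE_PARAMS = 55e9
ACTIVE_FRACTION = ACTIVE_PARAMS / TOTAL_PARAMS  # one tenth of the weights per token


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Hypothetical helper: a generic top-k gate, not the LatentMoE router.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Combine the outputs of the selected experts; the others are never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Toy experts that just scale their input (assumed for illustration).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # router scores for a single token

output = moe_forward(1.0, experts, logits, k=2)
print(f"active fraction: {ACTIVE_FRACTION:.0%}, toy output: {output}")
```

In this toy setup only two of the four experts run for the token, which mirrors how a sparse model can hold far more parameters than it uses on any single forward step.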

The model was pre-trained on 20 trillion text tokens and later had its context length extended to 1 million tokens, making it suitable for tasks that require processing very long documents, histories, or multi-turn agent conversations.

Training and Post-Training Innovations

NVIDIA applied a multi-stage post-training pipeline. After initial pre-training, the model underwent Supervised Fine-Tuning (SFT), followed by Reinforcement Learning (RL) and a novel technique called Multi-teacher On-Policy Distillation (MOPD). Additional innovations include:

  • Multi Token Prediction (MTP): predicts multiple future tokens simultaneously to improve training efficiency.
  • NVFP4 pre-training: uses NVIDIA's 4-bit floating-point format for memory-efficient training.
  • Multi-environment RLVR: reinforcement learning with verifiable rewards across diverse environments.
  • Reasoning budget control: dynamically adjusts the compute allocated to reasoning steps.

Technology                  Purpose
LatentMoE                   Efficient parameter activation via mixture of experts
Multi Token Prediction      Improves training throughput and convergence
NVFP4 pre-training          Low-precision computation for memory savings
MOPD                        Distills knowledge from multiple teachers during RL
Reasoning budget control    Allocates compute per reasoning step adaptively
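To make the Multi Token Prediction entry concrete, here is a minimal sketch of how training targets can be built when a model predicts several future tokens per position. The article does not describe how Nemotron 3 Ultra implements MTP, so the function name `mtp_targets`, the window size, and the choice to drop positions without a full window are assumptions for illustration only.

```python
def mtp_targets(tokens, num_future):
    """Build (position, future_tokens) training pairs.

    At position t the model has seen tokens[: t + 1] and is asked to predict
    the next `num_future` tokens at once. Positions too close to the end to
    have a full window are dropped (an assumed simplification).
    """
    last = len(tokens) - num_future
    return [(t, tokens[t + 1 : t + 1 + num_future]) for t in range(last)]


sentence = ["the", "cat", "sat", "on", "mat"]
for position, targets in mtp_targets(sentence, num_future=2):
    print(position, sentence[: position + 1], "->", targets)
```

With `num_future=1` this reduces to ordinary next-token prediction, so the extra targets can be read as additional supervision at each position.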

Performance and Throughput

Nemotron 3 Ultra achieves up to 6x higher inference throughput than state-of-the-art publicly available LLMs at on-par accuracy, according to the paper. This throughput advantage, combined with the 1 million token context window, positions the model for long-running autonomous agentic tasks such as complex multi-step planning, document analysis, and code generation.

The results are attributed to the efficient MoE architecture and the hybrid Mamba-Transformer design. Mamba layers scale linearly with sequence length, so the hybrid spends less of its compute on attention, whose cost grows quadratically, when sequences are very long.
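The scaling argument can be made concrete with a rough count of operations. The sketch below compares the number of token-pair interactions in full self-attention with the number of state updates in a recurrent, state-space style scan. It ignores constant factors, hidden dimensions, and the attention layers a hybrid model still keeps, and the function names are assumptions for illustration, not measurements from the paper.

```python
def attention_pair_count(seq_len):
    """Token-pair interactions for full self-attention: grows as n squared."""
    return seq_len * seq_len


def recurrent_step_count(seq_len):
    """State updates for a recurrent or state-space scan: grows as n."""
    return seq_len


for n in (1_000, 100_000, 1_000_000):
    pairs = attention_pair_count(n)
    steps = recurrent_step_count(n)
    print(f"n={n:>9,}  attention pairs={pairs:.1e}  scan steps={steps:.1e}  ratio={pairs // steps:,}")
```

The gap widens in proportion to the sequence length, which is why the difference matters most at context lengths approaching 1 million tokens.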

Open-Source Availability for Enterprise Adoption

NVIDIA is releasing the model as fully open source. The base, post-trained, and quantized checkpoints are available on Hugging Face, alongside the training data and recipe. This transparency allows enterprise teams to inspect, fine-tune, and deploy the model on their own infrastructure. The open release lowers the barrier for organizations that need a high-capacity, efficient LLM for agentic workflows without vendor lock-in.

For CTOs and technology procurement leaders evaluating AI models for production, Nemotron 3 Ultra offers a rare combination of extreme scale, high throughput, and open access. The model's design choices, especially the hybrid architecture and aggressive post-training, are likely to inform future enterprise AI deployments.

