
M*: A Modular, Extensible Serving System for Efficient Multimodal AI Inference

Researchers have developed M*, a universal serving system for composite AI models that integrates diverse components like vision encoders and language backbones. Using a novel 'Walk Graph' abstraction, M* achieves significant performance improvements: 20% lower latency for text-to-image, up to 2.7x higher throughput for text-to-speech, and 12.5x faster robotic planning rollouts compared to existing baselines.

iGEN Editorial
June 16, 2026

As enterprises increasingly deploy multimodal AI models that combine vision, language, speech, and action generation, the infrastructure needed to serve these composite architectures has become a critical bottleneck. Existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited for modern multimodal models. According to a new paper on arXiv, researchers from multiple institutions have introduced M*, a modular and extensible serving system designed specifically for efficient serving of composite AI models.

The Challenge of Multimodal Model Serving

Composite model architectures integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. These architectures underpin a broad class of models including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing serving frameworks cannot accommodate this architectural diversity.

M*'s Walk Graph Abstraction

M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. The authors call this abstraction the Walk Graph. It can concisely capture composite models from a broad range of families.
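To make the idea concrete, here is a minimal sketch of a model expressed as a dataflow graph, with each request executed as a walk over it. This is an illustration only, not M*'s API: the names `Component`, `ModelGraph`, `run_walk`, and `tag`, the device labels, and the string-wrapping stand-ins for model components are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout: this is NOT M*'s API, only an illustration
# of "model as a dataflow graph, request as a traversal over it".


@dataclass
class Component:
    name: str
    fn: Callable[[str], str]  # stand-in for a model component's forward pass
    device: str = "gpu:0"     # placement label, assumed for illustration


@dataclass
class ModelGraph:
    components: Dict[str, Component] = field(default_factory=dict)
    edges: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, component: Component) -> None:
        self.components[component.name] = component
        self.edges.setdefault(component.name, [])

    def connect(self, src: str, dst: str) -> None:
        self.edges[src].append(dst)

    def run_walk(self, walk: List[str], payload: str) -> str:
        """Run one request as a walk: check every hop follows a graph edge,
        then apply each visited component in order."""
        for prev, nxt in zip(walk, walk[1:]):
            if nxt not in self.edges[prev]:
                raise ValueError(f"no edge {prev} -> {nxt}")
        for name in walk:
            payload = self.components[name].fn(payload)
        return payload


def tag(label: str) -> Callable[[str], str]:
    """Build a fake component that wraps its input in its own name."""
    return lambda x: f"{label}({x})"


graph = ModelGraph()
for name, device in [
    ("vision_encoder", "gpu:0"),
    ("language_backbone", "gpu:1"),
    ("diffusion_head", "gpu:2"),
    ("audio_codec", "gpu:3"),
]:
    graph.add(Component(name, tag(name), device))

graph.connect("vision_encoder", "language_backbone")
graph.connect("language_backbone", "diffusion_head")
graph.connect("language_backbone", "audio_codec")

# Two different tasks are two different walks over the same graph.
text_to_image = graph.run_walk(["language_backbone", "diffusion_head"], "prompt")
image_to_speech = graph.run_walk(
    ["vision_encoder", "language_backbone", "audio_codec"], "image"
)
print(text_to_image)
print(image_to_speech)
```

In this toy version, supporting a new task means describing a new walk, and moving a component to another device means changing one label. The serving logic itself stays untouched, which is the kind of separation the abstraction is meant to provide.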

Performance Benchmarks

The researchers instantiated M* on representative models and measured its performance against baselines. The results are summarized in the table below:

Workload | Model | Metric | Improvement over baseline
Text-to-image | BAGEL | End-to-end latency | 20% lower than vLLM-Omni
Text-to-speech | Qwen3-Omni | Real-time factor | Up to 2.9x lower
Text-to-speech | Qwen3-Omni | Throughput | Up to 2.7x higher
Robotic planning | V-JEPA 2-AC rollout | Planning speed | Up to 12.5x faster

For text-to-image workloads on the BAGEL model, M* achieved 20% lower end-to-end latency than vLLM-Omni. On text-to-speech tasks with Qwen3-Omni, M* delivered up to 2.9x lower real-time factor and up to 2.7x higher throughput. In robotic planning, M* outperformed the V-JEPA 2-AC rollout baseline by up to 12.5x.
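Because these results mix a percentage with multiplicative factors, it can help to put them on a common scale. The sketch below assumes the conventional definition of real-time factor (processing time divided by the duration of the audio produced), which the article itself does not spell out. The 5.8-second and 10-second figures are hypothetical inputs chosen for illustration, not measurements from the paper, and the function names are invented for this example.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0.20 means 20% lower)
    into the equivalent speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: time spent generating divided by audio duration.
    Values below 1.0 mean faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# 20% lower latency expressed as a speedup factor.
latency_speedup = speedup_from_reduction(0.20)

# Hypothetical baseline: 5.8 s of compute to synthesize 10 s of speech.
baseline_rtf = real_time_factor(5.8, 10.0)

# Applying the reported best-case factor ("up to 2.9x lower") to that baseline.
improved_rtf = baseline_rtf / 2.9

print(f"latency speedup: {latency_speedup:.2f}x")
print(f"baseline RTF:    {baseline_rtf:.2f}")
print(f"improved RTF:    {improved_rtf:.2f}")
```

Read this way, a 20% latency reduction corresponds to a 1.25x speedup, a considerably smaller multiple than the best-case text-to-speech and robotic planning figures.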

Developer Effort and Extensibility

The authors emphasize that M* paves the way toward more efficient serving of complex models with minimal developer effort. By providing a modular, extensible framework, M* allows AI teams to compose and deploy multimodal models without rearchitecting the serving stack for each new model family.

Implications for Enterprise AI Infrastructure

For enterprise technology leaders evaluating AI serving solutions, M* represents a departure from monolithic inference frameworks. Its ability to handle vision, language, speech, and action components within a single dataflow graph could simplify infrastructure for applications like multimodal customer service bots, autonomous systems, and content generation pipelines. The reported performance gains, including the 20% latency reduction on text-to-image workloads, could translate to lower operational costs and faster response times in production environments.

As composite AI models become more prevalent, serving infrastructure must evolve. M* demonstrates a path forward that is both performant and flexible, reducing the friction of deploying next-generation multimodal systems. The research team includes Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, and Stephanie Wang.

