M*: A Modular, Extensible Serving System for Efficient Multimodal AI Inference

Researchers have developed M*, a universal serving system for composite AI models that integrates diverse components like vision encoders and language backbones. Using a novel 'Walk Graph' abstraction, M* achieves significant performance improvements: 20% lower latency for text-to-image, up to 2.7x higher throughput for text-to-speech, and 12.5x faster robotic planning rollouts compared to existing baselines.

iGEN Editorial

June 16, 2026

As enterprises increasingly deploy multimodal AI models that combine vision, language, speech, and action generation, the infrastructure needed to serve these composite architectures has become a critical bottleneck. Existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited for modern multimodal models. According to a new paper on arXiv, researchers from multiple institutions have introduced M*, a modular and extensible serving system designed specifically for efficient serving of composite AI models.

The Challenge of Multimodal Model Serving

Composite model architectures integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. These architectures underpin a broad class of models including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing serving frameworks cannot accommodate this architectural diversity.

M*'s Walk Graph Abstraction

M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. The authors call this abstraction the Walk Graph. It can concisely capture composite models from a broad range of families.
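The article does not describe M*'s programming interface, so the following is only a minimal sketch of the general idea: model components as nodes in a dataflow graph, and each request as a traversal over a subset of those nodes. Every name in it (`ToyGraph`, `stage`, `walk`, and the component names) is a hypothetical illustration, not the authors' API, and the components are stand-ins that only record that they ran.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Trace = List[str]


def stage(name: str) -> Callable[[Trace], Trace]:
    """Stand-in for a model component: it only records that it ran."""
    def run(trace: Trace) -> Trace:
        return trace + [name]
    return run


@dataclass
class ToyGraph:
    """Hypothetical dataflow graph for illustration; not M*'s actual API."""
    nodes: Dict[str, Callable[[Trace], Trace]] = field(default_factory=dict)
    edges: Dict[str, List[str]] = field(default_factory=dict)

    def add_node(self, name: str, fn: Callable[[Trace], Trace]) -> None:
        self.nodes[name] = fn
        self.edges.setdefault(name, [])

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown node in edge {src} -> {dst}")
        self.edges[src].append(dst)

    def walk(self, path: List[str], trace: Trace) -> Trace:
        """Serve one request by visiting `path` in order along existing edges."""
        for prev, name in zip([None] + path[:-1], path):
            if name not in self.nodes:
                raise KeyError(f"unknown node {name}")
            if prev is not None and name not in self.edges[prev]:
                raise ValueError(f"no edge {prev} -> {name}")
            trace = self.nodes[name](trace)
        return trace


# One shared graph; the component names are assumptions for this example.
graph = ToyGraph()
for component in ["text_encoder", "language_backbone", "image_head", "speech_head"]:
    graph.add_node(component, stage(component))
graph.add_edge("text_encoder", "language_backbone")
graph.add_edge("language_backbone", "image_head")
graph.add_edge("language_backbone", "speech_head")

# Two different tasks are two different traversals over the same graph.
text_to_image = graph.walk(["text_encoder", "language_backbone", "image_head"], [])
text_to_speech = graph.walk(["text_encoder", "language_backbone", "speech_head"], [])
```

In this toy version, the text-to-image and text-to-speech requests share the encoder and backbone nodes and differ only in the final head, which is the kind of reuse a graph-based representation makes explicit. Placement onto a cluster and runtime optimizations are outside the scope of the sketch.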

Performance Benchmarks

The researchers instantiated M* on representative models and measured its performance against baselines. The results are summarized in the table below:

Workload	Model	Metric	Improvement over Baseline
Text-to-image	BAGEL	End-to-end latency	20% lower than vLLM-Omni
Text-to-speech	Qwen3-Omni	Real-time factor	Up to 2.9x lower
Text-to-speech	Qwen3-Omni	Throughput	Up to 2.7x higher
Robotic planning	V-JEPA 2-AC	Rollout speed	Up to 12.5x faster

For text-to-image workloads on the BAGEL model, M* achieved 20% lower end-to-end latency than vLLM-Omni. On text-to-speech tasks with Qwen3-Omni, M* delivered up to 2.9x lower real-time factor and up to 2.7x higher throughput. In robotic planning, M* ran V-JEPA 2-AC rollouts up to 12.5x faster than the baseline.
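To make the units concrete, here is a small arithmetic sketch of how these kinds of improvements are read. It assumes the conventional definition of real-time factor (processing time divided by the duration of the generated audio, so lower is better); the article does not state the paper's exact definition. All baseline numbers are hypothetical placeholders chosen for easy arithmetic, not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: time spent generating divided by audio length."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def reduce_by_percent(baseline: float, percent: float) -> float:
    """'N% lower' means the new value is (100 - N)% of the baseline."""
    return baseline * (1.0 - percent / 100.0)


def lower_by_factor(baseline: float, factor: float) -> float:
    """'k times lower' means the baseline divided by k."""
    return baseline / factor


def higher_by_factor(baseline: float, factor: float) -> float:
    """'k times higher' means the baseline multiplied by k."""
    return baseline * factor


# Hypothetical baseline values, for illustration only.
baseline_latency_s = 10.0                    # text-to-image latency
baseline_rtf = real_time_factor(5.8, 10.0)   # 5.8 s to produce 10 s of audio
baseline_throughput = 4.0                    # requests per second

improved_latency_s = reduce_by_percent(baseline_latency_s, 20.0)
improved_rtf = lower_by_factor(baseline_rtf, 2.9)
improved_throughput = higher_by_factor(baseline_throughput, 2.7)

print(f"latency: {baseline_latency_s:.1f} s -> {improved_latency_s:.1f} s")
print(f"RTF: {baseline_rtf:.2f} -> {improved_rtf:.2f}")
print(f"throughput: {baseline_throughput:.1f} -> {improved_throughput:.1f} req/s")
```

Note that the text-to-speech and planning figures are reported as "up to" values, so they describe the best case across the tested configurations rather than a typical gain.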

Developer Effort and Extensibility

The authors emphasize that M* paves the way toward more efficient serving of complex models with minimal developer effort. By providing a modular, extensible framework, M* allows AI teams to compose and deploy multimodal models without rearchitecting the serving stack for each new model family.

Implications for Enterprise AI Infrastructure

For enterprise technology leaders evaluating AI serving solutions, M* represents a departure from monolithic inference frameworks. Its ability to handle vision, language, speech, and action components within a single dataflow graph could simplify infrastructure for applications like multimodal customer service bots, autonomous systems, and content generation pipelines. The reported gains, such as the 20% latency reduction for text-to-image generation, could translate to lower operating costs and faster response times in production environments.

As composite AI models become more prevalent, serving infrastructure must evolve. M* demonstrates a path forward that is both performant and flexible, reducing the friction of deploying next-generation multimodal systems. The research team includes Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, and Stephanie Wang.

Sources:

M*: A Modular, Extensible Serving System for Efficient Multimodal AI Inference