As enterprises increasingly deploy multimodal AI models that combine vision, language, speech, and action generation, the infrastructure needed to serve these composite architectures has become a critical bottleneck. Existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited for modern multimodal models. According to a new paper on arXiv, researchers from multiple institutions have introduced M*, a modular and extensible serving system designed specifically for efficient serving of composite AI models.
The Challenge of Multimodal Model Serving
Composite model architectures integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. These architectures underpin a broad class of models including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing serving frameworks cannot accommodate this architectural diversity.
M*'s Walk Graph Abstraction
M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. The authors call this abstraction the Walk Graph. It can concisely capture composite models from a broad range of families.
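The idea of serving a request as a traversal over a component graph can be illustrated with a small sketch. This is not M*'s API: the class `ComponentGraph`, the method `run_walk`, and the node names are all hypothetical, and the "components" are stub functions that only record which stage ran. It assumes a request is an ordered list of component names that must follow edges declared in the graph.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Stage = Callable[[List[str]], List[str]]


@dataclass
class ComponentGraph:
    """Toy dataflow graph: nodes are model components, edges are allowed hand-offs.

    Hypothetical illustration only; not the M* implementation.
    """
    nodes: Dict[str, Stage] = field(default_factory=dict)
    edges: Dict[str, Set[str]] = field(default_factory=dict)

    def add_node(self, name: str, fn: Stage) -> None:
        self.nodes[name] = fn
        self.edges.setdefault(name, set())

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown component in edge {src!r} -> {dst!r}")
        self.edges[src].add(dst)

    def run_walk(self, walk: List[str], payload: List[str]) -> List[str]:
        """Validate the walk against the graph, then run each stage in order."""
        for i, name in enumerate(walk):
            if name not in self.nodes:
                raise KeyError(f"unknown component {name!r}")
            if i > 0 and name not in self.edges[walk[i - 1]]:
                raise ValueError(f"no edge {walk[i - 1]!r} -> {name!r}")
        for name in walk:
            payload = self.nodes[name](payload)
        return payload


def make_stage(name: str) -> Stage:
    """Stub component: appends its own name so the traversal is visible."""
    return lambda trace: trace + [name]


def build_demo_graph() -> ComponentGraph:
    graph = ComponentGraph()
    for name in ("text_encoder", "llm_backbone", "diffusion_head", "audio_codec"):
        graph.add_node(name, make_stage(name))
    graph.add_edge("text_encoder", "llm_backbone")
    graph.add_edge("llm_backbone", "diffusion_head")  # image generation path
    graph.add_edge("llm_backbone", "audio_codec")     # speech generation path
    return graph


graph = build_demo_graph()
# Two tasks share the same graph but follow different walks.
text_to_image = graph.run_walk(
    ["text_encoder", "llm_backbone", "diffusion_head"], ["prompt"]
)
text_to_speech = graph.run_walk(
    ["text_encoder", "llm_backbone", "audio_codec"], ["prompt"]
)
print(text_to_image)
print(text_to_speech)
```

In this sketch, adding a new modality means registering one more node and its edges rather than writing a new pipeline, which is the kind of composition the article attributes to the Walk Graph. Placement onto a cluster and runtime optimizations are outside what this toy covers.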
Performance Benchmarks
The researchers instantiated M* on representative models and measured its performance against baselines. The results are summarized in the table below:
| Workload | Model | Metric | Improvement over Baseline |
|---|---|---|---|
| Text-to-image | BAGEL | End-to-end latency | 20% lower than vLLM-Omni |
| Text-to-speech | Qwen3-Omni | Real-time factor | Up to 2.9x lower |
| Text-to-speech | Qwen3-Omni | Throughput | Up to 2.7x higher |
| Robotic planning | V-JEPA 2-AC rollout | Planning speed | Up to 12.5x faster |
For text-to-image workloads on the BAGEL model, M* achieved 20% lower end-to-end latency than vLLM-Omni. On text-to-speech tasks with Qwen3-Omni, M* delivered up to 2.9x lower real-time factor and up to 2.7x higher throughput. In robotic planning, M* ran V-JEPA 2-AC rollouts up to 12.5x faster than the baseline.
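To make the units concrete, the short calculation below shows what a "20% lower" latency and a "2.9x lower" real-time factor mean arithmetically. It assumes the common definition of real-time factor (processing time divided by the duration of the audio produced, so values below 1 are faster than real time); the article does not spell out the definition. The baseline numbers are made up for illustration and are not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Common definition: time spent generating / duration of audio generated."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def reduce_by_factor(value: float, factor: float) -> float:
    """'N x lower' means the value is divided by N."""
    return value / factor


def reduce_by_percent(value: float, percent: float) -> float:
    """'P% lower' means the value is multiplied by (1 - P/100)."""
    return value * (1.0 - percent / 100.0)


# Hypothetical baseline: 5.8 s of compute to synthesize 10 s of speech.
baseline_rtf = real_time_factor(processing_seconds=5.8, audio_seconds=10.0)
improved_rtf = reduce_by_factor(baseline_rtf, 2.9)

# Hypothetical baseline: 10 s end-to-end latency for one image.
baseline_latency = 10.0
improved_latency = reduce_by_percent(baseline_latency, 20.0)

print(f"RTF: {baseline_rtf:.2f} -> {improved_rtf:.2f}")
print(f"Latency: {baseline_latency:.2f} s -> {improved_latency:.2f} s")
```

Note that the two kinds of figure are not directly comparable: a 20% reduction corresponds to a 1.25x speedup, while 2.9x lower corresponds to roughly a 66% reduction.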
Developer Effort and Extensibility
The authors emphasize that M* paves the way toward more efficient serving of complex models with minimal developer effort. By providing a modular, extensible framework, M* allows AI teams to compose and deploy multimodal models without rearchitecting the serving stack for each new model family.
Implications for Enterprise AI Infrastructure
For enterprise technology leaders evaluating AI serving solutions, M* represents a departure from monolithic inference frameworks. Its ability to handle vision, language, speech, and action components within a single dataflow graph could simplify infrastructure for applications like multimodal customer service bots, autonomous systems, and content generation pipelines. If the reported gains carry over to production workloads, including the 20% latency reduction on text-to-image generation and the larger speedups on speech synthesis and robotic planning, they could translate into lower operational costs and faster response times.
As composite AI models become more prevalent, serving infrastructure must evolve. M* demonstrates a path forward that is both performant and flexible, reducing the friction of deploying next-generation multimodal systems. The research team includes Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, and Stephanie Wang.