When combining multiple pre-trained diffusion models to generate images, one model may dominate or models may disagree, leading to poor results. Researchers from MIT, Aalto University, and other institutions have proposed Divide-and-Denoise, a method that coordinates multiple diffusion models during sampling using a game-theoretic approach. The technique creates a fair but efficient division of labor across models, analogous to managing a specialized workforce.
The Problem of Model Composition
Pre-trained diffusion models are abundant, and composing them offers the opportunity to leverage diverse expertise. However, without coordination, one model may overpower others or conflicting outputs may arise. According to the paper, 'Combining several models, however, runs the risk of one model dominating or models disagreeing with each other.' This can result in common failures such as missing objects or mismatched attributes in generated images.
How Divide-and-Denoise Works
Central to Divide-and-Denoise is the notion of an allocation, which defines the responsibility of each model to every region of the noisy sample. At each timestep, the method performs two steps:
- Update the allocation by solving a fair division game, where the sample is divided into regions that maximize total utility under fairness constraints.
- Align the models with this allocation, guiding each model to denoise within its assigned region.
This leads to a new composite denoising process that evolves in tandem with a division process. The method ensures that each model's expertise is utilized without neglecting any other model.
"Much like managing a specialized workforce, our method creates a fair but efficient division of labor across models." — from the paper
Evaluation and Results
The researchers evaluated Divide-and-Denoise on conditional image generation using several quality metrics, including the GenEval benchmark. The method outperformed baselines and resolved common failures:
| Common Failure | Resolution by Divide-and-Denoise |
|---|---|
| Missing objects | Models are allocated to specific regions, ensuring all objects appear |
| Mismatched attributes | Fair division prevents attribute conflicts between models |
Experiments show that Divide-and-Denoise utilizes each model's expertise without neglecting any other model, achieving superior composite generation quality.
Implications for Composable AI Systems
While the paper focuses on image generation, the principle of fair division of labor among specialized models has broader implications for composable AI systems. Enterprise applications that rely on multiple models—such as multi-modal generation or ensemble decision-making—could benefit from similar game-theoretic coordination to avoid dominance and disagreement.
The authors of the paper are Gupta, Abhi; Barabanshchikova, Polina; Garg, Vikas; Kaski, Samuel; and Jaakkola, Tommi. The full preprint is available on arXiv under a Creative Commons license. As AI models become more specialized and numerous, methods like Divide-and-Denoise will be critical for building fair and effective composite systems.