Distribution Matching Distillation (DMD)
distills score-based generative models
into efficient one-step generators, without requiring a one-to-one correspondence
with the sampling trajectories of their teachers. However, limited model capacity
causes one-step distilled models to underperform on complex generative tasks, e.g.,
synthesizing intricate object motions in text-to-video generation.
Directly extending DMD to multi-step distillation increases memory usage and computational
depth, leading to instability and reduced efficiency.
While prior works propose
stochastic gradient truncation as a potential solution,
we observe that it substantially reduces the generation diversity of multi-step distilled models,
reducing it to the level of their one-step counterparts.
To address these limitations,
we propose Phased DMD, a multi-step distillation framework that combines
phase-wise distillation with Mixture-of-Experts (MoE), reducing learning
difficulty while enhancing model capacity. Phased DMD is built upon two key
ideas:
progressive distribution matching and
score matching within subintervals.
First, our method divides the SNR range into subintervals and
progressively refines the model toward higher SNR levels, better capturing complex distributions.
Next, to ensure that the training objective within each subinterval is accurate,
we conduct rigorous mathematical derivations. We validate Phased DMD by distilling
state-of-the-art image and video generation models, including Qwen-Image (20B
parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that
Phased DMD preserves output diversity better than DMD while retaining key generative capabilities of the original models.
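The phase-partitioning idea above can be sketched minimally: split the diffusion timestep range (a proxy for SNR) into subintervals and route each denoising step to the expert owning that subinterval. This is an illustrative sketch under assumed conventions, not the paper's actual implementation; all function names (`make_phases`, `expert_for`) and the uniform partition are hypothetical.

```python
# Hypothetical sketch of phase-wise routing in the spirit of Phased DMD:
# partition the timestep range into K phases, one expert per phase.
# Only the expert whose phase contains the current step is active,
# giving MoE-style capacity without extra per-step compute.

def make_phases(t_min: float, t_max: float, k: int):
    """Split [t_min, t_max] into k equal subintervals (phases)."""
    width = (t_max - t_min) / k
    return [(t_min + i * width, t_min + (i + 1) * width) for i in range(k)]

def expert_for(t: float, phases):
    """Return the index of the phase (expert) that owns timestep t."""
    for i, (lo, hi) in enumerate(phases):
        if lo <= t < hi:
            return i
    return len(phases) - 1  # clamp t == t_max to the last phase

phases = make_phases(0.0, 1.0, 4)
assert expert_for(0.95, phases) == 3  # last (highest-noise) phase
```

In a multi-step distilled sampler, each step's noise level selects exactly one phase, so each expert only needs to match the teacher's distribution within its own subinterval.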