Phased DMD

Few-step Distribution Matching Distillation via Score Matching within Subintervals

Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, Lei Yang

SenseTime Research

Abstract
Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. However, limited model capacity causes one-step distilled models to underperform on complex generative tasks, e.g., synthesizing intricate object motions in text-to-video generation. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation (SGTS) as a potential solution, we observe that it substantially reduces the generation diversity of multi-step distilled models, bringing it down to the level of their one-step counterparts. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD is built upon two key ideas: progressive distribution matching and score matching within subintervals. First, our method divides the SNR range into subintervals and progressively refines the model toward higher SNR levels, allowing it to better capture complex distributions. Second, we rigorously derive the training objective within each subinterval to ensure its correctness. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image (20B parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities of the original models.
Phased DMD for Video Generation
In contrast to DMD with SGTS, Phased DMD achieves higher fidelity in replicating both the motion patterns and the camera movements present in the teacher model's video outputs. The video below presents a comparative analysis of the two methods on the Wan2.2-T2V-A14B model. The first column features outputs from the model distilled using DMD with SGTS, the second column displays results from Phased DMD, and the third column presents generations from the base teacher model. The compilation comprises all 220 test examples from our evaluation set, with all videos generated using a fixed seed of 42; the presentation is not cherry-picked and is representative of typical performance. As the results demonstrate, the base model, using 40 steps (80 NFE), represents the upper bound of performance. Phased DMD, using only 4 steps (4 NFE), yields significantly better results than DMD with SGTS, particularly in the fidelity of motion dynamics and camera movement. Notably, the color rendition of Phased DMD generations also aligns more closely with that of the base model.
We quantitatively evaluate motion intensity using the mean absolute optical flow (computed with Unimatch) and the dynamic degree metric from VBench. The results, summarized in the table below, confirm that Phased DMD captures motion dynamics more accurately than DMD with SGTS on both T2V and I2V tasks; a sketch of the optical-flow metric follows the table.
| Method | T2V Optical Flow ↑ | T2V Dynamic Degree ↑ | I2V Optical Flow ↑ | I2V Dynamic Degree ↑ |
|---|---|---|---|---|
| Base model | 10.26 | 79.55% | 9.32 | 82.27% |
| DMD with SGTS (lightning v1.x) | 3.23 | 65.45% | 7.87 | 80.00% |
| Phased DMD (lightning v2.0) | 9.30 | 82.27% | 9.84 | 83.64% |
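
The sketch below shows one way to compute the mean absolute optical flow for a single clip. Here `estimate_flow` is a hypothetical wrapper around a Unimatch model that returns a dense displacement field; the exact pre-processing and averaging used in the evaluation may differ.

```python
# Sketch of the motion-intensity metric. `frames` is a list of [3, H, W] tensors
# for one video in temporal order; `estimate_flow(a, b)` is an assumed wrapper
# around a Unimatch model returning a flow field of shape [2, H, W] in pixels.
import torch

def mean_absolute_optical_flow(frames, estimate_flow):
    magnitudes = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        flow = estimate_flow(prev, curr)            # [2, H, W] displacement field
        magnitudes.append(flow.norm(dim=0).mean())  # mean flow magnitude for this frame pair
    return torch.stack(magnitudes).mean().item()    # average over the whole clip
```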
Phased DMD for Image Generation
We perform image-generation distillation experiments on three base models: Wan2.1-T2V-14B, Wan2.2-T2V-A14B, and Qwen-Image-20B. To evaluate generative diversity, we construct a text-to-image test set of 21 prompts; each prompt gives a short description of the image content without detailed specifications. For each prompt, we generate 8 images using seeds 0 through 7. The base models are sampled with 40 steps and a CFG scale of 4, while all distilled models are sampled with 4 steps and a CFG scale of 1. Generative diversity is evaluated using two complementary metrics: (1) the mean pairwise cosine similarity of DINOv3 features, where lower values indicate higher diversity, and (2) the mean pairwise LPIPS distance, where higher values denote greater diversity. Both metrics are computed across images generated from the same prompt with different seeds (a sketch of this computation follows the table). As expected, the base models achieve the highest diversity. Phased DMD outperforms both vanilla DMD and DMD with SGTS, although the diversity improvement on Qwen-Image is marginal; we attribute this to the base model's own limited output diversity.
| Method | Wan2.1-T2V-14B DINOv3 ↓ | Wan2.1-T2V-14B LPIPS ↑ | Wan2.2-T2V-A14B DINOv3 ↓ | Wan2.2-T2V-A14B LPIPS ↑ | Qwen-Image DINOv3 ↓ | Qwen-Image LPIPS ↑ |
|---|---|---|---|---|---|---|
| Base model | 0.708 | 0.607 | 0.732 | 0.531 | 0.907 | 0.483 |
| DMD | 0.825 | 0.522 | - | - | - | - |
| DMD with SGTS | 0.826 | 0.521 | 0.828 | 0.447 | 0.941 | 0.309 |
| Phased DMD (Ours) | 0.782 | 0.544 | 0.768 | 0.481 | 0.958 | 0.322 |
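
The sketch below illustrates how the two per-prompt diversity metrics can be computed. It assumes `dino_features` is a hypothetical wrapper returning a global DINOv3 feature vector for a single image and uses the off-the-shelf `lpips` package; it is an illustrative sketch rather than the exact evaluation code.

```python
# Per-prompt diversity metrics. `images` is a list of generated samples for one
# prompt (tensors of shape [3, H, W], values in [-1, 1]); `dino_features(x)` is an
# assumed wrapper around a DINOv3 backbone returning a 1-D global feature vector.
import itertools
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network

def diversity_metrics(images, dino_features):
    feats = torch.stack([F.normalize(dino_features(x), dim=-1) for x in images])
    cos_sims, lpips_dists = [], []
    for i, j in itertools.combinations(range(len(images)), 2):
        # Mean pairwise DINOv3 cosine similarity: lower means more diverse.
        cos_sims.append(torch.dot(feats[i], feats[j]).item())
        # Mean pairwise LPIPS distance: higher means more diverse.
        with torch.no_grad():
            lpips_dists.append(lpips_fn(images[i][None], images[j][None]).item())
    return sum(cos_sims) / len(cos_sims), sum(lpips_dists) / len(lpips_dists)
```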


The following four images were generated by the Wan2.1-T2V-14B model distilled with Phased DMD, from the prompt: "A mother braiding her daughter hair, sunlight warming the room." The outputs exhibit considerable diversity in color schemes, compositional structure, and lighting conditions.
Phased DMD successfully maintains high image quality, particularly the precise text-rendering capability inherent to the base Qwen-Image model. Representative examples are presented below.
Method
The figure below illustrates the differences among vanilla DMD, DMD with SGTS, and our proposed Phased DMD. (a) Directly extending DMD to a multi-step distillation paradigm increases memory usage and computational depth. (b) SGTS addresses this by randomly selecting one generation step as the final diffusion step and recording gradients only for that step. While this strategy substantially reduces memory consumption and computational cost during training, it can degenerate into one-step distillation in certain iterations, thereby limiting the model's generative capacity. (c) Phased DMD avoids this issue by partitioning the distillation process into distinct phases and applying supervision at intermediate timesteps. In each phase except the last, the generator is optimized to minimize the reverse KL divergence at an intermediate timestep, while the fake diffusion model is updated via score matching within a subinterval of the diffusion process. (d) Crucially, Phased DMD remains compatible with SGTS, enabling 4-step inference across 2 phases while reducing system complexity for both training and inference. A schematic sketch of one training iteration is given below.
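
To make the per-phase updates concrete, the sketch below outlines one Phased DMD training iteration. It is a simplified schematic rather than the released implementation: the x0-prediction parameterization, the linear noising schedule, the module interfaces (`generator`, `real_denoiser`, `fake_denoiser`), and the two-phase boundary are illustrative assumptions, and practical loss weightings are omitted.

```python
# Schematic sketch of one Phased DMD training iteration. `generator`,
# `real_denoiser` (frozen teacher), and `fake_denoiser` are assumed to predict
# clean samples x0 from a noisy input and a timestep t in [0, 1] (t = 1 is pure
# noise); latents are flattened to shape [B, D] for simplicity.
import torch
import torch.nn.functional as F

def phased_dmd_step(generator, real_denoiser, fake_denoiser, x_in, phase,
                    gen_opt, fake_opt):
    t_hi, t_lo = phase  # subinterval boundaries, e.g. (1.0, 0.5) for the first of two phases
    batch = x_in.shape[0]

    # 1) The generator denoises the previous phase's (re-noised) output -- or pure
    #    noise in the first phase -- down to the subinterval boundary t_lo.
    x_gen = generator(x_in, torch.full((batch,), t_hi))

    # 2) Re-noise the generated sample to a random timestep inside [t_lo, t_hi).
    t = t_lo + (t_hi - t_lo) * torch.rand(batch)
    eps = torch.randn_like(x_gen)
    x_t = (1.0 - t.view(-1, 1)) * x_gen + t.view(-1, 1) * eps

    # 3) Score matching within the subinterval: fit the fake denoiser to the
    #    generator's current output distribution (generator detached).
    fake_loss = F.mse_loss(fake_denoiser(x_t.detach(), t), x_gen.detach())
    fake_opt.zero_grad(); fake_loss.backward(); fake_opt.step()

    # 4) Distribution matching: the difference between the fake and the frozen
    #    teacher ("real") predictions at (x_t, t) acts as the reverse-KL gradient
    #    on the generator output.
    with torch.no_grad():
        grad = fake_denoiser(x_t, t) - real_denoiser(x_t, t)
    gen_loss = 0.5 * F.mse_loss(x_gen, (x_gen - grad).detach())
    gen_opt.zero_grad(); gen_loss.backward(); gen_opt.step()

    return x_gen.detach()  # starting point for the next phase (after re-noising)
```

The squared-error form of the generator loss is a common way to implement the DMD gradient: its derivative with respect to `x_gen` equals the (fake minus real) prediction difference, so the optimizer pushes the generator's outputs in the direction that reduces the estimated reverse KL.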
Conclusion and Discussion
Phased DMD primarily enhances structural aspects of generation, such as image composition diversity, motion dynamics, and camera control. However, for base models like Qwen-Image, whose outputs are inherently less diverse, the improvement is less pronounced. While this work demonstrates phased distillation within the DMD framework, the approach generalizes to other objectives, such as the Fisher divergence used in SiD, which we leave for future exploration. Other methods for enhancing diversity and dynamics, such as incorporating trajectory data pre-generated by the base model, could also be integrated; however, this would compromise the data-free advantage central to DMD. While we may explore such directions in the future, this work prioritizes the data-free paradigm.

References

  1. Tianwei Yin, Michael Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems, 37:47455–47487, 2024.
  2. Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.
  3. Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  4. Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807–21818, 2024.
  5. Oriane Simeoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothee Darcet, Theo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Herve Jegou, Patrick Labatut, and Piotr Bojanowski. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
  6. Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  7. Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In Forty-first International Conference on Machine Learning, 2024.