Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

1State University of New York at Buffalo,  2Microsoft
NeurIPS 2024

TL;DR: Our motion consistency model not only accelerates the sampling process of text-to-video diffusion models, but also benefits from an additional high-quality image dataset to improve the frame quality of generated videos.

4-step generation results

High-resolution video

Pose-conditioned video

*epiCRealism is used as the AnimateDiff base model.

Frame quality improvement

Columns (left to right): Teacher (ModelScopeT2V, 50 steps), Ours + WebVid (4 steps), Ours + LAION-aesthetic (4 steps), Ours + Anime (4 steps), Ours + Realistic (4 steps), Ours + 3D Cartoon (4 steps)

Prompts:
Aerial uhd 4k view. mid-air flight over fresh and clean mountain river at sunny summer morning. Green trees and sun rays on horizon. Direct on sun.
Back of woman in shorts going near pure creek in beautiful mountains.
Misty mountain landscape
A rotating pandoro (a traditional italian sweet yeast bread, most popular around christmas and new year) being eaten in time-lapse.
Slow motion avocado with a stone falls and breaks into 2 parts with splashes

*For the anime, realistic, and 3D cartoon styles, we leverage 500k generated image-caption pairs produced with the fine-tuned Stable Diffusion models ToonYou beta 6, RealisticVision v6, and Disney Pixar Cartoon, respectively.

Abstract

Image diffusion distillation achieves high-fidelity generation with very few sampling steps. However, directly applying these techniques to video models results in unsatisfactory frame quality. This issue arises from the limited frame appearance quality in public video datasets, affecting the performance of both teacher and student video diffusion models. Our study aims to improve video diffusion distillation while enabling the student model to improve frame appearance using abundant high-quality image data. To this end, we propose motion consistency models (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning. Specifically, MCM involves a video consistency model that distills motion from the video teacher model, and an image discriminator that boosts frame appearance to match high-quality image data. However, directly combining these components leads to two significant challenges: a conflict in frame learning objectives, where video distillation learns from low-quality video frames while the image discriminator targets high-quality images, and training-inference discrepancies due to the differing quality of video samples used during training and inference. To address these challenges, we introduce disentangled motion distillation and mixed trajectory distillation. The former applies the distillation objective solely to the motion representation, while the latter mitigates training-inference discrepancies by mixing distillation trajectories from both the low- and high-quality video domains. Extensive experiments show that our MCM achieves state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic value or specific styles.

Method

Motivation

Our motion consistency model not only distills the motion prior from the teacher to accelerate sampling, but also benefits from an additional high-quality image dataset to improve the frame quality of generated videos.

Framework

Left: framework overview. Our motion consistency model features disentangled motion-appearance distillation, where motion is learned via the motion consistency distillation loss \(\mathcal{L}_{\text{MCD}}\), and the appearance is learned with the frame adversarial objective \(\mathcal{L}_{\text{adv}}^{\text{G}}\).

Right: mixed trajectory distillation. We simulate the inference-time ODE trajectory using student-generated video (bottom green line), which is mixed with the real video ODE trajectory (top green line) for consistency distillation training.
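
For concreteness, here is a minimal PyTorch-style sketch of the generator-side objectives in a single training step. The callables (`student`, `ema_student`, `teacher_ode_step`, `frame_disc`) and the simple frame-difference motion representation are illustrative assumptions, not the paper's actual implementation; the discriminator's own loss on high-quality images is omitted.

```python
import torch
import torch.nn.functional as F


def extract_motion(video):
    # Hypothetical motion representation: temporal frame differences.
    # video: (B, T, C, H, W); the paper disentangles motion from appearance,
    # and frame differences are used here purely for illustration.
    return video[:, 1:] - video[:, :-1]


def mcm_training_losses(student, ema_student, teacher_ode_step, frame_disc,
                        x_t, t, s, text_emb):
    """One generator-side step of disentangled motion-appearance distillation (sketch).

    L_MCD is applied only to the motion representation of the two consistency
    predictions, while L_adv^G pushes individual frames toward the
    high-quality image distribution. All callables are placeholders.
    """
    # Consistency prediction at the current noise level t.
    pred_t = student(x_t, t, text_emb)

    with torch.no_grad():
        # One teacher ODE solver step from t to an earlier level s < t,
        # followed by the EMA student's consistency prediction.
        x_s = teacher_ode_step(x_t, t, s, text_emb)
        pred_s = ema_student(x_s, s, text_emb)

    # Disentangled motion consistency distillation: compare motion only.
    loss_mcd = F.mse_loss(extract_motion(pred_t), extract_motion(pred_s))

    # Frame-level adversarial objective (non-saturating GAN generator loss):
    # flatten the video into individual frames before the image discriminator.
    b, n_frames = pred_t.shape[:2]
    frames = pred_t.reshape(b * n_frames, *pred_t.shape[2:])
    loss_adv_g = F.softplus(-frame_disc(frames)).mean()

    return loss_mcd, loss_adv_g
```

In an actual training loop, `loss_adv_g` would be weighted against `loss_mcd`, and the frame discriminator would be updated separately on high-quality images versus generated frames.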
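Likewise, a minimal sketch of the trajectory-mixing step, assuming a hypothetical `student_sampler` that wraps the few-step student generator and an assumed mixing probability `p_student`:

```python
import torch


def pick_trajectory_video(student_sampler, real_video, text_emb,
                          p_student=0.5, num_steps=4):
    """Mixed trajectory distillation (sketch).

    With probability p_student, the low-quality real video is replaced by a
    few-step student generation, so the ODE trajectories used for consistency
    distillation also cover the distribution seen at inference time.
    """
    if torch.rand(()).item() < p_student:
        with torch.no_grad():
            # Simulated inference-time trajectory: start from noise and run
            # the student's few-step sampler (hypothetical interface).
            noise = torch.randn_like(real_video)
            clean_video = student_sampler(noise, text_emb, num_steps=num_steps)
    else:
        # Real-video trajectory: use the dataset video directly.
        clean_video = real_video

    # The selected clean video is then forward-diffused to produce x_t for
    # the motion consistency distillation loss sketched above.
    return clean_video
```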

Qualitative Comparison on Diffusion Distillation (ModelScopeT2V teacher)

Columns: Teacher (ModelScopeT2V, 50 steps), Latent consistency model (1 step), Ours + Realistic (1 step)

Prompts:
Slow motion avocado with a stone falls and breaks into 2 parts with splashes
Slow motion of delicious salmon sachimi set with green vegetables leaves served on wood plate. make homemade japanese food at home.-dan

Columns: Teacher (ModelScopeT2V, 50 steps), Latent consistency model (2 steps), Ours + Anime/3D Cartoon (2 steps)

Prompts:
Blooming meadow panorama zoom-out shot heavenly clouds and upcoming thunderstorm in mountain range harz, germany.
Slow motion of delicious salmon sachimi set with green vegetables leaves served on wood plate. make homemade japanese food at home.-dan

Columns: Teacher (ModelScopeT2V, 50 steps), Latent consistency model (4 steps), Ours + LAION-aesthetic/3D Cartoon (4 steps)

Prompts:
Great blue heron clearing his feather on his face
Observatory near to church in a countryside. helicopter camera flaying above the observatory witch is standing next to the church in beautiful landscape.

Qualitative Comparison on Diffusion Distillation (AnimateDiff teacher)

Teacher (AnimateDiff, 50 steps) compared with DPM++, AnimateLCM, AnimateDiff Lightning, and Ours, each at 2 and 4 steps.

Prompt: A young woman in a yellow sweater uses vr glasses, sitting on the shore of a pond on a background of dark waves. a strong wind develops her hair, the sun's rays are reflected from the water.

Teacher (AnimateDiff, 50 steps) compared with DPM++, AnimateLCM, AnimateDiff Lightning, and Ours, each at 2 and 4 steps.

Prompt: Female running at sunset. healthy fitness concept

*For better visualization, we use RealisticVision v6 as the AnimateDiff base model. For quantitative comparison, we follow AnimateLCM and use plain Stable Diffusion v1.5 as the base model.

Quantitative Comparison

Diffusion Distillation Comparison

Video diffusion distillation comparison on the WebVid mini validation set. We achieve the best FVD and CLIPSIM using 1, 2, and 4 sampling steps.


Zero-shot video diffusion distillation comparison on the MSRVTT validation set. We achieve the best FVD and CLIPSIM using 1, 2, and 4 sampling steps.


Frame Quality Improvement Comparison

Compared with the two-stage method, our method better aligns frame quality with the image data, achieving a lower FID.

BibTeX

@article{zhai2024motion,
  title={Motion Consistency Model: Accelerating Video Diffusion with Disentangled
  Motion-Appearance Distillation},
  author={Zhai, Yuanhao and Lin, Kevin and Yang, Zhengyuan and Li, Linjie and Wang, Jianfeng and Lin, Chung-Ching and Doermann, David and Yuan, Junsong and Wang, Lijuan},
  year={2024},
  journal={arXiv preprint arXiv:2406.06890},
  website={https://yhzhai.github.io/mcm/},
}