Video Diffusion Models

We present results on video generation using diffusion models. We propose an architecture for video diffusion models which is a natural extension of the standard image architecture. We show that this architecture is effective for jointly training from image and video data. To generate long and higher-resolution videos, we introduce a new conditioning technique that performs better than previously proposed methods. We present results on text-conditioned video generation and state-of-the-art results on an unconditional video generation benchmark.

Example results

Samples from a text-conditioned video diffusion model, conditioned on the string "fireworks".
More samples from a text-conditioned video diffusion model. The conditioning string is displayed above each sample.

Summary

Diffusion models have recently produced high quality results in domains such as image generation and audio generation, and there is significant interest in validating diffusion models in new data modalities. In this work, we present first results on video generation using diffusion models, for both unconditional and conditional settings. Prior work on video generation has typically employed other types of generative models, such as GANs, VAEs, flow-based models, and autoregressive models.

We show that high quality videos can be generated by essentially the standard formulation of the Gaussian diffusion model, with little modification other than straightforward architectural changes to accommodate video data within the memory constraints of deep learning accelerators. We train models that generate a fixed-length block of video frames, and to generate videos longer than that block, we show how to repurpose a trained model so that it generates block-autoregressively over frames. We test our methods on an unconditional video generation benchmark, where we achieve state-of-the-art sample quality scores, and we also show promising results on text-conditioned video generation.
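The sketch below illustrates the block-autoregressive extension described above in JAX. It is a minimal illustration, not our implementation: sample_block is a hypothetical stand-in for the learned sampler (here stubbed out with noise so the loop runs), and BLOCK_LEN, OVERLAP, and the frame shape are assumed values.

import jax
import jax.numpy as jnp

BLOCK_LEN, OVERLAP = 16, 4          # frames per block, frames reused as context (assumed values)
HEIGHT, WIDTH, CHANNELS = 64, 64, 3

def sample_block(rng, cond_frames=None):
    """Stand-in for the trained sampler: returns a (BLOCK_LEN, H, W, C) block.

    In the real model, cond_frames (trailing frames of the previous block)
    would condition the reverse diffusion process; this stub ignores them.
    """
    return jax.random.normal(rng, (BLOCK_LEN, HEIGHT, WIDTH, CHANNELS))

def sample_long_video(rng, num_blocks):
    """Generate a long video one block at a time, block-autoregressively."""
    rng, sub = jax.random.split(rng)
    video = sample_block(sub)                    # first block: unconditional
    for _ in range(num_blocks - 1):
        rng, sub = jax.random.split(rng)
        context = video[-OVERLAP:]               # condition on trailing frames
        new_block = sample_block(sub, cond_frames=context)
        # Keep only the newly generated frames beyond the conditioning overlap.
        video = jnp.concatenate([video, new_block[OVERLAP:]], axis=0)
    return video

video = sample_long_video(jax.random.PRNGKey(0), num_blocks=3)
print(video.shape)  # (40, 64, 64, 3): 16 frames plus 2 blocks of 12 new frames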

Gradient conditioning method

One of our main innovations is a new conditional generation method for unconditional diffusion models. Our new conditioning method, which we refer to as the gradient method, modifies the sampling procedure of the model to improve a conditioning loss on denoised data using gradient-based optimization. We find that the gradient method is more effective than existing methods at ensuring consistency of the generated samples with the conditioning information. We use the gradient method to autoregressively extend our videos to more frames and higher resolutions.

Frames from our gradient method (left) and a baseline "replacement" method (right) for autoregressive extension. Videos sampled using the gradient method attain superior temporal coherence compared to the baseline method.
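To make the mechanism concrete, here is a minimal sketch of the gradient conditioning idea: at each sampling step, the denoised estimate is nudged by the gradient of a loss measuring how far the model's reconstruction of the known frames is from the given conditioning frames. The denoise function, guided_denoise, and guidance_weight below are hypothetical placeholders, and the update is a simplified illustration rather than the exact rule from the paper.

import jax
import jax.numpy as jnp

def denoise(z_t, t):
    """Stand-in for the learned denoiser x_hat(z_t, t); placeholder so the example runs."""
    return z_t / (1.0 + t)

def guided_denoise(z_t, t, cond_frames, num_cond, guidance_weight=1.0):
    """Adjust the denoised estimate using the gradient of a conditioning loss.

    The loss compares the denoised version of the first num_cond frames of the
    block against the given conditioning frames.
    """
    def cond_loss(z):
        x_hat = denoise(z, t)
        return 0.5 * jnp.sum((x_hat[:num_cond] - cond_frames) ** 2)

    grad = jax.grad(cond_loss)(z_t)           # gradient w.r.t. the noisy latent
    x_hat = denoise(z_t, t)
    return x_hat - guidance_weight * grad     # guided denoised estimate

# Toy usage: a block of 16 noisy frames, conditioned on 4 known frames.
rng = jax.random.PRNGKey(0)
z_t = jax.random.normal(rng, (16, 64, 64, 3))
cond = jnp.zeros((4, 64, 64, 3))
x_tilde = guided_denoise(z_t, t=0.5, cond_frames=cond, num_cond=4)
print(x_tilde.shape)  # (16, 64, 64, 3)

The baseline "replacement" method, by contrast, overwrites the conditioning frames directly at each step without propagating a gradient through the denoiser, which is why it tends to produce weaker temporal coherence, as seen in the comparison above.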

Additional techniques

The basic techniques we employ are as follows (details can be found in our full paper):

- An architecture for video diffusion models that is a natural extension of the standard image diffusion architecture.
- Joint training from image and video data.
- The gradient conditioning method, used to extend trained models block-autoregressively to longer and higher-resolution videos.

Citation

For more details and additional results, read the full paper.

@article{ho2022video,
  title={Video diffusion models},
  author={Ho, Jonathan and Salimans, Tim and Gritsenko, Alexey and Chan, William and Norouzi, Mohammad and Fleet, David J},
  journal={arXiv:2204.03458},
  year={2022}
}