Video Diffusion Models

We present results on video generation using diffusion models. We propose an architecture for video diffusion models which is a natural extension of the standard image architecture. We show that this architecture is effective for jointly training from image and video data. To generate long and higher-resolution videos, we introduce a new conditioning technique that performs better than previously proposed methods. We present results on text-conditioned video generation and state-of-the-art results on an unconditional video generation benchmark.

Example results

Samples from a text-conditioned video diffusion model, conditioned on the string "fireworks".
More samples from a text-conditioned video diffusion model. The conditioning string is displayed above each sample.

Summary

Diffusion models have recently produced high quality results in domains such as image generation and audio generation, and there is significant interest in validating diffusion models in new data modalities. In this work, we present first results on video generation using diffusion models, for both unconditional and conditional settings. Prior work on video generation has typically employed other types of generative models, such as GANs, VAEs, flow-based models, and autoregressive models.

We show that high quality videos can be generated by essentially the standard formulation of the Gaussian diffusion model, with little modification other than straightforward architectural changes to accommodate video data within the memory constraints of deep learning accelerators. We train models that generate a fixed-length block of video frames, and to generate videos longer than that block, we show how to repurpose a trained model so that it generates block-autoregressively over frames. We test our methods on an unconditional video generation benchmark, where we achieve state-of-the-art sample quality scores, and we also show promising results on text-conditioned video generation.
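The sketch below illustrates the block-autoregressive extension described above in JAX. It is a minimal illustration, not our implementation: sample_block is a hypothetical stand-in for the learned sampler (here stubbed out with noise so the loop runs), and BLOCK_LEN, OVERLAP, and the frame shape are assumed values.

import jax
import jax.numpy as jnp

BLOCK_LEN, OVERLAP = 16, 4          # frames per block, frames reused as context (assumed values)
HEIGHT, WIDTH, CHANNELS = 64, 64, 3

def sample_block(rng, cond_frames=None):
    """Stand-in for the trained sampler: returns a (BLOCK_LEN, H, W, C) block.

    In the real model, cond_frames (trailing frames of the previous block)
    would condition the reverse diffusion process; this stub ignores them.
    """
    return jax.random.normal(rng, (BLOCK_LEN, HEIGHT, WIDTH, CHANNELS))

def sample_long_video(rng, num_blocks):
    """Generate a long video one block at a time, block-autoregressively."""
    rng, sub = jax.random.split(rng)
    video = sample_block(sub)                    # first block: unconditional
    for _ in range(num_blocks - 1):
        rng, sub = jax.random.split(rng)
        context = video[-OVERLAP:]               # condition on trailing frames
        new_block = sample_block(sub, cond_frames=context)
        # Keep only the newly generated frames beyond the conditioning overlap.
        video = jnp.concatenate([video, new_block[OVERLAP:]], axis=0)
    return video

video = sample_long_video(jax.random.PRNGKey(0), num_blocks=3)
print(video.shape)  # (40, 64, 64, 3): 16 frames plus 2 blocks of 12 new frames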

Gradient conditioning method

One of our main innovations is a new conditional generation method for unconditional diffusion models. Our new conditioning method, which we refer to as the gradient method, modifies the sampling procedure of the model to improve a conditioning loss on denoised data using gradient-based optimization. We find that the gradient method is more effective than existing methods at ensuring consistency of the generated samples with the conditioning information. We use the gradient method to autoregressively extend our videos to more frames and higher resolutions.

Frames from our gradient method (left) and a baseline "replacement" method (right) for autoregressive extension. Videos sampled using the gradient method attain superior temporal coherence compared to the baseline method.
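To make the mechanism concrete, here is a minimal sketch of the gradient conditioning idea: at each sampling step, the denoised estimate is nudged by the gradient of a loss measuring how far the model's reconstruction of the known frames is from the given conditioning frames. The denoise function, guided_denoise, and guidance_weight below are hypothetical placeholders, and the update is a simplified illustration rather than the exact rule from the paper.

import jax
import jax.numpy as jnp

def denoise(z_t, t):
    """Stand-in for the learned denoiser x_hat(z_t, t); placeholder so the example runs."""
    return z_t / (1.0 + t)

def guided_denoise(z_t, t, cond_frames, num_cond, guidance_weight=1.0):
    """Adjust the denoised estimate using the gradient of a conditioning loss.

    The loss compares the denoised version of the first num_cond frames of the
    block against the given conditioning frames.
    """
    def cond_loss(z):
        x_hat = denoise(z, t)
        return 0.5 * jnp.sum((x_hat[:num_cond] - cond_frames) ** 2)

    grad = jax.grad(cond_loss)(z_t)           # gradient w.r.t. the noisy latent
    x_hat = denoise(z_t, t)
    return x_hat - guidance_weight * grad     # guided denoised estimate

# Toy usage: a block of 16 noisy frames, conditioned on 4 known frames.
rng = jax.random.PRNGKey(0)
z_t = jax.random.normal(rng, (16, 64, 64, 3))
cond = jnp.zeros((4, 64, 64, 3))
x_tilde = guided_denoise(z_t, t=0.5, cond_frames=cond, num_cond=4)
print(x_tilde.shape)  # (16, 64, 64, 3)

The baseline "replacement" method, by contrast, overwrites the conditioning frames directly at each step without propagating a gradient through the denoiser, which is why it tends to produce weaker temporal coherence, as seen in the comparison above.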

Additional techniques

The basic techniques we employ are as follows (details can be found in our full paper):

- An architecture for video diffusion models that is a natural extension of the standard image diffusion architecture.
- Joint training from image and video data.
- The gradient conditioning method, used to extend trained models block-autoregressively to longer and higher-resolution videos.

Citation

For more details and additional results, read the full paper.

@article{ho2022video,
  title={Video diffusion models},
  author={Ho, Jonathan and Salimans, Tim and Gritsenko, Alexey and Chan, William and Norouzi, Mohammad and Fleet, David J},
  journal={arXiv:2204.03458},
  year={2022}
}