Blob Pitt's next big blockbuster

Generative models have reached a remarkable capacity to synthesize original instances after learning a data distribution.

In the arena of image generation, the recent SOTA tracks alongside advances in a family of models called generative adversarial networks or GANs. This framework of jointly training two networks gives rise to a learnable loss.

Despite these successes, GANs are challenged by training instabilities. The latest StyleGAN2-ADA mitigates the mode collapse that arises from overfit discriminators using adaptive discriminator augmentation.

We’ve recently explored another exciting SOTA family of image synthesis techniques called score-based generative models. From this modeling perspective, training data undergoes a diffusion process to help us learn the score: the gradient of the log data density.

$$ \nabla_\mathbf{x} \log p(\mathbf{x}) $$

Armed with an estimate for this quantity, we can perturb any point in $\mathbb{R}^D$ toward one more likely under the training data distribution. This motivated researchers to consider the evolution of randomly initialized points under the flow prescribed by this vector field.
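To make this concrete, here is a minimal Langevin dynamics sketch in NumPy; as an illustrative assumption, an analytic score for a 1-D standard Gaussian stands in for a learned score model:

```python
import numpy as np

# Toy score for a 1-D standard Gaussian: grad_x log p(x) = -x.
def score(x):
    return -x

# Langevin dynamics: repeatedly nudge points up the score field, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=1000)   # arbitrary initial points
eps = 0.1                             # step size
for _ in range(1000):
    x = x + 0.5 * eps * score(x) + np.sqrt(eps) * rng.normal(size=x.shape)

# After many steps, the points approximately sample the target distribution.
```

With a learned score network in place of the toy `score`, the same update rule underlies annealed Langevin sampling in score-based models.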

These diffusion processes can be modeled rather generally using the stochastic differential equation:

$$ \begin{align*} d \mathbf{x} = \mathbf{f}(\mathbf{x}, t) d t + g(t) d \mathbf{w}, \end{align*} $$

where $\mathbf{f}$ is a drift coefficient, $g$ a diffusion coefficient, and $\mathbf{w}$ a standard Wiener process.

To generate samples, we are ultimately more interested in the reverse-time dynamics of such diffusion processes. Framing the model this way, researchers can apply Anderson’s 1982 result from stochastic calculus to reverse the diffusion.
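Concretely, Anderson’s result yields a reverse-time SDE driven by the score:

$$ \begin{align*} d \mathbf{x} = \left[ \mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_\mathbf{x} \log p_t(\mathbf{x}) \right] d t + g(t) d \bar{\mathbf{w}}, \end{align*} $$

where $\bar{\mathbf{w}}$ is a Wiener process with time flowing backwards and $p_t$ is the marginal density at time $t$.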

score-sde schematic

Then integral curves under the learned “probability flow” help generate realistic instances from cheaply sampled Gaussian initial conditions.

Despite apparent similarities to normalizing flows, score-based models avoid the normalization challenge of computing high-dimensional integrals.

In fact, highly optimized ODE solvers can use the learned score vector field to generate samples by solving an initial value problem. Researchers also explored various sampling methods to improve the result quality, offering a nice template for extensions.
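As a sketch of this sampling scheme, we can integrate the probability flow ODE with `scipy.integrate.solve_ivp`. Assumptions for illustration: a toy Gaussian data distribution with an analytic score in place of a learned network, and a VE-style geometric noise schedule:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy data distribution: an isotropic Gaussian at mu with unit variance.
mu = np.array([2.0, -1.0])
SIGMA_MIN, SIGMA_MAX = 0.01, 50.0

def sigma(t):
    # VE-style geometric noise schedule between SIGMA_MIN and SIGMA_MAX.
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t

def score(x, t):
    # Analytic score of the perturbed marginal N(mu, (1 + sigma(t)^2) I).
    return -(x - mu) / (1.0 + sigma(t) ** 2)

def probability_flow(t, x):
    # dx/dt = -0.5 * g(t)^2 * score(x, t), with g(t)^2 = d[sigma^2]/dt
    # for a VE SDE (zero drift).
    g2 = 2.0 * sigma(t) ** 2 * np.log(SIGMA_MAX / SIGMA_MIN)
    return -0.5 * g2 * score(x, t)

# Draw a cheap Gaussian initial condition at t=1 and integrate back to t ~ 0.
rng = np.random.default_rng(0)
x1 = rng.normal(scale=np.sqrt(1.0 + sigma(1.0) ** 2), size=2)
sol = solve_ivp(probability_flow, (1.0, 1e-3), x1, rtol=1e-5, atol=1e-5)
x0 = sol.y[:, -1]  # a sample near the data distribution around mu
```

Swapping the analytic `score` for a trained network recovers the black-box ODE sampler used in the score-sde repo.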

Aside from generating high-quality samples, score-based models also support exact likelihood computations, class-conditioned sampling and inpainting/colorization applications.

These computations leverage the probability flow ODE together with ideas from neural ODEs, such as the instantaneous change-of-variables formula. Many of these models made their debut generating realistic images in Denoising Diffusion Probabilistic Models.
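In particular, the instantaneous change-of-variables formula from neural ODEs expresses the log-likelihood as:

$$ \begin{align*} \log p_0(\mathbf{x}(0)) = \log p_T(\mathbf{x}(T)) + \int_0^T \nabla \cdot \tilde{\mathbf{f}}(\mathbf{x}(t), t) \, dt, \end{align*} $$

where $\tilde{\mathbf{f}}(\mathbf{x}, t) = \mathbf{f}(\mathbf{x}, t) - \frac{1}{2} g(t)^2 \nabla_\mathbf{x} \log p_t(\mathbf{x})$ is the drift of the probability flow ODE, and the divergence can be estimated efficiently with the Hutchinson trace estimator.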

Generating Custom Movie Posters with Score-based Models

For rough comparison to previous experiments, we apply this generative model to a corpus of 40K unlabeled theatrical posters augmented by horizontal reflection. For training, we package the posters into tfrecords using the dataset tooling found in the StyleGAN2-ada repo.

The repo trains score models using JAX while making use of scipy ODE solvers to generate samples. The authors offer detailed Colabs and pretrained models along with the configurations referenced in their ICLR 2021 conference paper.

This makes it easy to generate realistic samples of CIFAR10 categories:

cifar-10 example

Next, we try applying the high-resolution configuration configs/ve/ used to generate results from CelebA-HQ to our theatrical poster corpus. This entailed restricting batch sizes to fit the model and training samples into GPU memory. Unfortunately, without also shrinking the learning rate, this seemed to destabilize training:


Finally, using the smaller model of configs/vp/ddpm/ and reducing the image resolution, we found samples generated over the course of training like:


Training for 100 steps takes approximately 1.5 minutes while saturating two Titan RTXs, but less than 15 seconds on 8× larger batches using TPUs!


After scaling up training with an order of magnitude more images, and including genre labels for class-conditional training and generation, we find a qualitative improvement in the synthesized images. Compared to our experiments with StyleGAN2, we find greater variety in attributes like hair, gender, and facial expression.



Cascaded Diffusion Models extend this work using a sequence of score-based models to progressively sharpen and resolve details of images generated by earlier steps in a cascade.

Almost miraculously, by corrupting training samples through a diffusion process, we can learn to approximate the reverse time dynamics using ODE solvers to generate realistic samples from noise!