Diffusion Models (Part 1): Foundations and Principles

Diffusion models have become one of the most powerful tools in Artificial Intelligence (AI). They’re the engines behind some of today’s most advanced generative systems – from creating realistic images, audio, text, and videos to designing new molecules and medicines, and even modeling complex climate and environmental systems.

There are already plenty of great articles that dive into the details of diffusion models – and we’ll share some of our favorites along the way. In this series, we’ll keep things accessible: we focus on the core principles (in this post) and explore how diffusion models are being used in Earth and environmental sciences and why those applications are so promising (see Part 2).

Let’s get started!

• • •

What are generative models?

Generative models are AI systems that learn the underlying structure of existing data and use it to create new content that resembles the original.

What does this mean in practice? Suppose we have a dataset containing photos of dogs. A generative model can study all those images to learn what makes a picture look like a dog – the shapes, colors, textures, and relationships between pixels. Once trained, the model can then generate completely new, realistic images of dogs that did not exist in the original dataset.

Generative models are also probabilistic, i.e., they don’t always produce the same output. Instead, they can create many different versions of an image or dataset, all slightly varied, but still realistic. This makes them especially useful for creative tasks, predictive simulations, and risk-based scientific modeling.

“Creating noise from data is easy; creating data from noise is generative modeling.”
– Song Y. et al., (2020)

There are different types of generative models, such as Generative Adversarial Networks¹ (GANs), Variational Autoencoders² (VAEs), flow-based models³, and diffusion models⁴,⁵. Each type has its strengths and weaknesses, but diffusion models have recently shown outstanding performance in producing high-quality and realistic results. Their success largely comes from the ability to progressively refine noise, allowing diffusion models to capture complex data distributions and produce stable, high-fidelity results without the training instability common in other generative modeling approaches.

We focus on diffusion models in this series.

• • •

What are diffusion models?

Diffusion models are inspired by non-equilibrium thermodynamics – specifically, how particles spread out or “diffuse” over time. The core idea behind them is simple: we gradually corrupt (i.e., add noise to) clean data until it becomes completely random, then train a deep learning model to reverse this process and recover the original data.

Diffusion models are a class of generative models that learn to reverse a gradual noising process applied to data, enabling them to generate realistic samples from the underlying data distributions by iteratively denoising random noise.

In other words, diffusion models learn how to “undo” noise. Imagine taking a blurry or noisy satellite image and carefully sharpening it, one small step at a time, until continents and clouds slowly come back into focus. Each step removes a bit of noise, turning random patterns into something meaningful.

In principle, if we start from pure random noise, we should be able to keep applying the trained model until we obtain a sample that looks as if it were drawn from the training set. That’s it – and yet this simple idea works incredibly well in practice.

For a more intuitive explanation, check out this article – it provides an interactive, step-by-step introduction that makes diffusion models much easier to grasp.

Diffusion models come in different forms, depending on whether the diffusion process is modeled in discrete or continuous time, and whether noise is removed through probabilistic or deterministic dynamics.

A breakthrough approach is the Denoising Diffusion Probabilistic Model (DDPM)⁵, which performs diffusion in discrete time. It models the generative process as a reverse Markov chain, gradually denoising the sample through a fixed sequence of probabilistic transitions.

A Markov chain is a discrete-time stochastic process where the next state depends only on the current state.

Other diffusion formulations include DDPM-inspired variants such as Denoising Diffusion Implicit Models (DDIMs)⁶, which introduce a non-Markovian formulation that enables deterministic and faster sampling, and continuous-time score-based models⁷, which replace the discrete Markov chain with stochastic and ordinary differential equation perspectives to model the diffusion and denoising processes. More recent approaches further optimize efficiency by performing diffusion in a compressed latent space (e.g., Latent Diffusion Models⁸, LDMs), or by unifying diffusion with flow-based or implicit guidance techniques for improved controllability and speed.

We focus on the DDPM in this post since it provides the most basic foundation.

• • •

How do DDPMs work?

Now, let’s explore how DDPMs actually work. At their core, DDPMs involve two distinct stochastic processes in discrete time: a forward diffusion process, where noise is gradually added to data until it becomes purely random, and a reverse denoising process, where the model learns to remove that noise step by step to reconstruct the original data.

The Forward Process: Adding Noise

Suppose we have a real data sample $\mathbf{x}_0 \sim q(\mathbf{x})$. In the forward process, we gradually corrupt the data by adding small amounts of Gaussian noise over $T$ steps, producing a sequence of increasingly noisy samples $(\mathbf{x}_1, \dots, \mathbf{x}_T)$. The amount of noise added at each step $t$ is controlled by a predefined variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^T$, and each step is a Gaussian transition: \begin{equation} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I}) \end{equation} Because the process is Markovian, the full joint distribution factorizes as follows: \begin{equation} \begin{aligned} q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) &= q(\mathbf{x}_1,\dots,\mathbf{x}_T \mid \mathbf{x}_0) \\ &= q(\mathbf{x}_1 \mid \mathbf{x}_0) \ q(\mathbf{x}_2 \mid \mathbf{x}_1, \mathbf{x}_0) \ \dots \ q(\mathbf{x}_T \mid \mathbf{x}_{T-1}, \dots, \mathbf{x}_0) \quad \color{green}\small{\text{(chain rule of probability)}} \\ &= q(\mathbf{x}_1 \mid \mathbf{x}_0) \ {\color{red}q(\mathbf{x}_2 \mid \mathbf{x}_1)} \ \dots \ {\color{red}q(\mathbf{x}_T \mid \mathbf{x}_{T-1})} \quad \quad \quad \quad \quad \ \ \ \color{green}\small{\text{(Markov property)}} \\ &= \prod^T_{t=1} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \end{aligned} \end{equation}

Here, $\mathcal{N}(\cdot,\cdot)$ denotes a normal distribution. As $t$ increases, the sample $\mathbf{x}_t$ becomes progressively noisier. Eventually, when $T \rightarrow \infty$, $\mathbf{x}_T$ is indistinguishable from random noise. Mathematically, we can write each step of this process as follows: \begin{equation} \mathbf{x}_t = \sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\boldsymbol{\epsilon}_{t-1} \quad \quad \text{where } \boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \end{equation}
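The update can be tried out directly. Below is a minimal sketch in plain Python that works on a scalar sample instead of an image. The linear schedule from $10^{-4}$ to $0.02$ with $T = 1000$ follows the values used by Ho et al. (2020)⁵ and is an assumption here, as are the helper names `linear_beta_schedule` and `forward_step`.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return [beta_1, ..., beta_T], evenly spaced (an assumed schedule)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) for a scalar sample."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * eps

rng = random.Random(0)
T = 1000
betas = linear_beta_schedule(T)

# Push 1000 copies of the same clean value x_0 = 1.0 through the whole chain.
samples = []
for _ in range(1000):
    x = 1.0
    for beta_t in betas:
        x = forward_step(x, beta_t, rng)
    samples.append(x)

mean_T = sum(samples) / len(samples)
var_T = sum((s - mean_T) ** 2 for s in samples) / len(samples)
print(f"mean of x_T: {mean_T:.3f}, variance of x_T: {var_T:.3f}")
```

Although every chain starts from the same value, the final samples should scatter around zero with a variance close to one, which is what "indistinguishable from random noise" means here.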

Note that when two components are independent, the variance of their sum is simply the sum of their variances.

At each timestep $t$, we slightly perturb the sample by adding Gaussian noise with variance $\beta_t$, while also scaling the previous sample $\mathbf{x}_{t-1}$. This scaling is chosen so that the variance of the new sample $\mathbf{x}_{t}$ stays the same over time. In particular, since $\boldsymbol{\epsilon}_{t-1}$ is standard Gaussian, if $\mathbf{x}_{t-1}$ has zero mean and unit variance, then $\mathbf{x}_{t}$ will as well. Intuitively, this works because the contributions from the signal and noise terms balance each other: $\sqrt{1-\beta_t}^2 + \sqrt{\beta_t}^2=1$.

In theory, if we normalize the original sample $\mathbf{x}_{0}$ to have zero mean and unit variance, then the entire sequence $(\mathbf{x}_1, \dots, \mathbf{x}_T)$ will preserve these properties under the forward process. Because every step shrinks the signal a little and replaces it with fresh Gaussian noise, the contribution of $\mathbf{x}_0$ fades away and $\mathbf{x}_T$ approaches a standard Gaussian distribution as $T$ becomes sufficiently large.

This scaling ensures that the variance remains stable throughout the diffusion process.
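A short calculation shows how robust this is. As a sketch, treat each coordinate as a scalar and use the independence of $\mathbf{x}_{t-1}$ and $\boldsymbol{\epsilon}_{t-1}$ to track the variance from one step to the next: \begin{equation} \begin{aligned} \mathrm{Var}(\mathbf{x}_t) &= (1-\beta_t)\,\mathrm{Var}(\mathbf{x}_{t-1}) + \beta_t \\ \mathrm{Var}(\mathbf{x}_t) - 1 &= (1-\beta_t)\big(\mathrm{Var}(\mathbf{x}_{t-1}) - 1\big) = \Big(\prod_{i=1}^{t}(1-\beta_i)\Big)\big(\mathrm{Var}(\mathbf{x}_0) - 1\big) \end{aligned} \end{equation} A variance of exactly 1 is therefore a fixed point, and any other starting variance is pulled toward 1 because every factor $(1-\beta_i)$ is smaller than 1. For example, with a constant $\beta_i = 0.02$ and $\mathrm{Var}(\mathbf{x}_0) = 4$, the gap of 3 shrinks to $3 \times 0.98^{100} \approx 0.40$ after 100 steps.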

In practice, inputs are typically scaled to a bounded range (e.g., $[0,1]$ or $[-1,1]$), and this range must be known and consistent because the noise schedule is defined relative to the data’s scale.
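As an illustration, the sketch below maps 8-bit pixel intensities to $[-1,1]$ and back. The 8-bit input format and the helper names `to_model_range` and `to_pixel_range` are assumptions made for this example, not part of any particular library.

```python
def to_model_range(pixels):
    """Map 8-bit intensities in [0, 255] to floats in [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]

def to_pixel_range(values):
    """Invert to_model_range, clipping to the valid 8-bit range."""
    return [min(255, max(0, round((v + 1.0) * 127.5))) for v in values]

row = [0, 64, 128, 255]
scaled = to_model_range(row)
restored = to_pixel_range(scaled)
print(scaled)
print(restored)
```

The same mapping has to be used for training data and for interpreting generated samples, since the clipping in `to_pixel_range` assumes the model output lives in the range the model was trained on.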

Another nice property of the above process is that we can jump straight from the original sample $\mathbf{x}_0$ to any noised version of the forward diffusion process $\mathbf{x}_t$ using a reparameterization trick.

Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$, then we can write the following: \begin{equation} \begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\boldsymbol{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t} {(\underbrace{\color{red}\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}}\boldsymbol{\epsilon}_{t-2}}_{\mathbf{x}_{t-1}} )} + \sqrt{1 - \alpha_{t}}\boldsymbol{\epsilon}_{t-1} \\ &= {\color{red}\sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{\alpha_t (1-\alpha_{t-1})}\boldsymbol{\epsilon}_{t-2}} + \sqrt{1 - \alpha_{t}}\boldsymbol{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + {\color{red}\sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\boldsymbol{\epsilon}}_{t-2} } \\ &= \dots \\ &= \sqrt{\alpha_t \alpha_{t-1}\dots\alpha_1} \mathbf{x}_{0} + \sqrt{1 - \alpha_t \alpha_{t-1}\dots\alpha_1} \boldsymbol{\epsilon}_0 \\ &= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_0 \end{aligned} \end{equation}

Explanation in words: We unroll the update rule step by step, combining the noise terms along the way, so that $\mathbf{x}_t$ can be written directly in terms of $\mathbf{x}_0$. Note that since $\boldsymbol{\epsilon}_{t-2}, \boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, their weighted sum is also Gaussian with standard deviation $\sqrt{\alpha_t (1-\alpha_{t-1})+(1-\alpha_t)} = \sqrt{1-\alpha_t\alpha_{t-1}}$, and $\bar{\boldsymbol{\epsilon}}_{t-2} \sim \mathcal{N}(\mathbf{0},\mathbf{I}).$

The forward diffusion process $q$ can therefore be written in closed form as: \begin{equation} q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I}\big) \end{equation}
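The closed form can be checked numerically. The sketch below draws $\mathbf{x}_t$ in two ways, in one shot and by applying the single-step transition $t$ times, and compares the empirical mean and variance with the values the closed form predicts. It uses scalar data, and the short schedule ($T = 200$, linear from $10^{-4}$ to $0.02$) and all function names are assumptions chosen for illustration.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products abar_t = prod_{i<=t} (1 - beta_i)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed."""
    a = abar[t - 1]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)

def q_iterate(x0, t, betas, rng):
    """Draw x_t by applying the single-step transition t times."""
    x = x0
    for beta in betas[:t]:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

def mean_var(values):
    """Empirical mean and (population) variance of a list of floats."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

T, t, x0 = 200, 100, 2.0
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
abar = alpha_bars(betas)
rng = random.Random(0)

direct = [q_sample(x0, t, abar, rng) for _ in range(5000)]
stepped = [q_iterate(x0, t, betas, rng) for _ in range(5000)]

print("closed form:", mean_var(direct))
print("iterated   :", mean_var(stepped))
print("theory     :", (math.sqrt(abar[t - 1]) * x0, 1.0 - abar[t - 1]))
```

Both sampling routes should agree with the theoretical values up to Monte Carlo error, while the one-shot route needs a single random draw per sample. This is what makes training cheap: any timestep can be reached without simulating the chain.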

What this really tells us is that $\mathbf{x}_t$ never loses the original signal – it’s just being covered by more and more Gaussian noise. The goal of diffusion models is figuring out how to peel that noise away again.

• • •

The Reverse Process: Learning to Denoise

The reverse process works in the opposite direction – and this is where the magic happens. Instead of adding noise, the reverse process systematically removes it, step by step, gradually reconstructing the original data. Once trained, the model can start from pure Gaussian noise and iteratively apply this reverse procedure to generate new, realistic samples similar to $\mathbf{x}_0$.

In theory, the reverse diffusion process is defined as $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ – meaning that given a noisy sample $\mathbf{x}_t$, we would like to compute the distribution of the previous, slightly less noisy sample $\mathbf{x}_{t-1}$. However, this distribution is intractable in practice because it depends on the entire (unknown) data distribution.

Conditioning trick

Another useful trick in diffusion models is that the reverse transition becomes tractable if we condition on the original data $\mathbf{x}_{0}$. Since the forward process is fully known, we can apply Bayes’ rule to obtain a closed-form expression: \begin{equation} \begin{aligned} q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) &= q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) } \\ &= \mathcal{N}\big(\mathbf{x}_{t-1}; {\tilde{\boldsymbol{\mu}}}_t, {\tilde{\beta}_t} \mathbf{I}\big) \end{aligned} \end{equation} in which \begin{equation} \tilde{\boldsymbol{\mu}}_t = {\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big)} \quad \text{and} \quad \tilde{\beta}_t = {\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t} \label{eq:mubeta} \end{equation}

This derivation relies on the Markov property of the forward process – each state $\mathbf{x}_t$ depends on the previous $\mathbf{x}_{t-1}$, not on the original data $\mathbf{x}_0$. Formally, $q(\mathbf{x}_{t} \vert \mathbf{x}_{t-1}, \mathbf{x}_0) = q(\mathbf{x}_{t} \vert \mathbf{x}_{t-1})$. Since all the factors in the Bayes’ rule expression are Gaussian, their product, viewed as a function of $\mathbf{x}_{t-1}$, is again Gaussian. Using $\mathcal{N}\big(x; \mu, \sigma^2\big) \propto \exp \left( \frac{-(x-\mu)^2}{2\sigma^2}\right),$ we can complete the square in $\mathbf{x}_{t-1}$ and solve analytically for $\tilde{\boldsymbol{\mu}}_t$ and $\tilde{\beta}_t$ as shown above.
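The result can also be verified numerically. The sketch below computes the posterior for scalar data in two ways: once from the formulas for $\tilde{\boldsymbol{\mu}}_t$ and $\tilde{\beta}_t$, and once by multiplying the two Gaussians from Bayes' rule directly in precision form. The short list of $\alpha$ values, the test point, and the function names are arbitrary assumptions.

```python
import math

def posterior_closed_form(x_t, x0, alpha_t, abar_t, abar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the closed-form formulas."""
    beta_t = 1.0 - alpha_t
    # Noise linking x_0 to x_t: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
    eps = (x_t - math.sqrt(abar_t) * x0) / math.sqrt(1.0 - abar_t)
    mean = (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return mean, var

def posterior_by_bayes(x_t, x0, alpha_t, abar_prev):
    """Same posterior, obtained by multiplying two Gaussians in x_{t-1}."""
    beta_t = 1.0 - alpha_t
    # Prior q(x_{t-1} | x_0) = N(sqrt(abar_prev) x_0, 1 - abar_prev)
    prior_mean, prior_var = math.sqrt(abar_prev) * x0, 1.0 - abar_prev
    # Likelihood q(x_t | x_{t-1}) = N(sqrt(alpha_t) x_{t-1}, beta_t)
    precision = 1.0 / prior_var + alpha_t / beta_t
    var = 1.0 / precision
    mean = var * (prior_mean / prior_var + math.sqrt(alpha_t) * x_t / beta_t)
    return mean, var

alphas = [0.99, 0.98, 0.97, 0.96, 0.95]
t = 4  # 1-indexed timestep, must be at least 2
abar_prev = math.prod(alphas[: t - 1])
abar_t = abar_prev * alphas[t - 1]

closed = posterior_closed_form(0.7, 1.5, alphas[t - 1], abar_t, abar_prev)
bayes = posterior_by_bayes(0.7, 1.5, alphas[t - 1], abar_prev)
print("closed form:", closed)
print("Bayes rule :", bayes)
```

The two routes should agree to floating-point precision, which confirms that the formula written in terms of $\boldsymbol{\epsilon}_t$ is the same posterior expressed through the noise instead of through $\mathbf{x}_0$.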

If you’d like a quick walkthrough of this derivation, check out this Lil’Log post⁹. For a full step-by-step version, see page 12 of this article¹⁰ and chapter 2 of this book¹¹.

What does this conditioning mean? It means that during training, since we know $\mathbf{x}_0$, we can compute the exact noise that was added to get $\mathbf{x}_t$. This allows us to create training pairs $(\mathbf{x}_t, \mathbf{\epsilon})$, where $\mathbf{\epsilon}$ is the exact noise, and train a model to predict this noise.

Why do we need deep learning?

However, at generation time, we start from pure Gaussian noise and do not know $\mathbf{x}_0$. So we can no longer use the closed-form $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$.

This is where deep learning comes into play. We instead train a neural network $\mathbf{\epsilon}_\theta(\mathbf{x}_{t},t)$ to predict the noise added at each step. Once we have this noise estimate, we can recover an estimate of the clean signal and approximate the true reverse process: \begin{equation} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \approx q(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \end{equation}

At each diffusion step, the neural network predicts the noise inside the current noisy sample and then subtracts it accordingly.
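Recovering the clean signal from a noise estimate amounts to inverting the closed-form relation $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}$. A minimal scalar sketch follows; the function name `predict_x0` and the numbers are hypothetical, and the "network output" is simulated by the true noise with and without a small error.

```python
import math

def predict_x0(x_t, eps_hat, abar_t):
    """Invert x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps for x_0."""
    return (x_t - math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(abar_t)

x0, eps, abar_t = 0.8, -1.3, 0.4
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

exact = predict_x0(x_t, eps, abar_t)        # perfect noise estimate
rough = predict_x0(x_t, eps + 0.1, abar_t)  # noise estimate off by 0.1
print(exact, rough)
```

With the true noise the clean value is recovered exactly. An error in the noise estimate is scaled by $\sqrt{(1-\bar{\alpha}_t)/\bar{\alpha}_t}$, so the same error hurts more at noisier timesteps, which is one reason sampling proceeds in many small steps instead of a single jump.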

Since each step in the forward diffusion adds only a small amount of Gaussian noise, the reverse steps can also be modeled as Gaussian transitions:

\begin{equation} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N} \big( \mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) \big) \end{equation}

By applying this reverse transition for $t = T, \dots, 1$, we gradually transform pure noise $\mathbf{x}_T$ into a coherent, realistic sample that is similar to $\mathbf{x}_0$: \begin{equation} p_\theta(\mathbf{x}_{0:T}) = p_\theta(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \end{equation}

Note that the noise added during the forward diffusion is random and independent of the data, but separating it from the signal inside $\mathbf{x}_t$ is only possible if one knows what clean data looks like. As a result, by learning to predict and remove this noise accurately, the model implicitly learns the structure of the original data $\mathbf{x}_0$ and how to reconstruct it from noise.

• • •

Training diffusion models

The goal of training a diffusion model is to make it assign high probability to real data. Formally, we want to maximize the likelihood of samples from the true data distribution: \begin{equation} \max_{\theta} \mathbb{E}_{\mathbf{x}_0 \sim q(\mathbf{x}_0)} \Big[ \log p_{\theta}(\mathbf{x}_0) \Big] \end{equation} Here $q(\mathbf{x}_0)$ is the real data distribution, and $p_{\theta}(\mathbf{x}_0)$ is the distribution modeled by the neural network. However, the likelihood $p_{\theta}(\mathbf{x}_0)$ is intractable because the model generates data through a chain of latent noisy variables: \begin{equation} p_{\theta}(\mathbf{x}_0) = \int p_\theta(\mathbf{x}_{0:T}) d\mathbf{x}_{1:T} \end{equation}

To solve this, diffusion models use a classical idea from variational inference – the Evidence Lower Bound (ELBO).

We can’t compute the true likelihood, but we can compute a lower bound on it and train the model by maximizing that bound.

The ELBO is a computable lower bound on the true log-likelihood of data. We maximize it because raising the bound pushes up the likelihood of real data, in a way we can actually calculate.

\begin{equation} \begin{aligned} \underbrace{\log p_\theta(\mathbf{x}_0)}_{\text{Evidence}} &= \log \int p_\theta(\mathbf{x}_{0:T}) d\mathbf{x}_{1:T} = \log \int {\color{red} q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} \frac{p_\theta(\mathbf{x}_{0:T})}{\color{red}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})}} d\mathbf{x}_{1:T} \\ &= \log \mathbb{E}_{q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \Bigg[\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})}\Bigg] \quad \quad \color{green}\small{\text{By definition: } \mathbb{E}_{p(x)}[f(x)] = \int p(x)f(x)dx} \\ &\ge \underbrace{\mathbb{E}_{q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \Bigg[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} \Bigg]}_{\text{Evidence Lower Bound} \ (ELBO_{\theta})} \quad \quad \color{green}\small{\text{Apply Jensen’s inequality (log is concave)}} \\ &\rule{0pt}{2em} \color{blue}\small{\text{. . . We skip the details here for simplicity. At the end, we obtain:}} \\ &\ge \underbrace{\mathbb{E}_{q(\mathbf{x}_{1} \vert \mathbf{x}_0)} \Big[ \log p_{\theta}(\mathbf{x}_{0} \vert \mathbf{x}_{1})\Big] - D_\text{KL}\big(q(\mathbf{x}_{T}\vert\mathbf{x}_0) || p_\theta(\mathbf{x}_{T}) \big) - \sum_{t=2}^T \mathbb{E}_{q(\mathbf{x}_{t} \vert \mathbf{x}_0)} \Big[ D_\text{KL}\big(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t) \big)\Big]}_{\text{Variational Lower Bound ($L_{VLB}$)}} \end{aligned} \end{equation}

For a complete derivation, check out this video¹² and this article¹⁰.

Here, $D_\text{KL}(q \parallel p)$ is the Kullback–Leibler (KL) divergence, which measures how much one probability distribution differs from another. It is always non-negative, equals zero only when the two distributions match, and is in general not symmetric under the interchange of $p$ and $q$.
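For two univariate Gaussians the KL divergence has a closed form, which makes these properties easy to check. The sketch below uses that standard formula; the function name `kl_gauss` and the example parameters are assumptions for illustration.

```python
import math

def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """D_KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for scalar Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

identical = kl_gauss(0.0, 1.0, 0.0, 1.0)   # same distribution on both sides
forward = kl_gauss(0.0, 1.0, 1.0, 2.0)     # KL(q || p)
reverse = kl_gauss(1.0, 2.0, 0.0, 1.0)     # KL(p || q), arguments swapped
same_sigma = kl_gauss(0.0, 0.5, 1.0, 0.5)  # equal variances, means differ by 1

print(identical, forward, reverse, same_sigma)
```

The divergence vanishes for identical distributions, and swapping the arguments changes the value. When the two variances match, the expression collapses to the squared mean difference divided by $2\sigma^2$, here $1 / (2 \times 0.25) = 2$.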

To train the model, we instead minimize the negative log-likelihood bound (which is equivalent to maximizing the likelihood): \begin{equation} -\log p_\theta(\mathbf{x}_0) \le \underbrace{\mathbb{E}_{q(\mathbf{x}_1 \vert \mathbf{x}_0)} \big[- \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)\big]}_{L_0 \ (\text{reconstruction})} + \sum_{t=2}^T \underbrace{\mathbb{E}_{q(\mathbf{x}_t \vert \mathbf{x}_0)} \Big[ D_\text{KL}\big(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)\big)\Big]}_{L_{t-1} \ (\text{consistency})} + \underbrace{D_\text{KL}\big(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T)\big)}_{L_T \ (\text{prior matching})} \end{equation}

Here, $L_T$ is constant with respect to $\theta$ and can be ignored during training. The consistency term is a summation of many KL divergence terms. Every KL divergence term in $L_{VLB}$ compares two Gaussian distributions, so each one can be computed in closed form: \begin{equation} \begin{aligned} D_\text{KL}\big(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)\big) & = D_\text{KL}\big(\mathcal{N}(\mathbf{x}_{t-1}; \underbrace{\boldsymbol{\mu}_t(\mathbf{x}_t,\mathbf{x}_0)}_{\text{known}}, \underbrace{\sigma_t^2 \mathbf{I}}_{\text{known}}) \parallel \mathcal{N}(\mathbf{x}_{t-1}; \underbrace{\boldsymbol{\mu}_{\theta}(\mathbf{x}_t)}_{\text{neural net}}, \underbrace{\sigma_t^2 \mathbf{I}}_{\text{known}}) \big) \\ &=\frac{1}{2\sigma_t^2} \Vert \boldsymbol{\mu}_t(\mathbf{x}_t,\mathbf{x}_0) - \boldsymbol{\mu}_{\theta}(\mathbf{x}_t)\Vert^2 \end{aligned} \end{equation} The ELBO can be simplified to absorb the reconstruction term $L_0$ into the summation (see Theorem 2.7 in this book¹¹ for details). Recall from Eqn \eqref{eq:mubeta} that the mean can also be described as a function of $\mathbf{x}_{t}$ and $\boldsymbol{\epsilon}$. This ultimately reduces to a simple and intuitive loss: \begin{equation} \mathrm{ELBO}_{\theta}(\mathbf{x}_0,\boldsymbol{\epsilon}) = -\sum_{t=1}^T \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Bigg[\ \underbrace{\frac{(1-\alpha_t)^2}{2\sigma_t^2 (1-\bar{\alpha}_t) \alpha_t}}_{\text{known}} \times \Big\Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\underbrace{\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}}_{\mathbf{x}_t}, t) \Big\Vert^2 \Bigg] \end{equation}
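The weight in front of the squared error follows from a short substitution. As a sketch, assume the network mean is parameterized like $\tilde{\boldsymbol{\mu}}_t$ in Eqn \eqref{eq:mubeta}, with the predicted noise $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ in place of the true noise $\boldsymbol{\epsilon}$ (the parameterization used by Ho et al.⁵). The $\mathbf{x}_t$ terms then cancel in the difference of the two means: \begin{equation} \begin{aligned} \boldsymbol{\mu}_t(\mathbf{x}_t,\mathbf{x}_0) - \boldsymbol{\mu}_{\theta}(\mathbf{x}_t) &= \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon} \Big) - \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Big) \\ &= \frac{1 - \alpha_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}} \big( \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \boldsymbol{\epsilon} \big) \\ \frac{1}{2\sigma_t^2} \Vert \boldsymbol{\mu}_t(\mathbf{x}_t,\mathbf{x}_0) - \boldsymbol{\mu}_{\theta}(\mathbf{x}_t)\Vert^2 &= \frac{(1-\alpha_t)^2}{2\sigma_t^2 (1-\bar{\alpha}_t) \alpha_t} \Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Vert^2 \end{aligned} \end{equation} Matching the means is therefore the same as matching the noise, up to a factor that depends only on the schedule and on $t$.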

However, Ho et al. (2020)⁵ found that sample quality improves when training on the following simplified variant of the variational bound, which drops the per-timestep weight:

\begin{equation} \mathcal{L}_{\text{simple}}(\theta) := \mathbb{E}_{t,\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \Big\Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}, t) \Big\Vert ^2 \right] \end{equation} where $t$ is uniform between 1 and $T$. By minimizing this loss, the model learns to invert each step of the noising process. As training progresses, it becomes increasingly effective at removing noise from any noisy input $\mathbf{x}_t$, enabling it to generate realistic samples starting from pure random noise.

The training and sampling algorithms in DDPM can be summarized as below:

Training DDPM

1: repeat
2: $\mathbf{x}_0 \sim q(\mathbf{x}_0)$
3: $t\sim \text{Uniform}(\{1,...,T\})$
4: $\boldsymbol{\epsilon} \sim \mathcal{N}(0,\mathbf{I})$
5: take gradient descent step on:
$\nabla_{\theta} \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon},t) \|^2$
6: until converged
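The training loop can be sketched on a toy problem where the right answer is known. Everything below is an assumption made for illustration: the data are scalars drawn from $\mathcal{N}(0, 1)$, the schedule is a coarse one with $T = 10$, and the "network" is a single weight per timestep, $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) = w_t \mathbf{x}_t$, whose gradient can be written by hand. For this setup the best possible weight is $w_t = \sqrt{1 - \bar{\alpha}_t}$. The gradient is averaged over a small mini-batch for stability, whereas the listing above draws one sample per step.

```python
import math
import random

rng = random.Random(0)

# Coarse assumed schedule: T = 10 steps with beta going from 0.05 to 0.5.
T = 10
betas = [0.05 * (i + 1) for i in range(T)]
abar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    abar.append(running)

# Toy model: eps_theta(x_t, t) = w[t-1] * x_t, one scalar weight per timestep.
w = [0.0] * T
lr, batch_size = 0.02, 64

for step in range(3000):
    t = rng.randint(1, T)                  # line 3: t ~ Uniform({1, ..., T})
    a = abar[t - 1]
    grad = 0.0
    for _ in range(batch_size):
        x0 = rng.gauss(0.0, 1.0)           # line 2: x_0 ~ q(x_0), here N(0, 1)
        eps = rng.gauss(0.0, 1.0)          # line 4: eps ~ N(0, 1)
        x_t = math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps
        residual = eps - w[t - 1] * x_t
        grad += -2.0 * residual * x_t      # d/dw of (eps - w * x_t)^2
    w[t - 1] -= lr * grad / batch_size     # line 5: gradient descent step

for t in (1, 5, 10):
    print(t, round(w[t - 1], 3), round(math.sqrt(1.0 - abar[t - 1]), 3))
```

The learned weights should land close to $\sqrt{1 - \bar{\alpha}_t}$. A real model replaces the per-timestep weight with a neural network and the hand-written gradient with automatic differentiation, but the loop has the same shape.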

Sampling DDPM

1: $\mathbf{x}_T \sim \mathcal{N}(0,\mathbf{I})$
2: for $t=T,...,1$ do
3: $\mathbf{z} \sim \mathcal{N}(0,\mathbf{I})$ if $t>1$, else $\mathbf{z}=0$
4: $\mathbf{x}_{t-1} = \tfrac{1}{\sqrt{\alpha_t}}\bigl(\mathbf{x}_t - \tfrac{1 - \alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\bigr) + \sigma_t \mathbf{z}$
5: end for
6: return $\mathbf{x}_0$
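The sampling loop can be exercised without training anything if the data distribution is simple enough. In the sketch below, the data are assumed to be scalars from $\mathcal{N}(3, 0.5^2)$, for which the ideal noise prediction $\mathbb{E}[\boldsymbol{\epsilon} \mid \mathbf{x}_t]$ has a closed form; that formula stands in for a trained network. The linear schedule, the choice $\sigma_t^2 = \beta_t$ (one of the two options discussed by Ho et al.⁵), and all function names are likewise assumptions. The stand-in model takes $\bar{\alpha}_t$ directly instead of $t$.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear betas and cumulative products abar_t (assumed schedule)."""
    betas = [beta_start + i * (beta_end - beta_start) / (T - 1) for i in range(T)]
    abar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        abar.append(running)
    return betas, abar

def make_gaussian_eps_model(data_mean, data_std):
    """Stand-in for a trained network: E[eps | x_t] when x_0 ~ N(mean, std^2)."""
    def eps_model(x_t, abar_t):
        var_t = abar_t * data_std ** 2 + (1.0 - abar_t)
        shift = x_t - math.sqrt(abar_t) * data_mean
        return math.sqrt(1.0 - abar_t) * shift / var_t
    return eps_model

def ddpm_sample(eps_model, betas, abar, rng):
    """Run the sampling listing once for a scalar sample."""
    x = rng.gauss(0.0, 1.0)                        # line 1: x_T ~ N(0, 1)
    for t in range(len(betas), 0, -1):             # line 2: t = T, ..., 1
        beta_t, abar_t = betas[t - 1], abar[t - 1]
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # line 3
        eps_hat = eps_model(x, abar_t)
        mean = (x - beta_t / math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(1.0 - beta_t)
        x = mean + math.sqrt(beta_t) * z           # line 4 with sigma_t^2 = beta_t
    return x                                       # line 6: x_0

rng = random.Random(0)
betas, abar = make_schedule(1000)
eps_model = make_gaussian_eps_model(data_mean=3.0, data_std=0.5)

samples = [ddpm_sample(eps_model, betas, abar, rng) for _ in range(500)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(f"mean {sample_mean:.2f}, std {sample_std:.2f}")
```

Starting from standard Gaussian noise, the generated values should cluster around 3 with a spread near 0.5, matching the assumed data distribution. With real data no closed-form predictor exists, which is exactly the gap the trained network $\boldsymbol{\epsilon}_\theta$ fills.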

If you’d like to explore the complete mathematical derivation, check out these excellent resources⁹,¹⁰,¹¹,¹²,¹³,¹⁴. Each provides a detailed explanation of the theory and intuition behind diffusion models.

• • •

Summary

We gradually add noise to data (the forward process).
The model learns to remove the noise (the reverse process).
Training aims to maximize the likelihood of real data (or equivalently, minimize the negative log-likelihood).
The exact likelihood is intractable, so we instead maximize a lower bound on it (equivalently, minimize an upper bound on the negative log-likelihood).

In Part 2, we’ll dive into how diffusion models are applied in Earth 🌎 sciences.

References

1. Goodfellow, I. et al., 2014. Generative Adversarial Networks. Advances in Neural Information Processing Systems (NeurIPS), 27, pp.2672–2680.
2. Kingma, D.P. & Welling, M., 2014. Auto-Encoding Variational Bayes. Proceedings of the International Conference on Learning Representations (ICLR) 2014.
3. Kingma, D.P. & Dhariwal, P., 2018. Glow: Generative Flow with Invertible 1×1 Convolutions. Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, Canada.
4. Sohl-Dickstein, J. et al., 2015. Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of the 32$^{nd}$ International Conference on Machine Learning (ICML), PMLR, 37, pp.2256–2265.
5. Ho, J., Jain, A. & Abbeel, P., 2020. Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp.6840–6851.
6. Song, J., Meng, C. & Ermon, S., 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
7. Song, Y. et al., 2020. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
8. Rombach, R. et al., 2022. High-Resolution Image Synthesis with Latent Diffusion Models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.
9. https://lilianweng.github.io/posts/2021-07-11-diffusion-models
10. Luo, C., 2022. Understanding Diffusion Models: A Unified Perspective.
11. Chan, S., 2024. Tutorial on Diffusion Models for Imaging and Vision. Found. Trends. Comput. Graph. Vis. 16, 4, 322–471.
12. Özdemir H., Diffusion Models Explained with Math From Scratch.
13. https://theaisummer.com/diffusion-models
14. Lai, C.-H. et al., 2025. The Principles of Diffusion Models.
