In my last few articles I discussed generative deep learning algorithms, mostly related to text generation tasks. So I think it would be interesting to switch to generative algorithms for image generation now. There are plenty of deep learning models specialized for generating images, such as the Autoencoder, Variational Autoencoder (VAE), Generative Adversarial Network (GAN) and Neural Style Transfer (NST). I have posted some of my writing on these topics on Medium as well. I provide the links at the end of this article if you want to read them.
In today's article, I would like to discuss the so-called diffusion model, one of the most impactful models in the field of deep learning for image generation. The idea of this algorithm was first proposed in the paper titled Deep Unsupervised Learning using Nonequilibrium Thermodynamics written by Sohl-Dickstein et al. back in 2015 [1]. Their framework was then developed further by Ho et al. in 2020 in their paper titled Denoising Diffusion Probabilistic Models [2]. DDPM was later adapted by OpenAI and Google to develop DALL-E 2 and Imagen, both of which are known for their impressive ability to generate high-quality images.
How the Diffusion Model Works
Generally speaking, a diffusion model works by generating an image from noise. We can think of it like an artist transforming a splash of paint on a canvas into a beautiful artwork. In order to do so, the diffusion model needs to be trained first. There are two main steps to follow when training the model, namely forward diffusion and backward diffusion.
As you can see in the figure above, forward diffusion is a process where Gaussian noise is applied to the original image iteratively. We keep adding noise until the image is completely unrecognizable, at which point we can say that the image now lies in the latent space. Unlike Autoencoders and GANs, where the latent space typically has a lower dimension than the original image, the latent space in DDPM has exactly the same dimensionality as the original image. This noising process follows the principle of a Markov chain, meaning that the image at timestep t depends only on the image at timestep t-1. Forward diffusion is considered easy, since all we do is add some noise step by step.
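The forward process described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the constant noise rate `beta`, the toy four-pixel "image" and the helper name `forward_step` are made up for the example, while the update rule itself (scale the previous image by sqrt(1 - beta), then add Gaussian noise scaled by sqrt(beta)) is the one given in the DDPM paper [2].

```python
import math
import random

def forward_step(x_prev, beta, rng):
    """One forward-diffusion step: shrink the signal slightly and add Gaussian noise."""
    keep = math.sqrt(1.0 - beta)   # how much of the previous image survives
    noise = math.sqrt(beta)        # how much fresh noise is mixed in
    return [keep * p + noise * rng.gauss(0.0, 1.0) for p in x_prev]

rng = random.Random(0)             # seeded for reproducibility
x = [-1.0, -0.5, 0.5, 1.0]         # a toy "image" with four pixels
beta = 0.02                        # hypothetical constant noise rate

for t in range(1, 1001):
    # Markov property: x at step t is computed from x at step t-1 only
    x = forward_step(x, beta, rng)

print(len(x))  # the noised result has as many values as the original image
```

Note that the list never changes length, which mirrors the point made above: the fully noised image lives in a latent space of the same dimensionality as the input.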
The second training phase is called backward diffusion, where our objective is to remove the noise little by little until we obtain a clear image. This process follows the principle of a reverse Markov chain, where the image at timestep t-1 can only be obtained based on the image at timestep t. Such a denoising process is really difficult, since we need to guess which pixels are noise and which ones belong to the actual image content. Thus, we need to employ a neural network model to do so.
DDPM uses U-Net as the basis of the deep learning architecture for backward diffusion. However, instead of using the original U-Net model [4], we need to make several modifications to it so that it is more suitable for our task. Later on, I am going to train this model on the MNIST Handwritten Digit dataset [5], and we will see whether it can generate similar images.
Well, those are pretty much all the fundamental concepts you need to know about diffusion models for now. In the next sections we are going to get even deeper into the details while implementing the algorithm from scratch.
PyTorch Implementation
We are going to start by importing the required modules. In case you are not yet familiar with the imports below, both torch and torchvision are the libraries we will use for preparing the model and the dataset. Meanwhile, matplotlib and tqdm will help us display images and progress bars.
# Codeblock 1
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import tqdm
Now that the modules have been imported, the next thing to do is initialize some config parameters. Take a look at Codeblock 2 below for the details.
# Codeblock 2
IMAGE_SIZE = 28 #(1)
NUM_CHANNELS = 1 #(2)
BATCH_SIZE = 2
NUM_EPOCHS = 10
LEARNING_RATE = 0.001
NUM_TIMESTEPS = 1000 #(3)
BETA_START = 0.0001 #(4)
BETA_END = 0.02 #(5)
TIME_EMBED_DIM = 32 #(6)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu") #(7)
DEVICE
# Codeblock 2 Output
device(type='cuda')
On the lines marked with #(1) and #(2) I set IMAGE_SIZE and NUM_CHANNELS to 28 and 1, which match the image dimensions in the MNIST dataset. The BATCH_SIZE, NUM_EPOCHS, and LEARNING_RATE variables are fairly straightforward, so I don't think I need to explain them further.
At line #(3), the variable NUM_TIMESTEPS denotes the number of iterations in the forward and backward diffusion process. Timestep 0 is the state where the image is in its original form (the leftmost image in Figure 1). Since we set this parameter to 1000, timestep 999 is the state where the image is completely unrecognizable (the rightmost image in Figure 1). It is important to keep in mind that the choice of the number of timesteps involves a tradeoff between model accuracy and computational cost. If we assign a small value to NUM_TIMESTEPS, inference will be faster, but the resulting image might not be very good, since the model has fewer steps to refine the image in the backward diffusion stage. On the other hand, increasing NUM_TIMESTEPS slows down inference, but we can expect better output quality thanks to the more gradual denoising process, which results in a more precise reconstruction.
Next, the BETA_START (#(4)) and BETA_END (#(5)) variables control the amount of Gaussian noise added at each timestep, whereas TIME_EMBED_DIM (#(6)) determines the length of the feature vector that stores the timestep information. Lastly, at line #(7) I assign "cuda" to the DEVICE variable if PyTorch detects a GPU on our machine. I highly recommend running this project on a GPU, since training a diffusion model is computationally expensive. Note that the values set for NUM_TIMESTEPS, BETA_START and BETA_END are all adopted directly from the DDPM paper [2].
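To get a feel for what these three values imply, here is a minimal sketch in plain Python. It assumes the noise rates are spaced linearly between BETA_START and BETA_END, as in the DDPM paper [2]; the names `betas` and `alpha_bar` are illustrative and not taken from the code in this article. The running product of (1 - beta) tells us how much of the original image is still present after a given number of noising steps.

```python
import math

NUM_TIMESTEPS = 1000
BETA_START = 0.0001
BETA_END = 0.02

# Linearly spaced noise rates, one per timestep (assumed linear schedule).
step = (BETA_END - BETA_START) / (NUM_TIMESTEPS - 1)
betas = [BETA_START + step * t for t in range(NUM_TIMESTEPS)]

# alpha_bar[t] is the running product of (1 - beta); its square root is the
# factor by which the original image is scaled after t+1 noising steps.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

for t in (0, 499, 999):
    print(t, round(math.sqrt(alpha_bar[t]), 4))
```

The scaling factor starts very close to 1 and ends very close to 0, which is why the image is almost untouched at the first timestep and essentially pure noise at the last one.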
The whole implementation will be done in several steps: constructing the U-Net model, preparing the dataset, defining the noise scheduler for the diffusion process, training, and inference. We are going to discuss each of these stages in the following sub-sections.
The U-Net Architecture: Time Embedding
As I mentioned earlier, the basis of a diffusion model is U-Net. This architecture is used because its output layer is suitable for representing an image, which makes sense since it was originally introduced for image segmentation in the first place. The following figure shows what the original U-Net architecture looks like.

However, it is necessary to modify this architecture so that it can also take the timestep information into account. Not only that, since we will only use the MNIST dataset, we also need to make the model smaller. Just remember the rule of thumb in deep learning that simpler models are often more effective for simple tasks.
In the figure below I show the entire modified U-Net model. Here you can see that the time embedding tensor is injected into the model at every stage, which will later be done by element-wise summation, allowing the model to capture the timestep information. Next, instead of repeating each of the downsampling and upsampling stages four times like the original U-Net, in this case we will only repeat each of them twice. Additionally, it is worth noting that the stack of downsampling stages is also known as the encoder, whereas the stack of upsampling stages is often called the decoder.

Now let's start constructing the architecture by creating a class for generating the time embedding tensor, whose idea is similar to the positional embedding in the Transformer. See Codeblock 3 below for the details.
# Codeblock 3
class TimeEmbedding(nn.Module):
    def forward(self):
        # Column vector holding every timestep index: 0, 1, ..., NUM_TIMESTEPS-1
        time = torch.arange(NUM_TIMESTEPS, device=DEVICE).reshape(NUM_TIMESTEPS, 1)
        print(f"time\t\t: {time.shape}")

        i = torch.arange(0, TIME_EMBED_DIM, 2, device=DEVICE)
        denominator = torch.pow(10000, i/TIME_EMBED_DIM)
        print(f"denominator\t: {denominator.shape}")

        even_time_embed = torch.sin(time/denominator)  #(1)
        odd_time_embed  = torch.cos(time/denominator)  #(2)
        print(f"even_time_embed\t: {even_time_embed.shape}")
        print(f"odd_time_embed\t: {odd_time_embed.shape}")

        # Interleave the sine and cosine values along the last axis
        stacked = torch.stack([even_time_embed, odd_time_embed], dim=2)  #(3)
        print(f"stacked\t\t: {stacked.shape}")

        time_embed = torch.flatten(stacked, start_dim=1, end_dim=2)  #(4)
        print(f"time_embed\t: {time_embed.shape}")

        return time_embed
What we basically do in the above code is create a tensor of size NUM_TIMESTEPS × TIME_EMBED_DIM (1000×32), where every single row of this tensor contains the information for one timestep. Later on, each of the 1000 timesteps will be represented by a feature vector of length 32. The values in the tensor are obtained from the two equations in Figure 4. In Codeblock 3 above, these two equations are implemented at lines #(1) and #(2), each forming a tensor of size 1000×16. Next, these tensors are combined using the code at lines #(3) and #(4).
Here I also print out every single step in the above codeblock so that you can get a better understanding of what is actually being done in the TimeEmbedding class. If you still want more explanation of the above code, feel free to read my previous post about the Transformer, which you can access through the link at the end of this article. Once you have clicked the link, you can just scroll all the way down to the Positional Encoding section.

Now let's check whether the TimeEmbedding class works properly using the following testing code. The resulting output shows that it successfully produces a tensor of size 1000×32, which is exactly what we expected earlier.
# Codeblock 4
time_embed_test = TimeEmbedding()
out_test = time_embed_test()
# Codeblock 4 Output
time            : torch.Size([1000, 1])
denominator     : torch.Size([16])
even_time_embed : torch.Size([1000, 16])
odd_time_embed  : torch.Size([1000, 16])
stacked         : torch.Size([1000, 16, 2])
time_embed      : torch.Size([1000, 32])
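To make the interleaving of sine and cosine values more concrete, the same computation can be written for a single timestep in plain Python. This is a sketch for illustration only; the helper name `time_embedding_row` is hypothetical, and it is meant to mirror one row of the tensor produced by the TimeEmbedding class, not to replace it.

```python
import math

TIME_EMBED_DIM = 32

def time_embedding_row(t, dim=TIME_EMBED_DIM):
    """Embedding of one timestep t, interleaving sine and cosine values."""
    row = []
    for i in range(0, dim, 2):
        denominator = 10000 ** (i / dim)
        row.append(math.sin(t / denominator))  # even position
        row.append(math.cos(t / denominator))  # odd position
    return row

row_0 = time_embedding_row(0)
print(len(row_0))   # one value per embedding dimension
print(row_0[:4])    # sin(0) and cos(0) alternate at timestep 0
```

Each pair of positions shares one denominator, so the first pair oscillates quickly as t grows while the last pair changes very slowly. This gives every timestep a distinct pattern across the 32 positions.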
The U-Net Architecture: DoubleConv
If you take a closer look at the modified architecture, you will see that we actually have lots of repeating patterns, such as the ones highlighted in yellow boxes in the following figure.

Figure 5. The DoubleConv class [3].
These five yellow boxes share the same structure: each consists of two convolution layers, with the time embedding tensor injected right after the first convolution operation is performed. So, what we are going to do now is create another class named DoubleConv to reproduce this structure. Take a look at Codeblock 5a and 5b below to see how I do that.
# Codeblock 5a
class DoubleConv(nn.Module):
    def __init__(self, in_channels, out_channels):  #(1)
        super().__init__()

        self.conv_0 = nn.Conv2d(in_channels=in_channels,  #(2)
                                out_channels=out_channels,
                                kernel_size=3,
                                bias=False,
                                padding=1)
        self.bn_0 = nn.BatchNorm2d(num_features=out_channels)  #(3)

        self.time_embedding = TimeEmbedding()  #(4)
        self.linear = nn.Linear(in_features=TIME_EMBED_DIM,  #(5)
                                out_features=out_channels)

        self.conv_1 = nn.Conv2d(in_channels=out_channels,  #(6)
                                out_channels=out_channels,
                                kernel_size=3,
                                bias=False,
                                padding=1)
        self.bn_1 = nn.BatchNorm2d(num_features=out_channels)  #(7)

        self.relu = nn.ReLU(inplace=True)  #(8)
The two inputs of the __init__() method above give us the flexibility to configure the number of input and output channels (#(1)), so that the DoubleConv class can be used to instantiate all five yellow boxes simply by adjusting its input arguments. As the name suggests, here we initialize two convolution layers (lines #(2) and #(6)), each followed by a batch normalization layer and a ReLU activation function. Keep in mind that the two normalization layers need to be initialized separately (lines #(3) and #(7)), since each of them has its own trainable normalization parameters. Meanwhile, the ReLU activation function only needs to be initialized once (#(8)) because it contains no parameters, allowing it to be used multiple times in different parts of the network. At line #(4), we initialize the TimeEmbedding layer we created earlier, which will later be connected to a standard linear layer (#(5)). This linear layer is responsible for adjusting the dimension of the time embedding tensor so that the resulting output can be summed with the output of the first convolution layer in an element-wise manner.
Now let's take a look at Codeblock 5b below to better understand the flow of the DoubleConv block. Here you can see that the forward() method accepts two inputs: the raw image x and the timestep information t, as shown at line #(1). We first process the image with the first Conv-BN-ReLU sequence (#(2–4)). This Conv-BN-ReLU structure is commonly used when working with CNN-based models, even if the illustration does not explicitly show the batch normalization and ReLU layers. Apart from the image, we then take the rows of the embedding tensor that correspond to the timestep t of each image (#(5)) and pass them through the linear layer (#(6)). We still need to expand the dimensions of the resulting tensor using the code at line #(7) before performing element-wise summation at line #(8). Finally, we process the resulting tensor with the second Conv-BN-ReLU sequence (#(9–11)).
# Codeblock 5b
    def forward(self, x, t):  #(1)
        print(f'images\t\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')

        x = self.conv_0(x)  #(2)
        x = self.bn_0(x)    #(3)
        x = self.relu(x)    #(4)
        print(f'\nafter first conv\t: {x.size()}')

        # Pick the embedding row of each image's timestep
        time_embed = self.time_embedding()[t]  #(5)
        print(f'\ntime_embed\t\t: {time_embed.size()}')

        time_embed = self.linear(time_embed)  #(6)
        print(f'time_embed after linear\t: {time_embed.size()}')

        time_embed = time_embed[:, :, None, None]  #(7)
        print(f'time_embed expanded\t: {time_embed.size()}')

        x = x + time_embed  #(8)
        print(f'\nafter summation\t\t: {x.size()}')

        x = self.conv_1(x)  #(9)
        x = self.bn_1(x)    #(10)
        x = self.relu(x)    #(11)
        print(f'after second conv\t: {x.size()}')

        return x
To see if our DoubleConv implementation works properly, we are going to test it with Codeblock 6 below. Here I want to simulate the very first instance of this block, which corresponds to the leftmost yellow box in Figure 5. To do so, we need to set the in_channels and out_channels parameters to 1 and 64, respectively (#(1)). Next, we initialize two input tensors, namely x_test and t_test. The x_test tensor has the size of 2×1×28×28, representing a batch of two grayscale images of size 28×28 (#(2)). Keep in mind that this is just a dummy tensor of random values, which will be replaced with actual images from the MNIST dataset later in the training phase. Meanwhile, t_test is a tensor containing the timestep numbers of the corresponding images (#(3)). The values of this tensor are randomly selected from 0 up to NUM_TIMESTEPS - 1 (999), since the upper bound of torch.randint() is exclusive. Note that the datatype of this tensor must be an integer since the numbers will be used for indexing, as shown at line #(5) back in Codeblock 5b. Lastly, at line #(4) we pass both the x_test and t_test tensors to the double_conv_test layer.
By the way, I re-ran the previous codeblocks with the print() calls removed prior to running the following code so that the output looks neater.
# Codeblock 6
double_conv_test = DoubleConv(in_channels=1, out_channels=64).to(DEVICE) #(1)
x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE) #(2)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE) #(3)
out_test = double_conv_test(x_test, t_test) #(4)
# Codeblock 6 Output
images                  : torch.Size([2, 1, 28, 28]) #(1)
timesteps               : torch.Size([2]), tensor([468, 304], device='cuda:0') #(2)
after first conv        : torch.Size([2, 64, 28, 28]) #(3)
time_embed              : torch.Size([2, 32]) #(4)
time_embed after linear : torch.Size([2, 64])
time_embed expanded     : torch.Size([2, 64, 1, 1]) #(5)
after summation         : torch.Size([2, 64, 28, 28]) #(6)
after second conv       : torch.Size([2, 64, 28, 28]) #(7)
The shapes of our original input tensors can be seen at lines #(1) and #(2) in the above output. At line #(2) specifically, I also print out the two timesteps that were selected randomly. In this example we assume that the two images in the x tensor have already been noised with the noise levels of the 468th and 304th timesteps prior to being fed into the network. We can see that the shape of the image tensor x changes to 2×64×28×28 after being passed through the first convolution layer (#(3)). Meanwhile, the size of our time embedding tensor becomes 2×32 (#(4)), which is obtained by extracting rows 468 and 304 from the original embedding of size 1000×32. In order to allow element-wise summation to be performed (#(6)), we need to map the 32-dimensional time embedding vectors to 64 dimensions and expand their axes, resulting in a tensor of size 2×64×1×1 (#(5)) that can be broadcast to the 2×64×28×28 tensor. After the summation is done, we then pass the tensor through the second convolution layer, at which point the tensor dimension does not change at all (#(7)).
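The broadcasting step is easy to verify in isolation. The snippet below is a small sketch with stand-in tensors (a zero-filled feature map and a random embedding, both assumptions made for the demonstration) rather than the real outputs of the DoubleConv block. Because the feature map is all zeros, every spatial position of the sum should simply contain the per-channel embedding value.

```python
import torch

feature_map = torch.zeros(2, 64, 28, 28)   # stand-in for the first conv output
time_embed = torch.randn(2, 64)            # stand-in for the linear-layer output

# Append two singleton axes so the shape becomes 2x64x1x1.
expanded = time_embed[:, :, None, None]

# Broadcasting repeats each per-channel value over all 28x28 positions.
summed = feature_map + expanded

print(expanded.shape)
print(summed.shape)
```

In other words, each image in the batch receives one scalar offset per channel, and that offset is the same at every pixel of the channel.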
The U-Net Architecture: Encoder
Now that we have successfully implemented the DoubleConv block, the next step is to implement the so-called DownSample block. In Figure 6 below, this corresponds to the parts enclosed in the pink box.

Figure 6. The DownSample blocks [3].
The purpose of a DownSample block is to reduce the spatial dimension of an image, but it is important to note that at the same time it increases the number of channels. In order to achieve this, we can simply stack a DoubleConv block and a max pooling operation. In this case the pooling uses a 2×2 kernel with a stride of 2, which halves the height and width of the input. The implementation of this block can be seen in Codeblock 7 below.
# Codeblock 7
class DownSample(nn.Module):
    def __init__(self, in_channels, out_channels):  #(1)
        super().__init__()

        self.double_conv = DoubleConv(in_channels=in_channels,  #(2)
                                      out_channels=out_channels)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)  #(3)

    def forward(self, x, t):  #(4)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')

        convolved = self.double_conv(x, t)  #(5)
        print(f'\nafter double conv\t: {convolved.size()}')

        maxpooled = self.maxpool(convolved)  #(6)
        print(f'after pooling\t\t: {maxpooled.size()}')

        # convolved feeds the skip-connection, maxpooled feeds the next stage
        return convolved, maxpooled  #(7)
Here I set the __init__() method to take the number of input and output channels so that we can use it to create the two DownSample blocks highlighted in Figure 6 without needing to write them as separate classes (#(1)). Next, the DoubleConv and max pooling layers are initialized at lines #(2) and #(3), respectively. Remember that since the DoubleConv block accepts the image x and the corresponding timestep t as inputs, we also need to set the forward() method of this DownSample block so that it accepts both of them as well (#(4)). The information contained in x and t is then combined as the two tensors are processed by the double_conv layer, whose output is stored in the variable named convolved (#(5)). Afterwards, we perform the actual downsampling with the max pooling operation at line #(6), producing a tensor named maxpooled. It is important to note that both the convolved and maxpooled tensors are returned (#(7)). This is done because we will later pass maxpooled to the next downsampling stage, whereas the convolved tensor will be transferred directly to the upsampling stage in the decoder through a skip-connection.
Now let's test the DownSample class using Codeblock 8 below. The input tensors used here are exactly the same as the ones in Codeblock 6. Based on the resulting output, we can see that the pooling operation successfully converts the output of the DoubleConv block from 2×64×28×28 (#(1)) to 2×64×14×14 (#(2)), indicating that our DownSample class works properly.
# Codeblock 8
down_sample_test = DownSample(in_channels=1, out_channels=64).to(DEVICE)
x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)
out_test = down_sample_test(x_test, t_test)
# Codeblock 8 Output
original          : torch.Size([2, 1, 28, 28])
timesteps         : torch.Size([2]), tensor([468, 304], device='cuda:0')
after double conv : torch.Size([2, 64, 28, 28]) #(1)
after pooling     : torch.Size([2, 64, 14, 14]) #(2)
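The spatial sizes in this output can also be predicted without running the network. The sketch below uses the standard output-size formula for convolution and pooling layers (assuming a dilation of 1); the helper name `output_size` is made up for this example. It walks through both encoder stages with the kernel, stride and padding values used in this article.

```python
def output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 28
for stage in range(2):
    after_conv = output_size(size, kernel_size=3, stride=1, padding=1)  # DoubleConv convs
    after_pool = output_size(after_conv, kernel_size=2, stride=2)       # max pooling
    print(f"stage {stage}: {size} -> conv {after_conv} -> pool {after_pool}")
    size = after_pool
```

The 3×3 convolutions with a padding of 1 leave the size untouched, so the max pooling layer is the only place in the encoder where the resolution drops.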
The U-Net Architecture: Decoder
We need to introduce the so-called UpSample block in the decoder, which is responsible for restoring the tensor in the intermediate layers to the original image dimension. In order to maintain a symmetrical structure, the number of UpSample blocks must match that of the DownSample blocks. Take a look at Figure 7 below to see where the two UpSample blocks are placed.

Figure 7. The UpSample blocks [3].
Since both UpSample blocks are structurally identical, we can just define a single class for them, just like the DownSample class we created earlier. Take a look at Codeblock 9 below to see how I implement it.
# Codeblock 9
class UpSample(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()

        self.conv_transpose = nn.ConvTranspose2d(in_channels=in_channels,  #(1)
                                                 out_channels=out_channels,
                                                 kernel_size=2, stride=2)  #(2)
        self.double_conv = DoubleConv(in_channels=in_channels,  #(3)
                                      out_channels=out_channels)

    def forward(self, x, t, connection):  #(4)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')
        print(f'connection\t\t: {connection.size()}')

        x = self.conv_transpose(x)  #(5)
        print(f'\nafter conv transpose\t: {x.size()}')

        # Channel-wise concatenation with the skip-connection from the encoder
        x = torch.cat([x, connection], dim=1)  #(6)
        print(f'after concat\t\t: {x.size()}')

        x = self.double_conv(x, t)  #(7)
        print(f'after double conv\t: {x.size()}')

        return x
In the __init__() method, we use nn.ConvTranspose2d to upsample the spatial dimension (#(1)). Both the kernel size and the stride are set to 2 so that the output will be twice as large (#(2)). Next, the DoubleConv block is employed to reduce the number of channels, while at the same time incorporating the timestep information from the time embedding tensor (#(3)).
The flow of this UpSample class is a bit more complicated than that of the DownSample class. If we take a closer look at the architecture, we will see that we also have a skip-connection coming directly from the encoder. Thus, we need the forward() method to accept another argument in addition to the original image x and the timestep t, namely the skip-connection tensor connection (#(4)). The first thing we do inside this method is process the tensor x with the transposed convolution layer (#(5)). In fact, this layer not only upsamples the spatial size but also reduces the number of channels at the same time. However, the resulting tensor is then directly concatenated with connection in a channel-wise manner (#(6)), making it look as if no channel reduction had been performed. It is important to know that at this point the two tensors are merely concatenated, meaning that the information from the two is not yet combined. We finally feed the concatenated tensor to the double_conv layer (#(7)), allowing the two to share information with each other through the learnable parameters inside the convolution layers.
Codeblock 10 below shows how I test the UpSample class. The sizes of the tensors to be passed through are set according to the second upsampling block, i.e., the rightmost blue box in Figure 7.
# Codeblock 10
up_sample_test = UpSample(in_channels=128, out_channels=64).to(DEVICE)
x_test = torch.randn((BATCH_SIZE, 128, 14, 14)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)
connection_test = torch.randn((BATCH_SIZE, 64, 28, 28)).to(DEVICE)
out_test = up_sample_test(x_test, t_test, connection_test)
In the resulting output below, if we compare the input tensor (#(1)) with the final tensor shape (#(2)), we can clearly see that the number of channels is successfully reduced from 128 to 64, while at the same time the spatial dimension increases from 14×14 to 28×28. This essentially means that our UpSample class is now ready to be used in the main U-Net architecture.
# Codeblock 10 Output
original             : torch.Size([2, 128, 14, 14]) #(1)
timesteps            : torch.Size([2]), tensor([468, 304], device='cuda:0')
connection           : torch.Size([2, 64, 28, 28])
after conv transpose : torch.Size([2, 64, 28, 28])
after concat         : torch.Size([2, 128, 28, 28])
after double conv    : torch.Size([2, 64, 28, 28]) #(2)
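The same kind of bookkeeping works for the decoder. The sketch below applies the output-size formula of a transposed convolution (assuming no padding, no output padding and a dilation of 1) to both UpSample blocks and tracks the channel count through the concatenation. The helper name `transposed_output_size` and the `blocks` list are illustrative; the channel and size values are the ones used in this article's architecture.

```python
def transposed_output_size(size, kernel_size, stride, padding=0):
    """Spatial output size of a transposed convolution along one axis
    (assuming no output padding and a dilation of 1)."""
    return (size - 1) * stride - 2 * padding + kernel_size

# (input channels, output channels, input spatial size) of the two UpSample blocks
blocks = [(256, 128, 7), (128, 64, 14)]

for in_channels, out_channels, size in blocks:
    upsampled = transposed_output_size(size, kernel_size=2, stride=2)
    concatenated = out_channels + out_channels   # upsampled tensor + skip connection
    print(f"{in_channels}x{size}x{size} -> {out_channels}x{upsampled}x{upsampled} "
          f"-> concat {concatenated} channels -> DoubleConv {out_channels} channels")
```

This also shows why the DoubleConv inside UpSample is created with in_channels rather than out_channels as its input size: after concatenation the channel count is back to the value the block started with.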
The U-Net Architecture: Putting All Components Together
Once all U-Net components have been created, what we are going to do next is wrap them together into a single class. Take a look at Codeblock 11a and 11b below for the details.
# Codeblock 11a
class UNet(nn.Module):
    def __init__(self):
        super().__init__()

        self.downsample_0 = DownSample(in_channels=NUM_CHANNELS,  #(1)
                                       out_channels=64)
        self.downsample_1 = DownSample(in_channels=64,  #(2)
                                       out_channels=128)

        self.bottleneck = DoubleConv(in_channels=128,  #(3)
                                     out_channels=256)

        self.upsample_0 = UpSample(in_channels=256,  #(4)
                                   out_channels=128)
        self.upsample_1 = UpSample(in_channels=128,  #(5)
                                   out_channels=64)

        self.output = nn.Conv2d(in_channels=64,  #(6)
                                out_channels=NUM_CHANNELS,
                                kernel_size=1)
You can see in the __init__() method above that we initialize two downsampling (#(1–2)) and two upsampling (#(4–5)) blocks, whose numbers of input and output channels are set according to the architecture shown in the illustration. There are actually two additional components I haven't explained yet, namely the bottleneck (#(3)) and the output layer (#(6)). The former is essentially just a DoubleConv block, which acts as the main connection between the encoder and the decoder. Take a look at Figure 8 below to see which parts of the network belong to the bottleneck layer. Next, the output layer is a standard convolution layer which is responsible for turning the 64-channel image produced by the last UpSample stage into a single-channel one. This operation is done using a kernel of size 1×1, meaning that it combines information across all channels while operating independently at each pixel position.
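The claim about the 1×1 kernel can be checked numerically. In the sketch below, which uses a freshly initialized layer and a random stand-in tensor (both assumptions for the demonstration, not the trained model), one output pixel is recomputed by hand as a weighted sum over the 64 channels at the same position plus the bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv_1x1 = nn.Conv2d(in_channels=64, out_channels=1, kernel_size=1)
features = torch.randn(2, 64, 28, 28)

with torch.no_grad():
    out = conv_1x1(features)

    # Recompute one output pixel by hand: a weighted sum over the 64 channels
    # at that position, plus the bias.
    weights = conv_1x1.weight.view(64)          # weight shape is 1x64x1x1
    manual = (features[0, :, 3, 7] * weights).sum() + conv_1x1.bias[0]

print(out.shape)
print(torch.allclose(out[0, 0, 3, 7], manual, atol=1e-5))
```

No neighboring pixels enter the calculation, which is what "operating independently at each pixel position" means in practice.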

I guess the forward() method of the entire U-Net in the following codeblock is pretty straightforward, as what we essentially do here is pass the tensors from one layer to another. Just don't forget to include the skip connections between the downsampling and upsampling blocks.
# Codeblock 11b
    def forward(self, x, t):  #(1)
        print(f'original\t\t: {x.size()}')
        print(f'timesteps\t\t: {t.size()}, {t}')

        # Encoder: keep the pre-pooling tensors for the skip connections
        convolved_0, maxpooled_0 = self.downsample_0(x, t)
        print(f'\nmaxpooled_0\t\t: {maxpooled_0.size()}')

        convolved_1, maxpooled_1 = self.downsample_1(maxpooled_0, t)
        print(f'maxpooled_1\t\t: {maxpooled_1.size()}')

        x = self.bottleneck(maxpooled_1, t)
        print(f'after bottleneck\t: {x.size()}')

        # Decoder: each stage receives the matching encoder output
        upsampled_0 = self.upsample_0(x, t, convolved_1)
        print(f'upsampled_0\t\t: {upsampled_0.size()}')

        upsampled_1 = self.upsample_1(upsampled_0, t, convolved_0)
        print(f'upsampled_1\t\t: {upsampled_1.size()}')

        x = self.output(upsampled_1)
        print(f'final output\t\t: {x.size()}')

        return x
Now let's see whether we have correctly constructed the U-Net class above by running the following testing code.
# Codeblock 12
unet_test = UNet().to(DEVICE)
x_test = torch.randn((BATCH_SIZE, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE)
t_test = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)).to(DEVICE)
out_test = unet_test(x_test, t_test)
# Codeblock 12 Output
original         : torch.Size([2, 1, 28, 28]) #(1)
timesteps        : torch.Size([2]), tensor([468, 304], device='cuda:0')
maxpooled_0      : torch.Size([2, 64, 14, 14]) #(2)
maxpooled_1      : torch.Size([2, 128, 7, 7]) #(3)
after bottleneck : torch.Size([2, 256, 7, 7]) #(4)
upsampled_0      : torch.Size([2, 128, 14, 14])
upsampled_1      : torch.Size([2, 64, 28, 28])
final output     : torch.Size([2, 1, 28, 28]) #(5)
We can see in the above output that the two downsampling stages successfully convert the original tensor of size 1×28×28 (#(1)) into 64×14×14 (#(2)) and 128×7×7 (#(3)), respectively. This tensor is then passed through the bottleneck layer, causing its number of channels to expand to 256 without changing the spatial dimension (#(4)). Finally, we upsample the tensor twice before eventually shrinking the number of channels to 1 (#(5)). Based on this output, it looks like our model is working properly. Thus, it is now ready to be trained for our diffusion task.
Dataset Preparation
Now that we have successfully created the entire U-Net architecture, the next thing to do is prepare the MNIST Handwritten Digit dataset. Before actually loading it, we need to define the preprocessing steps first using transforms.Compose() from Torchvision, as shown at line #(1) in Codeblock 13. There are two things we do here: converting the images into PyTorch tensors, which also scales the pixel values from 0–255 to 0–1 (#(2)), and normalizing them so that the final pixel values range between -1 and 1 (#(3)). Next, we download the dataset using datasets.MNIST() (#(4)). In this case, we are going to take the images from the training data, hence we use train=True (#(5)). Don't forget to pass the transform variable we initialized earlier to the transform parameter (transform=transform) so that the images are automatically preprocessed as we load them (#(6)). Lastly, we need to employ DataLoader to load the images from mnist_dataset (#(7)). The arguments I use are meant to randomly pick BATCH_SIZE (2) images from the dataset in every iteration.
# Codeblock 13
transform = transforms.Compose([ #(1)
    transforms.ToTensor(), #(2)
    transforms.Normalize((0.5,), (0.5,)) #(3)
])

mnist_dataset = datasets.MNIST( #(4)
    root='./data',
    train=True, #(5)
    download=True,
    transform=transform #(6)
)

loader = DataLoader(mnist_dataset, #(7)
                    batch_size=BATCH_SIZE,
                    drop_last=True,
                    shuffle=True)
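To make the two preprocessing steps more concrete, here is a minimal sketch in plain Python (no Torchvision required) of the arithmetic that ToTensor() followed by Normalize((0.5,), (0.5,)) applies to a single 8-bit pixel. The helper name preprocess_pixel is hypothetical and exists only for this illustration.

```python
def preprocess_pixel(raw_value, mean=0.5, std=0.5):
    """Mimic ToTensor() + Normalize(mean, std) for one 8-bit pixel value."""
    scaled = raw_value / 255.0      # ToTensor(): 0..255 -> 0..1
    return (scaled - mean) / std    # Normalize(): 0..1 -> -1..1


# Black, mid-gray and white pixels
for raw in (0, 128, 255):
    print(raw, '->', round(preprocess_pixel(raw), 4))
```

The darkest pixel maps to -1 and the brightest to 1, which is the range we expect to see when inspecting a loaded batch.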
In the following codeblock, I try to load a batch of images from the dataset. In every iteration, loader provides both the images and the corresponding labels, hence we need to store them in two separate variables: images and labels.
# Codeblock 14
images, labels = next(iter(loader))
print('images\t\t:', images.shape)
print('labels\t\t:', labels.shape)
print('min value\t:', images.min())
print('max value\t:', images.max())
We can see in the resulting output below that the images tensor has the size of 2×1×28×28 (#(1)), indicating that two grayscale images of size 28×28 have been successfully loaded. Here we can also see that the length of the labels tensor is 2, which matches the number of loaded images (#(2)). Note that in this case the labels are going to be completely ignored. My plan here is that I just want the model to generate any digit it has previously seen in the training dataset without even knowing what digit it actually is. Finally, this output also shows that the preprocessing works properly, as the pixel values now range between -1 and 1.
# Codeblock 14 Output
images : torch.Size([2, 1, 28, 28]) #(1)
labels : torch.Size([2]) #(2)
min value : tensor(-1.)
max value : tensor(1.)
Run the following code if you want to see what the image we just loaded looks like.
# Codeblock 15
plt.imshow(images[0].squeeze(), cmap='gray')
plt.show()

Noise Scheduler
In this section we are going to talk about how the forward and backward diffusion are performed, a process that essentially involves adding or removing noise little by little at each timestep. It is important to know that we want the amount of noise to be controlled across all timesteps, such that in the forward diffusion the image is completely filled with noise by the last of the 1000 timesteps, while in the backward diffusion we get the completely clean image at timestep 0. Hence, we need something to control the noise amount for each timestep. Later in this section, I’m going to implement a class named NoiseScheduler to do so. This will probably be the most mathy section of this article, as I’ll display many equations here. But don’t worry about that, since we’ll focus on implementing these equations rather than discussing the mathematical derivations.
Now let’s take a look at the equations in Figure 10, which I will implement in the __init__() method of the NoiseScheduler class below.

Figure 10. The equations implemented in the __init__() method of the NoiseScheduler class [3].
# Codeblock 16a
class NoiseScheduler:
    def __init__(self):
        self.betas = torch.linspace(BETA_START, BETA_END, NUM_TIMESTEPS) #(1)
        self.alphas = 1. - self.betas
        self.alphas_cum_prod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_cum_prod = torch.sqrt(self.alphas_cum_prod)
        self.sqrt_one_minus_alphas_cum_prod = torch.sqrt(1. - self.alphas_cum_prod)
The above code works by creating several sequences of numbers, all of which are basically controlled by BETA_START (0.0001), BETA_END (0.02), and NUM_TIMESTEPS (1000). The first sequence we need to instantiate is betas itself, which is done using torch.linspace() (#(1)). What it essentially does is generate a 1-dimensional tensor of length 1000 running from 0.0001 to 0.02, where every single element in this tensor corresponds to a single timestep. The interval between consecutive elements is uniform, so the noise level rises steadily from one timestep to the next. With this betas tensor, we then compute alphas, alphas_cum_prod, sqrt_alphas_cum_prod and sqrt_one_minus_alphas_cum_prod based on the four equations in Figure 10. Later on, these tensors will act as the basis of how the noise is generated or removed during the diffusion process.
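For readers who want to see the numbers without PyTorch, the same four sequences can be sketched with the standard library alone. This is only an illustration under the assumption that the constants are 0.0001, 0.02 and 1000 as stated above; the variable names mirror the attributes in Codeblock 16a, and the list comprehension stands in for torch.linspace().

```python
import math

BETA_START, BETA_END, NUM_TIMESTEPS = 0.0001, 0.02, 1000

# Evenly spaced betas (stand-in for torch.linspace)
step = (BETA_END - BETA_START) / (NUM_TIMESTEPS - 1)
betas = [BETA_START + i * step for i in range(NUM_TIMESTEPS)]
alphas = [1.0 - b for b in betas]

# Running product of the alphas (stand-in for torch.cumprod)
alphas_cum_prod = []
running = 1.0
for a in alphas:
    running *= a
    alphas_cum_prod.append(running)

sqrt_alphas_cum_prod = [math.sqrt(v) for v in alphas_cum_prod]
sqrt_one_minus_alphas_cum_prod = [math.sqrt(1.0 - v) for v in alphas_cum_prod]

# Weight on the image vs. weight on the noise at a few timesteps
for t in (0, 499, 999):
    print(t, sqrt_alphas_cum_prod[t], sqrt_one_minus_alphas_cum_prod[t])
```

At t = 0 nearly all of the weight sits on the image, while at t = 999 the image weight is close to zero and the noise weight is close to one.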
Diffusion is usually done in a sequential manner. However, since every forward step only adds Gaussian noise according to a fixed schedule, we can rewrite the original equation in closed form so that we can obtain the noisy image at a specific timestep without having to iteratively add noise from the very beginning. Figure 11 below shows what the closed form of the forward diffusion looks like, where x₀ represents the original image while epsilon (ϵ) denotes an image made up of random Gaussian noise. We can think of this equation as a weighted combination, where we combine the clean image and the noise according to weights determined by the timestep, resulting in an image with a certain amount of noise.
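Written out in LaTeX, and reconstructed from the implementation in Codeblock 16b rather than copied from the original figure, the closed form reads as follows. The zero-based product index is my assumption, chosen to match the tensor indexing used in the code.

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \bar{\alpha}_t = \prod_{s=0}^{t} \alpha_s,
\qquad \alpha_s = 1 - \beta_s,
\qquad \epsilon \sim \mathcal{N}(0, I)
```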

The implementation of this equation can be seen in Codeblock 16b. In this forward_diffusion() method, x₀ and ϵ are denoted as original and noise. Here you need to keep in mind that these two input variables are images, whereas sqrt_alphas_cum_prod_t and sqrt_one_minus_alphas_cum_prod_t are scalars (one per image in the batch). Thus, we need to adjust the shape of these two scalars (#(1) and #(2)) so that the operation at line #(3) can be performed. The noisy_image variable is going to be the output of this function, and I assume the name is self-explanatory.
# Codeblock 16b
    def forward_diffusion(self, original, noise, t):
        sqrt_alphas_cum_prod_t = self.sqrt_alphas_cum_prod[t]
        sqrt_alphas_cum_prod_t = sqrt_alphas_cum_prod_t.to(DEVICE).view(-1, 1, 1, 1) #(1)
        sqrt_one_minus_alphas_cum_prod_t = self.sqrt_one_minus_alphas_cum_prod[t]
        sqrt_one_minus_alphas_cum_prod_t = sqrt_one_minus_alphas_cum_prod_t.to(DEVICE).view(-1, 1, 1, 1) #(2)
        noisy_image = sqrt_alphas_cum_prod_t * original + sqrt_one_minus_alphas_cum_prod_t * noise #(3)
        return noisy_image
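The weighted-combination idea is easier to see on a single pixel. The sketch below is a scalar, standard-library version of the closed-form step, assuming the same linear schedule (0.0001 to 0.02 over 1000 timesteps). The helper names alpha_bar and forward_diffusion_pixel are hypothetical and not part of the NoiseScheduler class.

```python
import math


def alpha_bar(t, beta_start=0.0001, beta_end=0.02, num_timesteps=1000):
    """Cumulative product of (1 - beta) up to and including timestep t."""
    step = (beta_end - beta_start) / (num_timesteps - 1)
    product = 1.0
    for i in range(t + 1):
        product *= 1.0 - (beta_start + i * step)
    return product


def forward_diffusion_pixel(original, noise, t):
    """Closed-form noising of a single pixel value."""
    a_bar = alpha_bar(t)
    return math.sqrt(a_bar) * original + math.sqrt(1.0 - a_bar) * noise


# A bright pixel (1.0) mixed with a fixed noise value (-0.5)
for t in (0, 50, 300, 999):
    print(t, round(forward_diffusion_pixel(1.0, -0.5, t), 4))
```

As t grows, the result moves away from the original pixel value and towards the noise value, which is the per-pixel version of the image becoming less and less recognizable.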
Now let’s talk about backward diffusion. In fact, this one is a bit more complicated than the forward diffusion since we need three more equations here. Before I give you these equations, let me show you the implementation first. See Codeblock 16c below.
# Codeblock 16c
    def backward_diffusion(self, current_image, predicted_noise, t): #(1)
        denoised_image = (current_image - (self.sqrt_one_minus_alphas_cum_prod[t] * predicted_noise)) / self.sqrt_alphas_cum_prod[t] #(2)
        denoised_image = 2 * (denoised_image - denoised_image.min()) / (denoised_image.max() - denoised_image.min()) - 1 #(3)
        current_prediction = current_image - ((self.betas[t] * predicted_noise) / (self.sqrt_one_minus_alphas_cum_prod[t])) #(4)
        current_prediction = current_prediction / torch.sqrt(self.alphas[t]) #(5)
        if t == 0: #(6)
            return current_prediction, denoised_image
        else:
            variance = (1 - self.alphas_cum_prod[t-1]) / (1. - self.alphas_cum_prod[t]) #(7)
            variance = variance * self.betas[t] #(8)
            sigma = variance ** 0.5
            z = torch.randn(current_image.shape).to(DEVICE)
            current_prediction = current_prediction + sigma*z
            return current_prediction, denoised_image
Later in the inference phase, the backward_diffusion() method will be called inside a loop that iterates NUM_TIMESTEPS (1000) times, starting from t = 999, continuing with t = 998, and so on all the way to t = 0. This function is responsible for removing the noise from the image iteratively based on the current_image (the image produced by the previous denoising step), the predicted_noise (the noise that U-Net predicts for current_image), and the timestep information t (#(1)). In each iteration, noise removal is done using the equation shown in Figure 12, which corresponds to lines #(4–5) in Codeblock 16c.
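Reconstructed from lines #(4–5) of Codeblock 16c (the original figure is not reproduced here), the noise removal step computes the following. The symbols μ_θ for the result and ε_θ(x_t, t) for predicted_noise are my own notation, chosen for this sketch.

```latex
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}
\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
```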

As long as we haven’t reached t = 0, we compute the variance based on the equation in Figure 13 (#(7–8)). This variance is then used to add back a controlled amount of noise, which accounts for the stochasticity of the backward diffusion process, since the noise removal equation in Figure 12 only gives a deterministic estimate. This is essentially also the reason that we don’t calculate the variance once we reach t = 0 (#(6)): we no longer need to add more noise since the image is supposed to be completely clean already.
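To get a feeling for how much noise is added back at each step, the variance from lines #(7–8) can be evaluated with plain Python. This is a sketch under the same schedule assumptions as before (0.0001 to 0.02 over 1000 timesteps). The helper sigma is hypothetical, and returning 0.0 at t = 0 is my shorthand for the branch at line #(6) that skips the noise entirely.

```python
import math

BETA_START, BETA_END, NUM_TIMESTEPS = 0.0001, 0.02, 1000
step = (BETA_END - BETA_START) / (NUM_TIMESTEPS - 1)
betas = [BETA_START + i * step for i in range(NUM_TIMESTEPS)]

# Cumulative product of (1 - beta), one value per timestep
alphas_cum_prod, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alphas_cum_prod.append(running)


def sigma(t):
    """Standard deviation of the noise added back at timestep t."""
    if t == 0:
        return 0.0  # final step: nothing is added
    variance = (1.0 - alphas_cum_prod[t - 1]) / (1.0 - alphas_cum_prod[t]) * betas[t]
    return math.sqrt(variance)


for t in (999, 500, 1, 0):
    print(t, sigma(t))
```

The injected noise is largest at the start of the backward process and shrinks as t approaches 0.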

Different from current_prediction, which aims to estimate the image of the previous timestep (xₜ₋₁), the objective of the denoised_image tensor is to reconstruct the original image (x₀). Because of these different objectives, we need a separate equation to compute denoised_image, which can be seen in Figure 14 below. The implementation of the equation itself is written at lines #(2–3), where line #(3) additionally rescales the result to the range of -1 to 1.
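The estimate at line #(2) is simply the closed-form forward equation solved for x₀. The scalar sketch below checks that round trip on one pixel. The function name estimate_original and the value 0.25 for the cumulative alpha are assumptions made purely for illustration, and the min-max rescaling from line #(3) is left out.

```python
import math


def estimate_original(noisy, predicted_noise, a_bar):
    """Invert the closed-form forward step to estimate the clean value x0."""
    return (noisy - math.sqrt(1.0 - a_bar) * predicted_noise) / math.sqrt(a_bar)


a_bar = 0.25          # hypothetical cumulative alpha at some timestep
x0, eps = 0.8, -1.3   # a clean pixel value and the noise mixed into it
x_t = math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps

# With a perfect noise prediction the clean pixel is recovered
print(estimate_original(x_t, eps, a_bar))
# With an imperfect prediction the estimate drifts away from x0
print(estimate_original(x_t, eps + 0.1, a_bar))
```

This also shows why the quality of denoised_image depends entirely on how well the U-Net predicts the noise.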

Now let’s test the NoiseScheduler class we created above. In the following codeblock, I instantiate a NoiseScheduler object and print out the attributes associated with it, which are all computed using the equations in Figure 10 based on the values stored in the betas attribute. Remember that the actual length of these tensors is NUM_TIMESTEPS (1000), but here I only print out the first 6 elements.
# Codeblock 17
noise_scheduler = NoiseScheduler()
print(f'betas\t\t\t\t: {noise_scheduler.betas[:6]}')
print(f'alphas\t\t\t\t: {noise_scheduler.alphas[:6]}')
print(f'alphas_cum_prod\t\t\t: {noise_scheduler.alphas_cum_prod[:6]}')
print(f'sqrt_alphas_cum_prod\t\t: {noise_scheduler.sqrt_alphas_cum_prod[:6]}')
print(f'sqrt_one_minus_alphas_cum_prod\t: {noise_scheduler.sqrt_one_minus_alphas_cum_prod[:6]}')
# Codeblock 17 Output
betas : tensor([1.0000e-04, 1.1992e-04, 1.3984e-04, 1.5976e-04, 1.7968e-04, 1.9960e-04])
alphas : tensor([0.9999, 0.9999, 0.9999, 0.9998, 0.9998, 0.9998])
alphas_cum_prod : tensor([0.9999, 0.9998, 0.9996, 0.9995, 0.9993, 0.9991])
sqrt_alphas_cum_prod : tensor([0.9999, 0.9999, 0.9998, 0.9997, 0.9997, 0.9996])
sqrt_one_minus_alphas_cum_prod : tensor([0.0100, 0.0148, 0.0190, 0.0228, 0.0264, 0.0300])
The above output indicates that our __init__() method works as expected. Next, we are going to test the forward_diffusion() method. If you go back to Codeblock 16b, you will see that forward_diffusion() accepts three inputs: the original image, the noise image and the timestep number. Let’s just use the image from the MNIST dataset we loaded earlier for the first input (#(1)) and random Gaussian noise of the exact same size for the second (#(2)). Run Codeblock 18 below to see what these two images look like.
# Codeblock 18
image = images[0] #(1)
noise = torch.randn_like(image) #(2)
plt.imshow(image.squeeze(), cmap='gray')
plt.show()
plt.imshow(noise.squeeze(), cmap='gray')
plt.show()

As we already have the image and the noise ready, what we need to do afterwards is to pass them to the forward_diffusion() method along with t. I actually tried to run Codeblock 19 below multiple times with t = 50, 100, 150, and so on up to t = 300. You can see in Figure 16 that the image becomes less clear as t increases. In this case, the image is going to be completely filled with noise when t is set to 999.
# Codeblock 19
noisy_image_test = noise_scheduler.forward_diffusion(image.to(DEVICE), noise.to(DEVICE), t=50)
plt.imshow(noisy_image_test[0].squeeze().cpu(), cmap='gray')
plt.show()

Unfortunately, we cannot test the backward_diffusion() method yet since this process requires a trained U-Net model. So, let’s just skip this part for now. I’ll show you how we can actually use this function later in the inference phase.
Training
As the U-Net model, the MNIST dataset, and the noise scheduler are ready, we can now prepare a function for training. Before we do that, I instantiate the model and the noise scheduler in Codeblock 20 below.
# Codeblock 20
model = UNet().to(DEVICE)
noise_scheduler = NoiseScheduler()
The entire training procedure is implemented in the train() function shown in Codeblock 21. Before doing anything, we first initialize the optimizer and the loss function, which in this case are Adam and MSE, respectively (#(1–2)). What we basically want to do here is to train the model such that it is able to predict the noise contained in the input image. Later on, the predicted noise is used as the basis of the denoising process in the backward diffusion stage. To actually train the model, we first need to perform forward diffusion using the code at line #(6). This noising process is done on the images tensor (#(3)) using the random noise generated at line #(4). The timestep t used for the noising is a random number between 0 and 999 drawn for each image in the batch (#(5)); note that torch.randint() excludes the upper bound NUM_TIMESTEPS (1000). This is essentially done because we want our model to see images of varying noise levels as an approach to improve generalization. Once the noisy images have been generated, we pass them through the U-Net model along with the chosen t (#(7)). The input t here is useful for the model as it indicates the current noise level in the image. Lastly, the loss function we initialized earlier is responsible for computing the difference between the actual noise and the noise predicted from the noisy image (#(8)). So, the objective of this training is basically to make the predicted noise as similar as possible to the noise we generated at line #(4).
# Codeblock 21
def train():
    optimizer = Adam(model.parameters(), lr=LEARNING_RATE) #(1)
    loss_function = nn.MSELoss() #(2)
    losses = []
    for epoch in range(NUM_EPOCHS):
        print(f'Epoch no {epoch}')
        for images, _ in tqdm(loader):
            optimizer.zero_grad()
            images = images.float().to(DEVICE) #(3)
            noise = torch.randn_like(images) #(4)
            t = torch.randint(0, NUM_TIMESTEPS, (BATCH_SIZE,)) #(5)
            noisy_images = noise_scheduler.forward_diffusion(images, noise, t).to(DEVICE) #(6)
            predicted_noise = model(noisy_images, t) #(7)
            loss = loss_function(predicted_noise, noise) #(8)
            losses.append(loss.item())
            loss.backward()
            optimizer.step()
    return losses
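When reading the loss values this function returns, it helps to know what an uninformative model would score. The sketch below uses only the standard library and two hypothetical predictors (one that always outputs zero, one that is off by a small constant). Because the target noise has unit variance, the always-zero predictor lands at an MSE of roughly 1, so a trained model should end up clearly below that.

```python
import random

random.seed(0)


def mse(predicted, target):
    """Mean squared error between two equally long lists of numbers."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


# Stand-in for the Gaussian noise tensor generated at line #(4)
noise = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Hypothetical untrained predictor that always outputs zero
zero_prediction = [0.0] * len(noise)
# Hypothetical predictor that is off by a small constant error
close_prediction = [n + 0.05 for n in noise]

print('always-zero baseline:', mse(zero_prediction, noise))
print('near-perfect guess  :', mse(close_prediction, noise))
```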
Now let’s run the above training function using the codeblock below. Sit back and relax while waiting for the training to complete. In my case, I used a Kaggle Notebook with an Nvidia P100 GPU turned on, and it took around 45 minutes to finish.
# Codeblock 22
losses = train()
If we take a look at the loss graph, it seems like our model learned quite well, as the value is generally decreasing over time with a rapid drop at the early stages and a more stable (yet still decreasing) trend in the later stages. So, I think we can expect good results later in the inference phase.
# Codeblock 23
plt.plot(losses)

Inference
At this point we already have our model trained, so we can now perform inference with it. Take a look at Codeblock 24 below to see how I implement the inference() function.
# Codeblock 24
def inference():
    denoised_images = [] #(1)
    with torch.no_grad(): #(2)
        current_prediction = torch.randn((64, NUM_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)).to(DEVICE) #(3)
        for i in tqdm(reversed(range(NUM_TIMESTEPS))): #(4)
            predicted_noise = model(current_prediction, torch.as_tensor(i).unsqueeze(0)) #(5)
            current_prediction, denoised_image = noise_scheduler.backward_diffusion(current_prediction, predicted_noise, torch.as_tensor(i)) #(6)
            if i%100 == 0: #(7)
                denoised_images.append(denoised_image)
    return denoised_images
At the line marked with #(1) I initialize an empty list which will be used to store the denoising result every 100 timesteps (#(7)). This will later allow us to see how the backward diffusion goes. The actual inference process is wrapped inside torch.no_grad() (#(2)). Remember that in diffusion models we generate images from completely random noise, which we assume to be the images at t = 999. To implement this, we can simply use torch.randn() as shown at line #(3). Here we initialize a tensor of size 64×1×28×28, indicating that we are about to generate 64 images simultaneously. Next, we write a for loop that iterates backwards starting from 999 to 0 (#(4)). Inside this loop, we feed the current image and the timestep as the input for the trained U-Net and let it predict the noise (#(5)). The actual backward diffusion is then performed at line #(6). At the end of the iteration, we should get new images similar to the ones we have in our dataset. Now let’s call the inference() function in the following codeblock.
# Codeblock 25
denoised_images = inference()
As the inference is complete, we can now see what the resulting images look like. Codeblock 26 below is used to display the first 42 images we just generated.
# Codeblock 26
fig, axes = plt.subplots(ncols=7, nrows=6, figsize=(10, 8))
counter = 0
for i in range(6):
    for j in range(7):
        axes[i,j].imshow(denoised_images[-1][counter].squeeze().detach().cpu().numpy(), cmap='gray') #(1)
        axes[i,j].get_xaxis().set_visible(False)
        axes[i,j].get_yaxis().set_visible(False)
        counter += 1
plt.show()

If we take a look at the above codeblock, you can see that the indexer [-1] at line #(1) indicates that we only display the images from the last iteration (which corresponds to timestep 0). This is the reason that the images you see in Figure 18 are all free from noise. I do acknowledge that this might not be the best result since not all the generated images are valid digits. But hey, this instead indicates that these images are not merely duplicates from the original dataset.
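To double-check which timesteps actually end up in denoised_images, the loop order and the condition at line #(7) of Codeblock 24 can be replayed without running the model. This is just a small sketch in plain Python; the name saved_timesteps is hypothetical.

```python
NUM_TIMESTEPS = 1000

# Same iteration order and condition as in inference()
saved_timesteps = [i for i in reversed(range(NUM_TIMESTEPS)) if i % 100 == 0]

print(len(saved_timesteps))                      # 10
print(saved_timesteps[0], saved_timesteps[-1])   # 900 0
```

So the list holds 10 snapshots, the first one taken at t = 900 and the last one at t = 0, which is why indexing with [-1] gives the final, fully denoised images.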
Here we can also visualize the backward diffusion process using Codeblock 27 below. You can see in the resulting output in Figure 19 that we initially start from complete random noise, which gradually disappears as we move to the right.
# Codeblock 27
fig, axes = plt.subplots(ncols=10, figsize=(24, 8))
sample_no = 0
timestep_no = 0
for i in range(10):
    axes[i].imshow(denoised_images[timestep_no][sample_no].squeeze().detach().cpu().numpy(), cmap='gray')
    axes[i].get_xaxis().set_visible(False)
    axes[i].get_yaxis().set_visible(False)
    timestep_no += 1
plt.show()

Ending
There are plenty of directions you can go from here. First, you might need to tweak the parameter configurations in Codeblock 2 if you want better results. Second, it is also possible to modify the U-Net model by implementing attention layers in addition to the stack of convolution layers we used in the downsampling and the upsampling stages. This does not guarantee better results, especially for a simple dataset like this, but it’s definitely worth trying. Third, you can also try to use a more complex dataset if you want to challenge yourself.
When it comes to practical applications, there are actually a lot of things you can do with diffusion models. The simplest one might be data augmentation. With a diffusion model, we can easily generate new images from a specific data distribution. For example, suppose we are working on an image classification project, but the number of images in the classes is imbalanced. To address this problem, it is possible for us to take the images from the minority class and use them to train a diffusion model. By doing so, we can ask the trained diffusion model to generate as many varied samples from that class as we want.
And well, that’s pretty much everything about the theory and the implementation of the diffusion model. Thanks for reading, I hope you learned something new today!
You can access the code used in this project via this link. Here are also the links to my previous articles about Autoencoder, Variational Autoencoder (VAE), Neural Style Transfer (NST), and Transformer.
References
[1] Jascha Sohl-Dickstein et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Arxiv. https://arxiv.org/pdf/1503.03585 [Accessed December 27, 2024].
[2] Jonathan Ho et al. Denoising Diffusion Probabilistic Models. Arxiv. https://arxiv.org/pdf/2006.11239 [Accessed December 27, 2024].
[3] Image originally created by author.
[4] Olaf Ronneberger et al. U-Net: Convolutional Networks for Biomedical Image Segmentation. Arxiv. https://arxiv.org/pdf/1505.04597 [Accessed December 27, 2024].
[5] Yann LeCun et al. The MNIST Database of Handwritten Digits. https://yann.lecun.com/exdb/mnist/ [Accessed December 30, 2024] (Creative Commons Attribution-Share Alike 3.0 license).
[6] Ashish Vaswani et al. Attention Is All You Need. Arxiv. https://arxiv.org/pdf/1706.03762 [Accessed September 29, 2024].