If we talk about object detection, one model that probably comes to mind first is YOLO. Well, at least for me, thanks to its popularity in the field of computer vision.
The very first version of this model, called YOLOv1, was released back in 2015 in the research paper titled "You Only Look Once: Unified, Real-Time Object Detection" [1]. Before YOLOv1 was invented, one of the state-of-the-art algorithms for object detection was R-CNN (Region-based Convolutional Neural Network), which uses a multi-stage mechanism to do the task. It first employs a selective search algorithm to create region proposals, then uses a CNN-based model to extract the features from all those regions, and finally classifies the detected objects using an SVM [2]. From this you can easily imagine how long the process takes just to perform object detection on a single image.
The motivation behind YOLO in the first place was speed. In fact, on top of achieving low computational complexity, the authors showed that their proposed deep learning model was also able to reach high accuracy. As this article is being written, YOLOv13 has just been published a few days ago [3]. But let's focus on its very first ancestor for now, so that you can see the beauty of this model from the time it first came out. This article discusses how YOLOv1 works and how to build this neural network architecture from scratch with PyTorch.
The Underlying Idea Behind YOLOv1
Before we get into the architecture, it is better if we understand the idea behind YOLOv1 first. Let's start with an example. Suppose we have a picture of a cat, and we are about to use it as a training sample for a YOLOv1 model, so we need to create a ground truth for it. It is mentioned in the original paper that we need to define the parameter S, which denotes the number of grid cells we divide our image into along each spatial dimension. By default, this parameter is set to 7, so we will have 7×7 = 49 cells in total. Take a look at Figure 1 below to better understand this idea.
Next, we need to determine which cell corresponds to the midpoint of the object. In the above case, the cat is located almost exactly at the center of the image, hence the midpoint must lie in cell (3, 3). Later in the inference phase, we can think of this cell as the one responsible for predicting the cat. Now taking a closer look at the cell, we need to determine the exact position of the midpoint. Here you can see that along the vertical axis it is located exactly in the middle, but along the horizontal axis it is slightly shifted to the left of the middle. So, if I were to approximate, the coordinate would be (0.4, 0.5). This coordinate value is relative to the cell and is normalized to the range of 0 to 1. It is worth noting that the (x, y) coordinate of the midpoint should neither be less than 0 nor greater than 1, since a value outside this range would mean the midpoint lies in another cell. Meanwhile, the width w and the height h of the bounding box are roughly 2.4 and 3.2, respectively. These numbers are relative to the cell size, meaning that if the object is bigger than the cell, the value will be greater than 1. Later on, when we create a ground truth for an image, we need to store all these x, y, w and h values in the so-called target vector.
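To make this arithmetic concrete, below is a minimal sketch of how the cell index and the normalized x, y, w and h could be computed. Note that this snippet is not part of the original paper or code; the pixel coordinates of the cat's bounding box are made-up values that roughly reproduce the numbers above.

# A minimal sketch (not from the original article) of the ground-truth arithmetic.
IMG_SIZE = 448
S = 7
CELL_SIZE = IMG_SIZE / S   # 64 pixels per cell

# hypothetical bounding box of the cat in absolute pixels (xmin, ymin, xmax, ymax)
xmin, ymin, xmax, ymax = 141, 122, 294, 326

mid_x = (xmin + xmax) / 2            # midpoint of the box in pixels
mid_y = (ymin + ymax) / 2

col = int(mid_x // CELL_SIZE)        # which cell the midpoint falls into
row = int(mid_y // CELL_SIZE)

x = (mid_x % CELL_SIZE) / CELL_SIZE  # midpoint position relative to the cell (0 to 1)
y = (mid_y % CELL_SIZE) / CELL_SIZE
w = (xmax - xmin) / CELL_SIZE        # box size relative to the cell (can exceed 1)
h = (ymax - ymin) / CELL_SIZE

print(row, col, round(x, 2), round(y, 2), round(w, 2), round(h, 2))
# 3 3 0.4 0.5 2.39 3.19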
Target Vector
The length of the target vector itself is 25 for each cell, in which the first 20 elements (index 0 to 19) store the class of the object in the form of one-hot encoding. This is mainly because YOLOv1 was originally trained on the PASCAL VOC dataset, which has that number of classes. Next, index 20 is used to store the confidence of the bounding box prediction, which during training is set to 1 whenever an object midpoint lies within the cell. Finally, the (x, y) coordinates of the midpoint are placed at indices 21 and 22, while w and h are stored at indices 23 and 24. The illustration in Figure 2 below displays what the target vector for cell (3, 3) looks like.

Again, keep in mind that the above target vector only corresponds to a single cell. To create the ground truth for the entire image, we need a bunch of similar vectors concatenated, forming the so-called target tensor as shown in Figure 3. Note that the class probabilities as well as the bounding box confidences, locations, and sizes from all other cells are set to zero because there is no other object appearing in the image.

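If you prefer to see this in code, here is a rough sketch of how the full target tensor for the cat image could be constructed. Keep in mind this is just my illustration: the (S, S, 25) layout follows the discussion above, and I arbitrarily pick index 7 as the cat class since the exact ordering depends on how you encode the PASCAL VOC labels.

import torch

# A rough sketch (not from the original article) of the full target tensor.
S, C = 7, 20
target = torch.zeros(S, S, C + 5)   # everything defaults to zero

row, col = 3, 3                     # cell responsible for the cat
x, y, w, h = 0.4, 0.5, 2.4, 3.2     # values from the earlier discussion

target[row, col, 7]  = 1.0          # one-hot class (assumed index for "cat")
target[row, col, 20] = 1.0          # bounding box confidence
target[row, col, 21:25] = torch.tensor([x, y, w, h])

print(target.shape)                 # torch.Size([7, 7, 25])
print(target[3, 3])                 # the only non-zero vector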
Prediction Vector
The prediction vector is quite a bit different. If the target vector consists of 25 elements, the prediction vector consists of 30. This is because by default YOLOv1 predicts two bounding boxes for the same object during inference. Thus, we need 5 more elements to store the information about the second bounding box generated by the model. Despite predicting two bounding boxes, later we will only take the one with the higher confidence.

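As an illustration, the snippet below shows how the higher-confidence box could be picked from a single 30-element prediction vector. This is only a sketch based on my reading of the layout, where the second box simply appends another confidence and (x, y, w, h) group at indices 25 to 29.

import torch

# A small sketch (my own, not from the original article) of picking the better box.
pred = torch.randn(30)                 # dummy prediction vector for one cell

class_probs = pred[:20]
conf1, box1 = pred[20], pred[21:25]    # first box: confidence + (x, y, w, h)
conf2, box2 = pred[25], pred[26:30]    # second box, assumed to follow the same pattern

best_box  = box1 if conf1 >= conf2 else box2
best_conf = torch.maximum(conf1, conf2)

print(best_conf, best_box)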
These unusual target and prediction vector dimensions required the authors to rethink the loss function. For regression problems, we typically use MAE, MSE or RMSE, while for classification tasks we usually use cross-entropy loss. But YOLOv1 is more than just a regression and classification problem, considering that we have both continuous (bounding box) and discrete (class) values in the vector representation. For this reason, the authors created a new loss function specialized for this model, as shown in Figure 5. This loss function is quite complex (you can see that, right?), so I decided to cover it in a separate article because there are a lot of things to explain about it. Stay tuned, I will publish it very soon.

The YOLOv1 Architecture
Just like typical earlier computer vision models, YOLOv1 uses a CNN-based architecture as the backbone of the model. It comprises 24 convolution layers stacked according to the structure in Figure 6. If you take a closer look at the figure, you will notice that the output layer produces a tensor of shape 30×7×7. This indicates that every single cell has its corresponding prediction vector of length 30 containing the class and the bounding box information of the detected object, which matches exactly with our earlier discussion.

Well, I think I have covered all the fundamentals of YOLOv1, so now let's start implementing the architecture from scratch with PyTorch. Before doing anything, we first need to import the required modules and initialize the parameters S, B, and C. See Codeblock 1 below.
# Codeblock 1
import torch
import torch.nn as nn
S = 7
B = 2
C = 20
The three parameters I initialized above are the default values given in the paper, in which S represents the number of grid cells along the horizontal and vertical axes, B denotes the number of bounding boxes predicted by each cell, and C is the number of classes available in the dataset. Since we use S=7 and B=2, our YOLOv1 will produce 7×7×2 = 98 bounding boxes in total for each image.
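Just as a quick sanity check of these numbers (this little snippet is not part of the original code), we can let Python do the arithmetic:

# Each cell predicts a vector of length C + B*5, and there are S*S cells in total.
print(S * S * B)            # 98 bounding boxes per image
print(C + B * 5)            # 30 elements per cell
print((C + B * 5) * S * S)  # 1470, the length of the flattened output we will see later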
The Building Block
Subsequent, we’re going to create the ConvBlock class, wherein it incorporates a single convolution layer (line #(1)), a leaky ReLU activation perform (#(2)), and an non-compulsory maxpooling layer (#(3)) as proven in Codeblock 2.
# Codeblock 2
class ConvBlock(nn.Module):
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size,
                 stride,
                 padding,
                 maxpool_flag=False):
        super().__init__()
        self.maxpool_flag = maxpool_flag
        self.conv = nn.Conv2d(in_channels=in_channels,    #(1)
                              out_channels=out_channels,
                              kernel_size=kernel_size,
                              stride=stride,
                              padding=padding)
        self.leaky_relu = nn.LeakyReLU(negative_slope=0.1)    #(2)
        if self.maxpool_flag:
            self.maxpool = nn.MaxPool2d(kernel_size=2,    #(3)
                                        stride=2)

    def forward(self, x):
        print(f'original\t: {x.size()}')
        x = self.conv(x)
        print(f'after conv\t: {x.size()}')
        x = self.leaky_relu(x)
        print(f'after leaky relu: {x.size()}')
        if self.maxpool_flag:
            x = self.maxpool(x)
            print(f'after maxpool\t: {x.size()}')
        return x
In modern architectures, we typically use the Conv-BN-ReLU structure, but at the time YOLOv1 was created, batch normalization was not quite popular just yet, since it came out only a few months before YOLOv1. So, I guess this is probably the reason the authors did not utilize this normalization layer. Instead, the network only uses a stack of convolutions and leaky ReLUs throughout.
Just a quick refresher: leaky ReLU is an activation function similar to the standard ReLU, except that negative values are multiplied by a small number instead of being zeroed out. In the case of YOLOv1, we set the multiplier to 0.1 (#(2)) so that it can still preserve a little bit of the information contained in the negative input values.

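If you want to see this behavior directly, here is a tiny demo (again, not part of the original code, and reusing the torch and nn imports from Codeblock 1) comparing leaky ReLU with the standard ReLU:

# LeakyReLU(0.1) scales negative values by 0.1 instead of zeroing them out.
act = nn.LeakyReLU(negative_slope=0.1)
x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

print(act(x))        # tensor([-0.2000, -0.0500,  0.0000,  1.0000,  3.0000])
print(nn.ReLU()(x))  # tensor([0., 0., 0., 1., 3.])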
Now that the ConvBlock class has been defined, I am going to test it just to check whether it works properly. In Codeblock 3 below I implement the very first layer of the network and pass a dummy tensor through it. You can see in the codeblock that in_channels is set to 3 (#(1)) and out_channels is set to 64 (#(2)) because we want this initial layer to accept an RGB image as the input and return a 64-channel feature map. The size of the kernel is 7×7 (#(3)), hence we need to set the padding to 3 (#(5)). Normally, this configuration allows us to preserve the spatial dimension of the image, but since we use stride=2 (#(4)), this padding size ensures that the image is exactly halved instead. Next, if you go back to Figure 6, you will notice that some conv layers are followed by a max-pooling layer and some others are not. Since the first convolution is followed by max-pooling, we need to set the maxpool_flag parameter to True (#(6)).
# Codeblock 3
convblock = ConvBlock(in_channels=3,      #(1)
                      out_channels=64,    #(2)
                      kernel_size=7,      #(3)
                      stride=2,           #(4)
                      padding=3,          #(5)
                      maxpool_flag=True)  #(6)

x = torch.randn(1, 3, 448, 448)  #(7)
out = convblock(x)
Afterwards, we can simply generate a tensor of random values with dimensions 1×3×448×448 (#(7)), which simulates a batch containing a single RGB image of size 448×448, and then pass it through the layer. You can see in the resulting output below that our convolution layer successfully increased the number of channels to 64 and halved the spatial dimension to 224×224. The halving happened once more, all the way down to 112×112, thanks to the max-pooling layer.
# Codeblock 3 Output
original         : torch.Size([1, 3, 448, 448])
after conv       : torch.Size([1, 64, 224, 224])
after leaky relu : torch.Size([1, 64, 224, 224])
after maxpool    : torch.Size([1, 64, 112, 112])
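In case you want to verify that halving yourself, the standard convolution output-size formula can be applied. The small helper below is my own addition, not something from the paper:

# out = floor((in + 2*padding - kernel_size) / stride) + 1
def conv_out_size(in_size, kernel_size, stride, padding):
    return (in_size + 2 * padding - kernel_size) // stride + 1

print(conv_out_size(448, kernel_size=7, stride=2, padding=3))  # 224
print(224 // 2)  # 112, after the 2x2 maxpool with stride 2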
The Backbone
The next thing we are going to do is create a sequence of ConvBlocks to build the entire backbone of the network. In case you are not familiar with the term backbone, in this case it is essentially everything before the two fully-connected layers (refer to Figure 6). Now take a look at Codeblocks 4a and 4b below to see how I define the Backbone class.
# Codeblock 4a
class Backbone(nn.Module):
    def __init__(self):
        super().__init__()

        # in_channels, out_channels, kernel_size, stride, padding, maxpool_flag
        self.stage0 = ConvBlock(3, 64, 7, 2, 3, maxpool_flag=True)    #(1)
        self.stage1 = ConvBlock(64, 192, 3, 1, 1, maxpool_flag=True)  #(2)

        self.stage2 = nn.ModuleList([
            ConvBlock(192, 128, 1, 1, 0),
            ConvBlock(128, 256, 3, 1, 1),
            ConvBlock(256, 256, 1, 1, 0),
            ConvBlock(256, 512, 3, 1, 1, maxpool_flag=True)           #(3)
        ])

        self.stage3 = nn.ModuleList([])
        for _ in range(4):
            self.stage3.append(ConvBlock(512, 256, 1, 1, 0))
            self.stage3.append(ConvBlock(256, 512, 3, 1, 1))
        self.stage3.append(ConvBlock(512, 512, 1, 1, 0))
        self.stage3.append(ConvBlock(512, 1024, 3, 1, 1, maxpool_flag=True))  #(4)

        self.stage4 = nn.ModuleList([])
        for _ in range(2):
            self.stage4.append(ConvBlock(1024, 512, 1, 1, 0))
            self.stage4.append(ConvBlock(512, 1024, 3, 1, 1))
        self.stage4.append(ConvBlock(1024, 1024, 3, 1, 1))
        self.stage4.append(ConvBlock(1024, 1024, 3, 2, 1))            #(5)

        self.stage5 = nn.ModuleList([])
        self.stage5.append(ConvBlock(1024, 1024, 3, 1, 1))
        self.stage5.append(ConvBlock(1024, 1024, 3, 1, 1))
What we do in the above codeblock is instantiate ConvBlock instances according to the architecture given in the paper. There are a few things I want to emphasize here. First, the term stage I use in the code is not explicitly mentioned in the paper; I simply use that word to describe the six groups of convolutional layers in Figure 6. Second, notice that we need to set maxpool_flag to True for the last ConvBlock of each of the first four groups to perform spatial downsampling (#(1–4)). For the fifth group, the downsampling is done by setting the stride of the last convolution layer to 2 (#(5)). Third, Figure 6 does not mention the padding size of the convolution layers, so we need to work it out ourselves. There is indeed a specific formula to find the padding size based on the given kernel size, but I feel like it is much easier to just memorize it. Simply keep in mind that if we use a kernel of size 7×7, we need to set the padding to 3 to preserve the spatial dimension. Meanwhile, for 5×5, 3×3 and 1×1 kernels, the padding should be set to 2, 1, and 0, respectively.
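For reference, the padding values I asked you to memorize all come from the usual "same" padding rule for odd kernel sizes, which you can verify with this short snippet of mine:

# "same" padding for odd kernels: padding = (kernel_size - 1) // 2
for k in (7, 5, 3, 1):
    print(k, (k - 1) // 2)   # 7 -> 3, 5 -> 2, 3 -> 1, 1 -> 0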
As all layers in the backbone have been instantiated, we can now connect them using the forward() method below. I don't think I need to explain much here, since it basically just passes the input tensor x through the layers sequentially.
# Codeblock 4b
    def forward(self, x):
        print(f'original\t: {x.size()}\n')

        x = self.stage0(x)
        print(f'after stage0\t: {x.size()}\n')

        x = self.stage1(x)
        print(f'after stage1\t: {x.size()}\n')

        for i in range(len(self.stage2)):
            x = self.stage2[i](x)
            print(f'after stage2 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage3)):
            x = self.stage3[i](x)
            print(f'after stage3 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage4)):
            x = self.stage4[i](x)
            print(f'after stage4 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage5)):
            x = self.stage5[i](x)
            print(f'after stage5 #{i}\t: {x.size()}')

        return x
Now let's verify whether our implementation is correct by running the following test code.
# Codeblock 5
backbone = Backbone()
x = torch.randn(1, 3, 448, 448)
out = backbone(x)
If you run the above codeblock, the following output should appear on your screen. Here you can see that the spatial dimension of the image is correctly reduced after the last ConvBlock of each stage. This process continues all the way to the last stage until we eventually obtain a tensor of size 1024×7×7, which matches exactly with the illustration in Figure 6.
# Codeblock 5 Output
original        : torch.Size([1, 3, 448, 448])
after stage0    : torch.Size([1, 64, 112, 112])
after stage1    : torch.Size([1, 192, 56, 56])
after stage2 #0 : torch.Size([1, 128, 56, 56])
after stage2 #1 : torch.Size([1, 256, 56, 56])
after stage2 #2 : torch.Size([1, 256, 56, 56])
after stage2 #3 : torch.Size([1, 512, 28, 28])
after stage3 #0 : torch.Size([1, 256, 28, 28])
after stage3 #1 : torch.Size([1, 512, 28, 28])
after stage3 #2 : torch.Size([1, 256, 28, 28])
after stage3 #3 : torch.Size([1, 512, 28, 28])
after stage3 #4 : torch.Size([1, 256, 28, 28])
after stage3 #5 : torch.Size([1, 512, 28, 28])
after stage3 #6 : torch.Size([1, 256, 28, 28])
after stage3 #7 : torch.Size([1, 512, 28, 28])
after stage3 #8 : torch.Size([1, 512, 28, 28])
after stage3 #9 : torch.Size([1, 1024, 14, 14])
after stage4 #0 : torch.Size([1, 512, 14, 14])
after stage4 #1 : torch.Size([1, 1024, 14, 14])
after stage4 #2 : torch.Size([1, 512, 14, 14])
after stage4 #3 : torch.Size([1, 1024, 14, 14])
after stage4 #4 : torch.Size([1, 1024, 14, 14])
after stage4 #5 : torch.Size([1, 1024, 7, 7])
after stage5 #0 : torch.Size([1, 1024, 7, 7])
after stage5 #1 : torch.Size([1, 1024, 7, 7])
The Fully-Connected Layers
After the backbone is done, we can now move on to the fully-connected part, which I write in Codeblock 6 below. This part of the network is very simple, since it essentially consists of only two linear layers. Speaking of the details, it is mentioned in the paper that the authors apply a dropout layer with a rate of 0.5 (#(3)) between the first (#(1)) and the second (#(4)) linear layers. It is important to note that the leaky ReLU activation function is still used (#(2)), but only after the first linear layer. This is because the second one acts as the output layer, hence it does not require any activation applied to it.
# Codeblock 6
class FullyConnected(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear0 = nn.Linear(in_features=1024*7*7, out_features=4096)     #(1)
        self.leaky_relu = nn.LeakyReLU(negative_slope=0.1)                    #(2)
        self.dropout = nn.Dropout(p=0.5)                                      #(3)
        self.linear1 = nn.Linear(in_features=4096, out_features=(C+B*5)*S*S)  #(4)

    def forward(self, x):
        print(f'original\t: {x.size()}')
        x = self.linear0(x)
        print(f'after linear0\t: {x.size()}')
        x = self.leaky_relu(x)
        x = self.dropout(x)
        x = self.linear1(x)
        print(f'after linear1\t: {x.size()}')
        return x
Run Codeblock 7 below to see how the tensor transforms as it is processed by the stack of linear layers.
# Codeblock 7
fc = FullyConnected()
x = torch.randn(1, 1024*7*7)
out = fc(x)
# Codeblock 7 Output
original      : torch.Size([1, 50176])
after linear0 : torch.Size([1, 4096])
after linear1 : torch.Size([1, 1470])
We can see in the above output that the fc block takes an input of length 50176, which is essentially the flattened 1024×7×7 tensor. The linear0 layer maps this input into a 4096-dimensional vector, and the linear1 layer then maps it further to 1470. Later, in the post-processing stage, we need to reshape it to 30×7×7 so that we can easily take the bounding box and the object classification results. Technically speaking, this reshaping can be done either inside the model or outside the model. For the sake of simplicity, I decided to leave the output flattened, which means the reshaping will be handled externally.
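To give you a picture of what that external post-processing might look like, here is a rough sketch of reshaping and slicing the flattened output. This is my own illustration, reusing S, B and C from Codeblock 1 and assuming the per-cell layout discussed earlier, with the second box appended at the end:

# A sketch (not from the original article) of reshaping and slicing the output.
out = torch.randn(1, (C + B * 5) * S * S)        # dummy flattened model output

out = out.reshape(-1, C + B * 5, S, S)           # (1, 30, 7, 7)

class_probs = out[:, :C]                         # (1, 20, 7, 7)
conf1, box1 = out[:, C],   out[:, C+1:C+5]       # first box per cell
conf2, box2 = out[:, C+5], out[:, C+6:C+10]      # second box per cell

print(class_probs.shape, box1.shape, box2.shape)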
Connecting the FC Part to the Backbone
At this point we already have both the backbone and the fully-connected layers done, so they are now ready to be assembled into the complete YOLOv1 architecture. There is not much I can explain regarding the following code, as all we do here is instantiate both parts and connect them in the forward() method. Just don't forget to flatten (#(1)) the output of the backbone to make it compatible with the input of the fc block.
# Codeblock 8
class YOLOv1(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = Backbone()
        self.fc = FullyConnected()

    def forward(self, x):
        x = self.backbone(x)
        x = torch.flatten(x, start_dim=1)  #(1)
        x = self.fc(x)
        return x
In order to test our model, we can simply instantiate the YOLOv1 model and pass in a dummy tensor that simulates an RGB image of size 448×448 (#(1)). After feeding the tensor into the network (#(2)), I also simulate the post-processing step by reshaping the output tensor to 30×7×7, as shown at line #(3).
# Codeblock 9
yolov1 = YOLOv1()
x = torch.randn(1, 3, 448, 448) #(1)
out = yolov1(x) #(2)
out = out.reshape(-1, C+B*5, S, S) #(3)
And below is what the output looks like after the code above is run. Here you can see that our input tensor successfully flows through all layers of the entire network, indicating that our YOLOv1 model works properly and is thus ready to be trained.
# Codeblock 9 Output
original        : torch.Size([1, 3, 448, 448])
after stage0    : torch.Size([1, 64, 112, 112])
after stage1    : torch.Size([1, 192, 56, 56])
after stage2 #0 : torch.Size([1, 128, 56, 56])
after stage2 #1 : torch.Size([1, 256, 56, 56])
after stage2 #2 : torch.Size([1, 256, 56, 56])
after stage2 #3 : torch.Size([1, 512, 28, 28])
after stage3 #0 : torch.Size([1, 256, 28, 28])
after stage3 #1 : torch.Size([1, 512, 28, 28])
after stage3 #2 : torch.Size([1, 256, 28, 28])
after stage3 #3 : torch.Size([1, 512, 28, 28])
after stage3 #4 : torch.Size([1, 256, 28, 28])
after stage3 #5 : torch.Size([1, 512, 28, 28])
after stage3 #6 : torch.Size([1, 256, 28, 28])
after stage3 #7 : torch.Size([1, 512, 28, 28])
after stage3 #8 : torch.Size([1, 512, 28, 28])
after stage3 #9 : torch.Size([1, 1024, 14, 14])
after stage4 #0 : torch.Size([1, 512, 14, 14])
after stage4 #1 : torch.Size([1, 1024, 14, 14])
after stage4 #2 : torch.Size([1, 512, 14, 14])
after stage4 #3 : torch.Size([1, 1024, 14, 14])
after stage4 #4 : torch.Size([1, 1024, 14, 14])
after stage4 #5 : torch.Size([1, 1024, 7, 7])
after stage5 #0 : torch.Size([1, 1024, 7, 7])
after stage5 #1 : torch.Size([1, 1024, 7, 7])
original        : torch.Size([1, 50176])
after linear0   : torch.Size([1, 4096])
after linear1   : torch.Size([1, 1470])
torch.Size([1, 30, 7, 7])
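Out of curiosity, we can also count the trainable parameters of the model we just built. This check is mine, not something reported in the article, and the number may vary slightly depending on how you reconstruct the layers:

num_params = sum(p.numel() for p in yolov1.parameters() if p.requires_grad)
print(f'{num_params:,}')   # should come out to roughly 270 million, dominated by the first linear layer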
Ending
It is worth noting that all the code I showed you throughout this article is for the base YOLOv1 architecture. It is mentioned in the paper that the authors also proposed a lite version of this model, which they refer to as Fast YOLO. This smaller YOLOv1 variant offers faster computation since it consists of only 9 convolution layers instead of 24. Unfortunately, the paper does not provide the implementation details, so I cannot demonstrate how to implement that one.
I encourage you to play around with the above code. In theory, it is possible to replace the CNN-based backbone with other deep learning models, such as ResNet, ResNeXt, ViT, and so on. All you need to do is match the output shape of the backbone with the input shape of the fully-connected part. Not only that, I also want you to try training this model from scratch. But if you decide to do so, you might want to make the model smaller by reducing its depth (number of convolution layers) or its width (number of kernels). This is mainly because the authors mention that they needed around a week just to do the pretraining on the ImageNet dataset, not to mention the time for fine-tuning on the object detection task.
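Just to give you an idea of what such a swap might look like, below is a rough sketch that replaces our backbone with a ResNet-50 from torchvision and projects its feature map back to the 1024×7×7 shape our FullyConnected part expects. Treat it as a starting point under these assumptions rather than a tested implementation:

import torch
import torch.nn as nn
from torchvision.models import resnet50

# A sketch of the backbone-swapping idea (my own, not from the original article).
class ResNetBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = resnet50(weights=None)                                # or load pretrained weights
        self.features = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool & fc
        self.pool = nn.AdaptiveAvgPool2d((7, 7))                      # force a 7x7 grid
        self.proj = nn.Conv2d(2048, 1024, kernel_size=1)              # match the channel count

    def forward(self, x):
        x = self.features(x)   # (N, 2048, 14, 14) for a 448x448 input
        x = self.pool(x)       # (N, 2048, 7, 7)
        x = self.proj(x)       # (N, 1024, 7, 7)
        return x

# quick shape check
x = torch.randn(1, 3, 448, 448)
print(ResNetBackbone()(x).size())   # torch.Size([1, 1024, 7, 7])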
And well, I think that is pretty much everything I can explain to you about how YOLOv1 works and its architecture. Please let me know if you spot any mistakes in this article. Thanks!
By the way, the code used in this article is also available on my GitHub repo [7].
References
[1] Joseph Redmon et al. You Only Look Once: Unified, Real-Time Object Detection. arXiv. https://arxiv.org/pdf/1506.02640 [Accessed July 5, 2025].
[2] Ross Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv. https://arxiv.org/pdf/1311.2524 [Accessed July 5, 2025].
[3] Mengqi Lei et al. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. arXiv. https://arxiv.org/abs/2506.17733 [Accessed July 5, 2025].
[4] Image generated by the author with Gemini, edited by the author.
[5] Image originally created by the author.
[6] Bing Xu et al. Empirical Evaluation of Rectified Activations in Convolutional Network. arXiv. https://arxiv.org/pdf/1505.00853 [Accessed July 5, 2025].
[7] MuhammadArdiPutra. The Day YOLO First Saw the World - YOLOv1. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20Day%20YOLO%20First%20Saw%20the%20World%20-%20YOLOv1.ipynb [Accessed July 7, 2025].
