— that’s the formidable title the authors chose for their paper introducing both YOLOv2 and YOLO9000. The title of the paper itself is “YOLO9000: Better, Faster, Stronger” [1], which was published back in December 2016. The main focus of this paper is indeed to create YOLO9000. But let’s make things clear: despite the title of the paper, the model proposed in the study is called YOLOv2. YOLO9000 is their proposed algorithm specialized to detect over 9000 object classes, built on top of the YOLOv2 architecture.
In this article I am going to focus on how YOLOv2 works and how to implement the architecture from scratch with PyTorch. I will also talk a little bit about how the authors eventually ended up with YOLO9000.
From YOLOv1 to YOLOv2
As the name suggests, YOLOv2 is an improvement of YOLOv1. Thus, in order to understand YOLOv2, I recommend you read my previous article about YOLOv1 [2] and its loss function [3] before reading this one.
The authors raised two main problems with YOLOv1: first, the high localization error, in other words the bounding box predictions made by the model are not quite accurate. Second, the low recall, a situation where the model is unable to detect all objects within the image. The authors made a number of modifications to YOLOv1 to address these issues, which are summarized in Figure 1. We are going to discuss each of these modifications one by one in the following sub-sections.
Batch Normalization
The first modification the authors made was applying batch normalization. Keep in mind that YOLOv1 is quite old: it was introduced back when the BN layer was not yet popular, which is why YOLOv1 did not utilize this normalization mechanism in the first place. It is well established that BN stabilizes training, speeds up convergence, and regularizes the model. For this reason, the dropout layer we previously had in YOLOv1 was omitted once BN layers were applied. It is mentioned in the paper that by attaching this kind of layer after every convolution they obtained a 2.4% improvement in mAP, from 63.4% to 65.8%.
Better Fine-Tuning
Next, the authors proposed a better way to perform fine-tuning. Previously in YOLOv1 the backbone model was pretrained on the ImageNet classification dataset, where the images had a size of 224×224. Then, they replaced the classification head with a detection head and immediately fine-tuned it on the PASCAL VOC detection dataset, which contains images of size 448×448. Here we can clearly see that there was a “jump” due to the different image resolutions in pretraining and fine-tuning. The pipeline used for training YOLOv2 is slightly modified: the authors added an intermediate step, namely fine-tuning the model on 448×448 ImageNet images before fine-tuning it again on PASCAL VOC at the same image resolution. This additional step allows the model to adapt to higher-resolution images before being fine-tuned for detection, unlike YOLOv1, where the model is forced to work on 448×448 images immediately after being pretrained on 224×224 images. This new fine-tuning pipeline increased mAP by 3.7%, from 65.8% to 69.5%.

Anchor Boxes and Fully Convolutional Network
The next modification was related to the use of anchor boxes. If you’re not yet familiar with them, an anchor box is essentially a template bounding box (a.k.a. prior box) associated with a single grid cell, which is rescaled to match the actual object size. The model is then trained to predict the offset from the anchor box rather than the bounding box coordinates directly like YOLOv1. We can think of an anchor box as the starting point the model uses to make a bounding box prediction. According to the paper, predicting offsets like this is easier than predicting coordinates, hence allowing the model to perform better. Figure 3 below illustrates 5 anchor boxes that correspond to the top-left grid cell. Later on, the same anchor boxes will be applied to all grid cells within the image.

The use of anchor boxes also changed the way object classification is done. Previously in YOLOv1 each grid cell predicted two bounding boxes, yet it could only predict a single object class. YOLOv2 addresses this issue by attaching the classification mechanism to the anchor box rather than the grid cell, allowing each anchor box from the same grid cell to predict a different object class. Mathematically speaking, the length of the prediction vector in YOLOv1 can be formulated as (B×5)+C for each grid cell, whereas in YOLOv2 this length becomes B×(5+C), where B is the number of bounding boxes to be generated, C is the number of classes in the dataset, and 5 accounts for x, y, w, h and the bounding box confidence value. With the mechanism introduced in YOLOv2, the prediction vector indeed becomes longer, but it allows each anchor box to predict its own class. The figure below illustrates the prediction vectors of YOLOv1 and YOLOv2, where we set B to 2 and C to 20. In this particular case, the lengths of the prediction vectors of the two models are (2×5)+20=30 and 2×(5+20)=50, respectively.
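Just to make the two formulas concrete, the tiny snippet below computes both vector lengths for the example above (B = 2, C = 20). This is simply an arithmetic illustration, not part of the model code.

B = 2    # number of boxes per grid cell
C = 20   # number of classes

yolov1_length = (B * 5) + C    # all boxes in a cell share one class distribution
yolov2_length = B * (5 + C)    # every anchor box carries its own class distribution

print(yolov1_length, yolov2_length)    # 30 50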

At this point the authors also replaced the fully-connected layers in YOLOv1 with a stack of convolution layers, turning the whole model into a fully convolutional network with a downsampling factor of 32. This downsampling factor causes an input tensor of size 448×448 to be reduced to 14×14. The authors argued that large objects are usually located in the middle of an image, so they made the output feature map have odd dimensions, guaranteeing that there is a single center cell to predict such objects. To achieve this, the authors changed the default input size to 416×416 so that the output feature map has a spatial resolution of 13×13.
Interestingly, the use of anchor boxes and a fully convolutional network caused mAP to decrease by 0.3%, from 69.5% to 69.2%, yet at the same time recall increased by 7%, from 81% to 88%. This improvement in recall was largely caused by the increase in the number of predictions made by the model. In the case of YOLOv1, the model could only predict 7×7=49 objects in total, whereas YOLOv2 can now predict up to 13×13×5=845 objects, where the number 5 comes from the default number of anchor boxes used. Meanwhile, the decrease in mAP indicated that there was still room for improvement in the anchor boxes.
Prior Box Clustering and Constrained Predictions
The authors indeed saw a problem in the anchor boxes, so in the next step they tried to modify the way they work. Previously in Faster R-CNN the anchor boxes were manually handpicked, which caused them to not optimally represent all object shapes in the dataset. To address this problem, the authors used K-means to cluster the distribution of bounding box sizes. They did this by taking the w and h values of the bounding boxes in the object detection dataset, putting them into a two-dimensional space, and clustering the datapoints using K-means as usual. The authors decided to use K=5, which essentially means we will later have that many clusters.
The illustration in Figure 5 below displays what the bounding box size distribution looks like, where each black datapoint represents a single bounding box in the dataset and the green circles are the centroids, which will then act as the sizes of our anchor boxes. Note that this illustration is created from dummy data, but the idea is that the datapoint located at the top-right represents a large square bounding box, the one in the top-left is a tall rectangular box, and so forth.

If you’re familiar with K-means, we normally use Euclidean distance to measure the distance between datapoints and centroids. But here the authors created a new distance metric specifically for this case, in which they used the complement of the IOU between the bounding boxes and the cluster centroids, i.e. d(box, centroid) = 1 − IOU(box, centroid). See the equation below for the details.

Using the distance metric above, we can see in the following table that the prior boxes generated using K-means clustering (highlighted in blue) have a higher average IOU compared to the prior boxes used in Faster R-CNN (highlighted in green), despite the smaller number of prior boxes (5 vs 9). This essentially indicates that the proposed clustering mechanism allows the resulting prior boxes to represent the bounding box size distribution in the dataset better than the handpicked anchor boxes.
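To make the clustering idea concrete, below is a minimal sketch of how such priors could be computed, assuming each box is represented only by its (w, h) pair and all boxes are aligned to the same corner so that the IOU depends only on their sizes, as in the paper. The distance used is d(box, centroid) = 1 − IOU(box, centroid). This is my own illustration on dummy data, not the authors’ original code.

import numpy as np

def iou_wh(boxes, centroids):
    # IOU between (w, h) pairs, assuming all boxes share the same corner.
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, None, 0] * boxes[:, None, 1] +
             centroids[None, :, 0] * centroids[None, :, 1] - inter)
    return inter / union

def kmeans_priors(boxes, k=5, iters=50):
    # Initialize the centroids with k randomly chosen boxes.
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        d = 1.0 - iou_wh(boxes, centroids)     # d(box, centroid) = 1 - IOU(box, centroid)
        assignment = d.argmin(axis=1)          # assign each box to its nearest centroid
        centroids = np.array([
            boxes[assignment == i].mean(axis=0) if np.any(assignment == i) else centroids[i]
            for i in range(k)
        ])
    return centroids

# Dummy (w, h) data just to show the mechanics; real priors come from the training labels.
boxes = np.abs(np.random.randn(1000, 2)) + 0.1
print(kmeans_priors(boxes, k=5))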

Still related to prior boxes, the authors found that predicting anchor box offsets like Faster R-CNN was actually still not quite optimal because of the unbounded equation. If we take a look at Figure 8 below, there is a possibility that the box position could be shifted wildly across the whole image, making training difficult, especially in the earlier stages.

Instead of predicting positions relative to the anchor box like Faster R-CNN, the authors solved this issue by adopting the idea from YOLOv1 of predicting location coordinates relative to the grid cell. However, they further modified this by introducing a sigmoid function to constrain the xy coordinate predictions of the network, effectively bounding the values to the range of 0 to 1 so that the predicted location never falls outside the corresponding grid cell, as shown in the first and second rows in Figure 9. Next, the w and h of the bounding box are processed with an exponential function (third and fourth rows), which is useful to prevent negative values, since a negative width or height would be nonsense. Meanwhile, the confidence score in the fifth row is computed the same way as in YOLOv1, namely by multiplying the objectness confidence by the IOU between the predicted and the target box.

So in simple terms, we indeed adopt the concept of prior boxes introduced by Faster R-CNN, but instead of handpicking the boxes, we use clustering to automatically find the most optimal prior box sizes. The bounding box is then constructed with an additional sigmoid function for x and y and an exponential function for w and h. It is worth noting that now x and y are relative to the grid cell while w and h are relative to the prior box. The authors found that this method improved mAP from 69.6% to 74.4% (the 69.6% baseline here already includes the switch to the new Darknet-19 backbone, which we will discuss shortly).
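To make the decoding rule concrete, here is a minimal sketch of how a single raw prediction (t_x, t_y, t_w, t_h, t_o) could be converted into a box, assuming (c_x, c_y) is the top-left offset of the grid cell and (p_w, p_h) is the prior size. The function name and argument layout are my own; only the equations follow the paper.

import torch

def decode_box(t_x, t_y, t_w, t_h, t_o, c_x, c_y, p_w, p_h):
    # Decode one raw prediction following the YOLOv2 equations.
    b_x = torch.sigmoid(t_x) + c_x    # center x, constrained to the cell
    b_y = torch.sigmoid(t_y) + c_y    # center y, constrained to the cell
    b_w = p_w * torch.exp(t_w)        # width relative to the prior, always positive
    b_h = p_h * torch.exp(t_h)        # height relative to the prior, always positive
    conf = torch.sigmoid(t_o)         # objectness confidence
    return b_x, b_y, b_w, b_h, conf

# Example with dummy raw outputs for the cell at column 3, row 5 and a 1.2×0.8 prior.
print(decode_box(*torch.randn(5), c_x=3.0, c_y=5.0, p_w=1.2, p_h=0.8))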
Passthrough Layer
The final output feature map of YOLOv2 has a spatial dimension of 13×13, in which each element corresponds to a single grid cell. The information contained within each grid cell is considered coarse, which totally makes sense because the maxpooling layers within the network work by taking only the highest values from the earlier feature maps. This would not be a problem if the objects to be detected are fairly large. But if the objects are small, our model might have a hard time performing the detection due to the lack of information contained in the non-prominent pixels.
To address this problem, the authors proposed the so-called passthrough layer. The objective of this layer is to preserve fine-grained information from an earlier feature map before it is downsampled by a maxpooling layer. In Figure 12, the part of the network called the passthrough layer is the connection that branches out from the network before eventually merging back into the main flow. The idea of this layer is quite similar to the identity mapping introduced in ResNet. However, the mechanism in that model is simpler because the tensor dimensions from the original flow and the skip-connection match exactly, allowing them to be summed element-wise. The case is different for the passthrough layer, where the later feature map has a smaller spatial dimension, hence we need a way to combine the information from the two tensors. The authors came up with an idea where they divide the feature map in the passthrough layer and then stack the divided tensors channel-wise, as shown in Figure 10 below. By doing so, the spatial dimension of the resulting tensor matches the subsequent feature map, allowing them to be concatenated along the channel axis. The fine-grained information from the earlier layer is then combined with the higher-level features from the later layer using a convolution layer.

Multi-Scale Training
Previously I mentioned that in YOLOv2 all FC layers were replaced with a stack of convolution layers. This essentially allows us to feed images of different scales within the same training process, considering that the weights of a CNN-based model correspond to the trainable parameters in the kernels, which are independent of the input image size. In fact, this is the reason the authors decided to remove the FC layers in the first place. During the training phase, the authors changed the input resolution every 10 batches, choosing randomly from 320×320, 352×352, 384×384, and so on up to 608×608, all multiples of 32. This process can be thought of as their way of augmenting the data so that the model can detect objects across varying input dimensions, which I believe also allows the model to predict objects of different scales with better performance. This boosted mAP to 76.8% at the default input resolution of 416×416, and it got even higher, 78.6%, when they increased the image resolution further to 544×544.
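Below is a rough sketch of how such a schedule could look in PyTorch, assuming we simply resize each batch with bilinear interpolation; the authors’ actual data pipeline (which resizes the network input inside Darknet) may differ in detail, so treat this only as an illustration of the idea.

import random
import torch
import torch.nn.functional as F

resolutions = [320 + 32 * i for i in range(10)]    # 320, 352, ..., 608 (all multiples of 32)
size = 416

for batch_idx in range(100):                       # placeholder training loop
    if batch_idx % 10 == 0:                        # pick a new resolution every 10 batches
        size = random.choice(resolutions)
    images = torch.randn(8, 3, 416, 416)           # dummy batch standing in for real images
    images = F.interpolate(images, size=(size, size),
                           mode='bilinear', align_corners=False)
    # ... forward pass, loss computation, and backward pass would go here ...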
Darknet-19
All the modifications to YOLOv1 discussed in the previous sub-sections were related to how the authors improved detection quality in terms of mAP and recall. The focus of this sub-section is improving model performance in terms of speed. It is mentioned in the paper that the authors use a model called Darknet-19 as the backbone, which requires fewer operations than the backbone of YOLOv1 (5.58 billion vs 8.52 billion), allowing YOLOv2 to run faster than its predecessor. The original version of this model consists of 19 convolution layers and 5 maxpooling layers, the details of which can be seen in Figure 11 below.

It is important to note that the above architecture is the vanilla Darknet-19 model, which is only suitable for the classification task. To adapt it to the requirements of YOLOv2, we need to slightly modify it by adding the passthrough layer and replacing the classification head with a detection head. You can see the modified architecture in Figure 12 below.

Here you can see that the passthrough layer is placed after the last 26×26 feature map. This passthrough layer reduces the spatial dimension to 13×13, allowing it to be concatenated channel-wise with the 13×13 feature map from the main flow. Later in the next section I am going to demonstrate how to implement this Darknet-19 architecture from scratch, together with the detection head as well as the passthrough layer.
9000-Class Object Detection
The YOLOv2 model was originally trained on the PASCAL VOC and COCO datasets, which have 20 and 80 object classes, respectively. The authors saw this as a problem because this number is very limited for the general case, and hence lacks versatility. For this reason, it was necessary to improve the model so that it could detect a wider variety of object classes. However, creating an object detection dataset is very expensive and laborious, because we are required to annotate not only the object classes but also the bounding box information.
The authors came up with a very clever idea, where they combined ImageNet, which has over 22,000 classes, with COCO using a class hierarchy mechanism which they refer to as WordTree, as shown in Figure 13. You can see in the figure that the blue nodes are the classes from the COCO dataset, while the red ones are from the ImageNet dataset. The object categories available in the COCO dataset are relatively general, whereas those in ImageNet are much more fine-grained. For instance, where COCO has airplane, ImageNet has biplane, jet, airbus, and stealth fighter. So using the idea of WordTree, the authors put these four airplane types as subclasses of airplane. You can think of the inference like this: the model predicts the bounding box and the parent class, then it checks whether that class has subclasses. If so, the model continues predicting from that smaller subset of classes.
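As a rough illustration of this idea, the sketch below builds a tiny, hypothetical two-level tree and computes the probability of a leaf class by multiplying conditional probabilities along its path, with a separate softmax over each group of siblings as described in the paper. The class groups and logits here are made up purely for the demo.

import torch

# Hypothetical logits for one prediction: one group for the COCO-level classes
# and one group for the ImageNet-level children of "airplane".
coco_logits  = torch.randn(3)    # [airplane, dog, car]
child_logits = torch.randn(4)    # [biplane, jet, airbus, stealth fighter]

# WordTree applies a separate softmax over each group of sibling classes.
p_coco           = torch.softmax(coco_logits, dim=0)
p_child_airplane = torch.softmax(child_logits, dim=0)

# Pr(jet) = Pr(jet | airplane) * Pr(airplane)
p_jet = p_child_airplane[1] * p_coco[0]
print(p_jet)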
By combining the two datasets like this, we eventually end up with a model that is capable of predicting over 9000 object classes (9,418 to be exact), hence the name YOLO9000.

YOLOv2 Architecture Implementation
As I promised earlier, in this section I am going to demonstrate how to implement the YOLOv2 architecture from scratch so that you get a better understanding of how an input image eventually turns into a tensor containing bounding box and class predictions.
The first thing we need to do is import the required modules, as shown in Codeblock 1 below.
# Codeblock 1
import torch
import torch.nn as nn
Next, we create the ConvBlock class, which encapsulates the convolution layer itself, a batch normalization layer, and a leaky ReLU activation function. The negative_slope parameter is set to 0.1 as shown at line #(1), which is exactly the same as the one used in YOLOv1.
# Codeblock 2
class ConvBlock(nn.Module):
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size,
                 padding):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=in_channels,
                              out_channels=out_channels,
                              kernel_size=kernel_size,
                              padding=padding)
        self.bn = nn.BatchNorm2d(num_features=out_channels)
        self.leaky_relu = nn.LeakyReLU(negative_slope=0.1)    #(1)

    def forward(self, x):
        print(f'original\t\t: {x.size()}')
        x = self.conv(x)
        print(f'after conv\t\t: {x.size()}')
        x = self.bn(x)    # batch normalization keeps the tensor shape unchanged
        x = self.leaky_relu(x)
        print(f'after leaky relu\t: {x.size()}')
        return x
Just to check whether the above class works properly, here I test it with a very simple case, where I initialize a ConvBlock instance that accepts an RGB image of size 416×416. You can see in the resulting output that the image now has 64 channels, proving that our ConvBlock works properly.
# Codeblock 3
convblock = ConvBlock(in_channels=3,
                      out_channels=64,
                      kernel_size=3,
                      padding=1)

x = torch.randn(1, 3, 416, 416)
out = convblock(x)
# Codeblock 3 Output
original         : torch.Size([1, 3, 416, 416])
after conv       : torch.Size([1, 64, 416, 416])
after leaky relu : torch.Size([1, 64, 416, 416])
Darknet-19 Implementation
Now let’s use this ConvBlock class to construct the Darknet-19 architecture. The way to do so is fairly straightforward, as what we need to do is simply stack multiple ConvBlock instances followed by a maxpooling layer according to the architecture in Figure 12. See the details in Codeblock 4a below. Note that the maxpooling layer for stage4 is placed at the beginning of stage5, as shown at the line marked with #(1). This is done because the output of stage4 will directly be fed into the passthrough layer without being downsampled. In addition, it is important to note that the term “stage” is not officially mentioned in the paper; rather, it is just a term I personally use for the sake of this implementation.
# Codeblock 4a
class Darknet(nn.Module):
    def __init__(self):
        super(Darknet, self).__init__()

        self.stage0 = nn.ModuleList([
            ConvBlock(3, 32, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])
        self.stage1 = nn.ModuleList([
            ConvBlock(32, 64, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])
        self.stage2 = nn.ModuleList([
            ConvBlock(64, 128, 3, 1),
            ConvBlock(128, 64, 1, 0),
            ConvBlock(64, 128, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])
        self.stage3 = nn.ModuleList([
            ConvBlock(128, 256, 3, 1),
            ConvBlock(256, 128, 1, 0),
            ConvBlock(128, 256, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])
        self.stage4 = nn.ModuleList([
            ConvBlock(256, 512, 3, 1),
            ConvBlock(512, 256, 1, 0),
            ConvBlock(256, 512, 3, 1),
            ConvBlock(512, 256, 1, 0),
            ConvBlock(256, 512, 3, 1),
        ])
        self.stage5 = nn.ModuleList([
            nn.MaxPool2d(kernel_size=2, stride=2),    #(1)
            ConvBlock(512, 1024, 3, 1),
            ConvBlock(1024, 512, 1, 0),
            ConvBlock(512, 1024, 3, 1),
            ConvBlock(1024, 512, 1, 0),
            ConvBlock(512, 1024, 3, 1),
        ])
With all layers initialized, the next thing to do is connect them in the forward() method in Codeblock 4b below. Previously I said that we will take the output of stage4 as the input for the passthrough layer. To do so, I store the feature map produced by the last layer of stage4 in a separate variable which I refer to as x_stage4 (#(1)). We then do the same thing for the output of stage5 (#(2)) and return both x_stage4 and x_stage5 as the output of our Darknet (#(3)).
# Codeblock 4b
    def forward(self, x):
        print(f'original\t: {x.size()}')
        print()

        for i in range(len(self.stage0)):
            x = self.stage0[i](x)
            print(f'after stage0 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage1)):
            x = self.stage1[i](x)
            print(f'after stage1 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage2)):
            x = self.stage2[i](x)
            print(f'after stage2 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage3)):
            x = self.stage3[i](x)
            print(f'after stage3 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage4)):
            x = self.stage4[i](x)
            print(f'after stage4 #{i}\t: {x.size()}')
        x_stage4 = x.clone()    #(1)
        print()

        for i in range(len(self.stage5)):
            x = self.stage5[i](x)
            print(f'after stage5 #{i}\t: {x.size()}')
        x_stage5 = x.clone()    #(2)

        return x_stage4, x_stage5    #(3)
Next, I test the Darknet-19 model above by passing the same dummy image as the one in our previous test case.
# Codeblock 5
darknet = Darknet()
x = torch.randn(1, 3, 416, 416)
out = darknet(x)
# Codeblock 5 Output
original        : torch.Size([1, 3, 416, 416])
after stage0 #0 : torch.Size([1, 32, 416, 416])
after stage0 #1 : torch.Size([1, 32, 208, 208])
after stage1 #0 : torch.Size([1, 64, 208, 208])
after stage1 #1 : torch.Size([1, 64, 104, 104])
after stage2 #0 : torch.Size([1, 128, 104, 104])
after stage2 #1 : torch.Size([1, 64, 104, 104])
after stage2 #2 : torch.Size([1, 128, 104, 104])
after stage2 #3 : torch.Size([1, 128, 52, 52])
after stage3 #0 : torch.Size([1, 256, 52, 52])
after stage3 #1 : torch.Size([1, 128, 52, 52])
after stage3 #2 : torch.Size([1, 256, 52, 52])
after stage3 #3 : torch.Size([1, 256, 26, 26])
after stage4 #0 : torch.Size([1, 512, 26, 26])
after stage4 #1 : torch.Size([1, 256, 26, 26])
after stage4 #2 : torch.Size([1, 512, 26, 26])
after stage4 #3 : torch.Size([1, 256, 26, 26])
after stage4 #4 : torch.Size([1, 512, 26, 26])
after stage5 #0 : torch.Size([1, 512, 13, 13])
after stage5 #1 : torch.Size([1, 1024, 13, 13])
after stage5 #2 : torch.Size([1, 512, 13, 13])
after stage5 #3 : torch.Size([1, 1024, 13, 13])
after stage5 #4 : torch.Size([1, 512, 13, 13])
after stage5 #5 : torch.Size([1, 1024, 13, 13])
Here we can see that our output matches exactly with the architectural details in Figure 12, indicating that our implementation of the Darknet-19 model is correct.
The Complete YOLOv2 Architecture
Before actually constructing the whole YOLOv2 architecture, we need to initialize the parameters of the model first. Here we want every cell to generate 5 anchor boxes, hence we set the NUM_ANCHORS variable to that number. Next, I set NUM_CLASSES to 20 because we assume that we want to train the model on the PASCAL VOC dataset.
# Codeblock 6
NUM_ANCHORS = 5
NUM_CLASSES = 20
Now it’s time to define the YOLOv2 class. In Codeblock 7a below, we first define the __init__() method, where we initialize the Darknet model (#(1)), a single ConvBlock for the passthrough layer (#(2)), a stack of two convolution layers which I refer to as stage6 (#(3)), and another stack of two convolution layers, the last of which maps the tensor into a prediction vector with B×(5+C) channels (#(4)).
# Codeblock 7a
class YOLOv2(nn.Module):
    def __init__(self):
        super().__init__()

        self.darknet = Darknet()    #(1)
        self.passthrough = ConvBlock(512, 64, 1, 0)    #(2)

        self.stage6 = nn.ModuleList([    #(3)
            ConvBlock(1024, 1024, 3, 1),
            ConvBlock(1024, 1024, 3, 1),
        ])
        self.stage7 = nn.ModuleList([
            ConvBlock(1280, 1024, 3, 1),
            ConvBlock(1024, NUM_ANCHORS*(5+NUM_CLASSES), 1, 0)    #(4)
        ])
Afterwards, we define the so-called reorder() method, which we will use to process the feature map in the passthrough layer. The logic of the code below is somewhat complicated, but the main idea is that it follows the principle given in Figure 10. Here I show the resulting shape of each line so that you get a better understanding of how the process goes inside the function, given an input tensor of shape 1×64×26×26, which represents a single 26×26 feature map with 64 channels. In the last step we can see that the final output tensor has the shape 1×256×13×13. This shape matches our requirement exactly: the channel dimension becomes 4 times larger than that of the input while the spatial dimensions are halved.
# Codeblock 7b
    def reorder(self, x, scale=2):                      # ([1, 64, 26, 26])
        B, C, H, W = x.shape
        h, w = H // scale, W // scale

        x = x.reshape(B, C, h, scale, w, scale)         # ([1, 64, 13, 2, 13, 2])
        x = x.transpose(3, 4)                           # ([1, 64, 13, 13, 2, 2])
        x = x.reshape(B, C, h * w, scale * scale)       # ([1, 64, 169, 4])
        x = x.transpose(2, 3)                           # ([1, 64, 4, 169])
        x = x.reshape(B, C, scale * scale, h, w)        # ([1, 64, 4, 13, 13])
        x = x.transpose(1, 2)                           # ([1, 4, 64, 13, 13])
        x = x.reshape(B, scale * scale * C, h, w)       # ([1, 256, 13, 13])

        return x
Next, Codeblock 7c below shows how we create the flow of the network. We start from the Darknet backbone, which returns x_stage4 and x_stage5 (#(1)). The x_stage5 tensor will directly be processed by the subsequent convolution layers, which I refer to as stage6 (#(2)), while the x_stage4 tensor will be passed to the passthrough layer (#(3)) and processed by the reorder() method (#(4)) we defined in Codeblock 7b above. Afterwards, we concatenate both tensors channel-wise at line #(5). This concatenated tensor is then processed further with another stack of convolution layers referred to as stage7 (#(6)), which returns the prediction vector.
# Codeblock 7c
    def forward(self, x):
        print(f'original\t\t\t: {x.size()}')

        x_stage4, x_stage5 = self.darknet(x)    #(1)
        print(f'\nx_stage4\t\t\t: {x_stage4.size()}')
        print(f'x_stage5\t\t\t: {x_stage5.size()}')
        print()

        x = x_stage5
        for i in range(len(self.stage6)):
            x = self.stage6[i](x)    #(2)
            print(f'x_stage5 after stage6 #{i}\t: {x.size()}')

        x_stage4 = self.passthrough(x_stage4)    #(3)
        print(f'\nx_stage4 after passthrough\t: {x_stage4.size()}')
        x_stage4 = self.reorder(x_stage4)    #(4)
        print(f'x_stage4 after reorder\t\t: {x_stage4.size()}')

        x = torch.cat([x_stage4, x], dim=1)    #(5)
        print(f'\nx after concatenate\t\t: {x.size()}')

        for i in range(len(self.stage7)):    #(6)
            x = self.stage7[i](x)
            print(f'x after stage7 #{i}\t\t: {x.size()}')

        return x
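Before testing the whole model, here is a quick sanity check of my own (not part of the original codeblocks) confirming that reorder() only rearranges values into the expected shape.

# Sanity check for reorder(): the values are only rearranged, never changed.
model = YOLOv2()
x = torch.randn(1, 64, 26, 26)
y = model.reorder(x)

print(y.size())    # torch.Size([1, 256, 13, 13])
print(torch.equal(x.flatten().sort().values, y.flatten().sort().values))    # True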
Now, to test the complete YOLOv2 model, we again pass a dummy tensor of size 1×3×416×416.
# Codeblock 8
yolov2 = YOLOv2()
x = torch.randn(1, 3, 416, 416)
out = yolov2(x)
And below is what the output looks like after the code is run. The outputs labeled stage0 to stage5 are the processes inside the Darknet backbone, which are exactly the same as the ones I showed you earlier in the Codeblock 5 output. Afterwards we can see in stage6 that the shape of the x_stage5 tensor does not change at all (#(1–3)). Meanwhile, the channel dimension of x_stage4 increases from 64 to 256 after being processed by the reorder() operation (#(4–5)). The tensor from the main flow is then concatenated with the one from the passthrough layer, which causes the number of channels in the resulting tensor to become 1024+256=1280 (#(6)). Finally, we pass the tensor through stage7, which returns a prediction tensor of size 125×13×13, denoting that we have 13×13 grid cells, where every single one of these cells contains a prediction vector of length 125 (#(7)), storing the bounding box and object class predictions.
# Codeblock 8 Output
original                   : torch.Size([1, 3, 416, 416])
after stage0 #0            : torch.Size([1, 32, 416, 416])
after stage0 #1            : torch.Size([1, 32, 208, 208])
after stage1 #0            : torch.Size([1, 64, 208, 208])
after stage1 #1            : torch.Size([1, 64, 104, 104])
after stage2 #0            : torch.Size([1, 128, 104, 104])
after stage2 #1            : torch.Size([1, 64, 104, 104])
after stage2 #2            : torch.Size([1, 128, 104, 104])
after stage2 #3            : torch.Size([1, 128, 52, 52])
after stage3 #0            : torch.Size([1, 256, 52, 52])
after stage3 #1            : torch.Size([1, 128, 52, 52])
after stage3 #2            : torch.Size([1, 256, 52, 52])
after stage3 #3            : torch.Size([1, 256, 26, 26])
after stage4 #0            : torch.Size([1, 512, 26, 26])
after stage4 #1            : torch.Size([1, 256, 26, 26])
after stage4 #2            : torch.Size([1, 512, 26, 26])
after stage4 #3            : torch.Size([1, 256, 26, 26])
after stage4 #4            : torch.Size([1, 512, 26, 26])
after stage5 #0            : torch.Size([1, 512, 13, 13])
after stage5 #1            : torch.Size([1, 1024, 13, 13])
after stage5 #2            : torch.Size([1, 512, 13, 13])
after stage5 #3            : torch.Size([1, 1024, 13, 13])
after stage5 #4            : torch.Size([1, 512, 13, 13])
after stage5 #5            : torch.Size([1, 1024, 13, 13])
x_stage4                   : torch.Size([1, 512, 26, 26])
x_stage5                   : torch.Size([1, 1024, 13, 13])    #(1)
x_stage5 after stage6 #0   : torch.Size([1, 1024, 13, 13])    #(2)
x_stage5 after stage6 #1   : torch.Size([1, 1024, 13, 13])    #(3)
x_stage4 after passthrough : torch.Size([1, 64, 26, 26])      #(4)
x_stage4 after reorder     : torch.Size([1, 256, 13, 13])     #(5)
x after concatenate        : torch.Size([1, 1280, 13, 13])    #(6)
x after stage7 #0          : torch.Size([1, 1024, 13, 13])
x after stage7 #1          : torch.Size([1, 125, 13, 13])     #(7)
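As a closing note, the raw 125-channel output still needs to be decoded before it can be used. The sketch below shows one possible way to unpack it, assuming each anchor occupies a contiguous block of 5 + NUM_CLASSES channels laid out as [x, y, w, h, confidence, class scores]; this layout is my own assumption for illustration, and the actual ordering depends on how the loss function is written.

# One possible way to unpack the raw prediction tensor for post-processing.
pred = out.reshape(1, NUM_ANCHORS, 5 + NUM_CLASSES, 13, 13)    # (1, 5, 25, 13, 13)

txy        = torch.sigmoid(pred[:, :, 0:2])    # x, y offsets, bounded to each grid cell
twh        = pred[:, :, 2:4]                   # raw w, h, combined with the priors via exp()
confidence = torch.sigmoid(pred[:, :, 4:5])    # objectness confidence
class_lgts = pred[:, :, 5:]                    # class logits, softmax over the 20 classes

print(txy.shape, twh.shape, confidence.shape, class_lgts.shape)
# torch.Size([1, 5, 2, 13, 13]) torch.Size([1, 5, 2, 13, 13])
# torch.Size([1, 5, 1, 13, 13]) torch.Size([1, 5, 20, 13, 13])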
Ending
I think that’s pretty much everything about YOLOv2 and its model architecture implementation from scratch. The code used in this article is also available on my GitHub repository [6]. Please let me know if you spot any mistake in my explanation or in the code. Thanks for reading, and I hope you learned something new from this article. See ya in my next writing!
References
[1] Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster, Stronger. arXiv. https://arxiv.org/abs/1612.08242 [Accessed August 9, 2025].
[2] Muhammad Ardi. YOLOv1 Paper Walkthrough: The Day YOLO First Saw the World. Medium. https://medium.com/ai-advances/yolov1-paper-walkthrough-the-day-yolo-first-saw-the-world-ccff8b60d84b [Accessed January 24, 2026].
[3] Muhammad Ardi. YOLOv1 Loss Function Walkthrough: Regression for All. Towards Data Science. https://towardsdatascience.com/yolov1-loss-function-walkthrough-regression-for-all/ [Accessed January 24, 2026].
[4] Image originally created by author, partially generated with Gemini
[5] Image originally created by author
[6] MuhammadArdiPutra. Better, Faster, Stronger — YOLOv2 and YOLO9000. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/Better%2C%20Faster%2C%20Stronger%20-%20YOLOv2%20and%20YOLO9000.ipynb [Accessed August 9, 2025].
