    MobileNetV1 Paper Walkthrough: The Tiny Giant



    Introduction

    Deep learning researchers used to focus on improving accuracy. They kept pushing the limit higher and higher until they eventually realized that the computational complexity of their models was becoming increasingly expensive. This was definitely a problem that needed to be addressed, because we want deep learning models to work not only on high-end computers but also on small devices. To overcome this challenge, Howard et al. back in 2017 proposed an extremely lightweight neural network model called MobileNet, which they introduced in a paper titled MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [1]. In fact, the model proposed in the paper is the first version of MobileNet, which is commonly known as MobileNetV1. Today we already have four MobileNet versions: MobileNetV1 all the way to MobileNetV4. However, in this article we are only going to focus on MobileNetV1, covering the idea behind the architecture and how to implement it from scratch with PyTorch. I'll save the later MobileNet versions for my upcoming articles.


    Depthwise Separable Convolution

    In order to achieve a lightweight model, MobileNet leverages the idea of depthwise separable convolution, which is used throughout almost the entire network. Figure 1 below displays the structural difference between this layer (right) and a standard convolution layer (left). You can see in the figure that a depthwise separable convolution basically comprises two types of convolution layers: depthwise convolution and pointwise convolution. In addition to that, we typically follow the conv-BN-ReLU structure when constructing CNN-based models, which is essentially the reason that in the illustration we have batch normalization and ReLU right after each conv layer. We are going to discuss depthwise and pointwise convolutions more deeply in the next sections.

    Figure 1. The structure of a standard convolution layer (left) and a depthwise separable convolution layer (right) [1].

    Depthwise Convolution

    A standard convolution layer is basically a convolution with the groups parameter set to 1. It is important to remember that in this case using a 3×3 kernel actually means applying a kernel of shape C×3×3 to the input tensor, where C is the number of input channels. This kernel shape allows us to aggregate information from all channels within each 3×3 patch at once. This is the reason why the standard convolution operation is computationally expensive, but in return the output tensor contains a lot of information. If you take a closer look at Figure 2 below, a standard convolution layer corresponds to the one in the leftmost part of the tradeoff line.
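    As a quick sanity check (my own illustration, not from the paper), you can confirm this kernel shape directly in PyTorch: the weight tensor of a standard convolution has shape out_channels × in_channels × 3 × 3, so every kernel really does span all input channels.

    import torch.nn as nn
    
    # A standard convolution: every kernel sees all 16 input channels at once.
    conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
    print(conv.weight.shape)  # torch.Size([32, 16, 3, 3])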

    Figure 2. The tradeoff between fewer and more convolution groups [2].

    If you're already familiar with group convolution, depthwise convolution should be easy for you to understand. Group convolution is a technique where we divide the channels of the input tensor according to the number of groups used and apply convolution independently within each group. For instance, suppose we have an input tensor of 64 channels and want to process it with 128 kernels grouped into 2. In such a case, the first 64 kernels are responsible for processing the first 32 channels of the input tensor, while the remaining 64 kernels process the last 32 channels. This mechanism results in 64 output channels for each group. The final output tensor is obtained by concatenating the resulting tensors from all groups along the channel dimension, giving a total of 128 channels in this example.
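    Here is a minimal sketch of that exact example (mine, not from the paper): 64 input channels processed by 128 kernels split into 2 groups.

    import torch
    import torch.nn as nn
    
    # Each group of 64 kernels only sees half of the 64 input channels,
    # so every kernel has shape 32x3x3 rather than 64x3x3.
    group_conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3,
                           padding=1, groups=2, bias=False)
    print(group_conv.weight.shape)                        # torch.Size([128, 32, 3, 3])
    print(group_conv(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 128, 56, 56])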

    As we keep increasing the number of groups, we eventually reach the extreme case known as depthwise convolution, which is a special case of group convolution where the number of groups is set equal to the number of input channels. With this configuration, each channel is processed independently of the others, causing every channel in the input to produce only a single output channel. By concatenating all the resulting 1-channel tensors, the final number of output channels stays exactly the same as that of the input. This mechanism requires us to use a kernel of size 1×3×3 instead of C×3×3, preventing us from aggregating information along the channel axis. This gives us extremely lightweight computation, but in return it causes the output tensor to contain less information due to the absence of channel-wise aggregation.
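    In PyTorch, this extreme case simply means setting groups equal to the number of input channels; a brief sketch of mine:

    import torch.nn as nn
    
    # Depthwise convolution: groups == in_channels, so each kernel shrinks to 1x3x3
    # and the number of output channels equals the number of input channels.
    depthwise = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,
                          padding=1, groups=64, bias=False)
    print(depthwise.weight.shape)  # torch.Size([64, 1, 3, 3])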

    Since the goal of MobileNet is to make the computation as fast as possible, we need to place ourselves at the rightmost part of the above tradeoff line, despite capturing the least amount of information. This is definitely a problem that needs to be addressed, which is the reason why we employ pointwise convolution in the next step.

    Pointwise Convolution

    Pointwise convolution is basically just a standard convolution, except that it uses kernels of size 1×1, or to be more precise, C×1×1. This kernel shape allows us to aggregate information along the channel axis without being influenced by spatial information, effectively compensating for the limitation of depthwise convolution. Additionally, remember that depthwise convolution alone can only output a tensor with the same number of channels as its input, which limits our flexibility in designing the model architecture. By applying pointwise convolution in the next step, we can set it to return as many channels as we want, allowing us to adapt the layer to the following one as needed.
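    For illustration (my own sketch), a pointwise convolution that expands 64 channels to 128 looks like this:

    import torch
    import torch.nn as nn
    
    # 1x1 kernels mix information across channels without touching spatial structure.
    pointwise = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=1, bias=False)
    print(pointwise.weight.shape)                        # torch.Size([128, 64, 1, 1])
    print(pointwise(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 128, 56, 56])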

    We can think of depthwise convolution and pointwise convolution as two complementary processes, where the former focuses on capturing spatial relationships while the latter captures channel relationships. These two processes might seem a bit inefficient at first glance, since we could basically do both at once using a standard convolution layer. However, if we take a closer look at the computational complexity, depthwise separable convolution is far more lightweight compared to its traditional convolution counterpart. In the next section I will discuss in more detail how we can calculate the number of parameters in these two approaches, which definitely also affects the computational complexity.

    Parameter Count Calculation

    Suppose we have an image of size 3×H×W, where H and W are the height and width of the image, respectively. For the sake of this example, let's assume that we are about to process the image with 16 kernels of size 5×5, where the stride is set to 1 and the padding is set to 2 (which in this case is equivalent to padding = same). With this configuration, the size of the output tensor is going to be 16×H×W. If we use a standard convolution layer, the number of parameters will be 5×5×3×16 = 1200 (without bias), where this number is obtained based on the equation in Figure 3. The bias term is not strictly necessary in this case, but if we include it the total number of parameters is going to be (5×5×3+1) × 16 = 1216.

    Figure 3. Equation to calculate the number of parameters of a convolution layer [2].

    Now let's calculate the parameter count of the depthwise separable convolution counterpart that produces the very same tensor size. Following the same formula, we will have 5×5×1×3 = 75 for the depthwise convolution part (without bias). Or, if we also account for the biases, we will have (5×5×1+1) × 3 = 78 trainable params. In the case of depthwise convolution, the number of input channels is considered 1, since each kernel is responsible for processing a single channel only. For the pointwise convolution part, the number of parameters will be 1×1×3×16 = 48 (without bias) or (1×1×3+1) × 16 = 64 (with bias). Now, to obtain the total number of parameters of the entire depthwise separable convolution, we can simply compute 75+48 = 123 (without bias) or 78+64 = 142 (with bias). That is nearly a 90% reduction in parameter count compared with the standard convolution! In theory, such an extreme drop in parameter count should cause the model to have much lower capacity. But that's just the theory. Later I'll show you how MobileNet manages to keep up with other models in terms of accuracy.
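    If you want to double-check these numbers yourself, here is a small sanity check of my own (not from the paper) that counts the parameters with PyTorch:

    import torch.nn as nn
    
    count = lambda m: sum(p.numel() for p in m.parameters())
    
    standard  = nn.Conv2d(3, 16, kernel_size=5, padding=2, bias=False)
    depthwise = nn.Conv2d(3, 3,  kernel_size=5, padding=2, groups=3, bias=False)
    pointwise = nn.Conv2d(3, 16, kernel_size=1, bias=False)
    
    print(count(standard))                      # 1200
    print(count(depthwise), count(pointwise))   # 75 48
    print(count(depthwise) + count(pointwise))  # 123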


    The Detailed MobileNetV1 Architecture

    Figure 4 below displays the entire MobileNetV1 architecture in detail. The depthwise convolution layers are the rows marked with dw, while the pointwise convolutions are the ones with a 1×1 filter shape. Notice that every dw layer is always followed by a 1×1 convolution, indicating that the entire architecture essentially consists of depthwise separable convolutions. Additionally, if you take a closer look at the architecture, you will see that spatial downsampling is done by the depthwise convolutions that have a stride of 2 (notice the rows with s2 in the table). Here you can see that every time we reduce the spatial dimension by half, the number of channels doubles to compensate for the loss of spatial information.

    Figure 4. The entire MobileNetV1 architecture [1].

    Width and Resolution Multiplier

    The authors of MobileNet proposed a new parameter tuning mechanism by introducing the so-called width and resolution multipliers, which are formally denoted as α and ρ, respectively. The α parameter can technically be adjusted freely, but the authors suggest using either 1.0, 0.75, 0.5, or 0.25. This parameter works by reducing the number of channels produced by all convolution layers. For instance, if we set α to 0.5, the first convolution layer in the network will turn the 3-channel input into 16 channels instead of 32. Meanwhile, ρ is used to adjust the spatial dimension of the input tensor. It is important to note that even though ideally we could assign any floating-point number to this parameter, in practice it is preferable to directly pick the exact resolution of the input image. In this case, the authors suggest using either 224, 192, 160, or 128, where the input size of 224×224 corresponds to ρ = 1. The architecture displayed in Figure 4 above follows the default configuration where both α and ρ are set to 1.
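    To make the effect of α concrete, here is a tiny illustrative snippet (scale_channels is a hypothetical helper of mine, not something from the paper):

    # Hypothetical helper: apply the width multiplier to a channel count.
    def scale_channels(channels: int, alpha: float) -> int:
        return int(channels * alpha)
    
    for alpha in (1.0, 0.75, 0.5, 0.25):
        print(alpha, [scale_channels(c, alpha) for c in (32, 64, 128, 256, 512, 1024)])
    # alpha=0.5 maps the first layer's 32 output channels to 16, and so on.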


    Experimental Results

    The authors conducted a lot of experiments to prove the robustness of MobileNet. The first result to discuss is the one displayed in Figure 5 below, where in this experiment they tried to find out how the use of depthwise separable convolution layers affects performance. The second row of the table shows the result obtained by the architecture I showed you earlier in Figure 4, while the first row is the result when those layers are replaced with conventional convolutions. Here we can see that the accuracy of MobileNet with conventional CNN layers is indeed higher than that of the one using depthwise separable convolutions. However, if we take into account the number of multiplications and additions (mult-adds) as well as the parameter count, we can clearly see that the one with conventional convolution layers requires much more computational cost and memory just to gain a slight improvement in accuracy. Thus, even though depthwise separable convolutions significantly reduce the complexity of MobileNet, the authors proved that the model capacity remains high.

    Figure 5. Performance comparison between MobileNet with depthwise separable convolution layers (second row) and its full-convolution counterpart (first row) [1].

    The α and ρ parameters I explained earlier are primarily there to provide flexibility, considering that not all tasks require the highest MobileNet capability. The authors originally conducted experiments on the 1000-class ImageNet dataset, but in practice we might only need the model to perform classification on a dataset with fewer classes. In such a case, selecting lower values for the two parameters might be preferable, as it can speed up the inference process while the model still has enough capacity to accommodate the classification task. Speaking more specifically about α, using a smaller value for this parameter causes MobileNet to have lower accuracy. But that's the result on a 1000-class dataset; if our dataset is simpler and has fewer classes, using a smaller α might still be fine. In Figure 6 below, the values 1.0, 0.75, 0.5, and 0.25 written next to each model correspond to the α used.

    Figure 6. How the width multiplier affects model accuracy, number of operations, and parameter count [1].

    The same thing also applies to the ρ parameter, which is responsible for changing the resolution of the input image. Figure 7 below displays what the experimental results look like when we use different input resolutions. The results are somewhat similar to the ones in the previous figure, where the accuracy score decreases as we make the input image smaller. It is important to keep in mind that reducing the input resolution like this also reduces the number of operations but does not affect the parameter count. This is essentially because the things counted as parameters are the weights and biases, which in the case of a CNN correspond to the values inside the kernels. So, the parameter count will remain the same as long as we don't change the configuration of the convolution layers. The number of operations, on the other hand, gets reduced in accordance with the decrease in input resolution, since there are fewer pixels to process in smaller images.
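    You can verify this behavior with a single layer; in the quick check below (my own illustration), the weight count stays the same no matter the input resolution:

    import torch
    import torch.nn as nn
    
    conv = nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False)
    for size in (224, 192, 160, 128):
        out = conv(torch.randn(1, 3, size, size))
        # Always the same 432 weights; only the amount of computation changes.
        print(size, out.shape, sum(p.numel() for p in conv.parameters()))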

    Figure 7. How input resolution affects model accuracy, number of operations, and parameter count [1].

    Instead of just comparing different values of α and ρ, the authors also compared MobileNet with other popular models. We can see in Figure 8 that the largest MobileNet variant (the one using maximum α and ρ) achieved comparable accuracy to GoogLeNet (InceptionV1) and VGG16 while maintaining the lowest computational complexity. This is basically the reason I named this article The Tiny Giant: lightweight yet powerful.

    Figure 8. MobileNet achieves comparable accuracy to popular models while maintaining a much lower computational complexity and parameter count [1].

    Furthermore, the authors also compared the smaller MobileNet variant with other small models. What's interesting to me in Figure 9 is that even though the parameter count of SqueezeNet is lower than MobileNet's, the number of operations in MobileNet is over 22 times smaller than in SqueezeNet while still maintaining higher accuracy.

    Figure 9. The performance of the smaller MobileNet variant compared to popular models [1].

    MobileNetV1 Implementation

    Now that we understand the idea behind MobileNetV1, we can jump into the code. The architecture I am about to implement is based on the table in Figure 4. As always, the first thing we need to do is import the required modules.

    # Codeblock 1
    import torch
    import torch.nn as nn
    from torchinfo import summary

    Next, we initialize some configurable parameters so that we can adjust the model size according to our needs. In Codeblock 2 below, I denote α as ALPHA, whose value can be changed to 0.75, 0.5, or 0.25 if we want the model to be smaller. We don't specify any variable for ρ since we can directly change IMAGE_SIZE to 192, 160, or 128 as discussed earlier.

    # Codeblock 2
    BATCH_SIZE  = 1
    IMAGE_SIZE  = 224
    IN_CHANNELS = 3
    NUM_CLASSES = 1000
    ALPHA       = 1

    First Convolution

    If we go back to Figure 4, we can see that MobileNet essentially consists of repeating patterns, i.e., depthwise convolutions each followed by a pointwise convolution. However, notice that the first row in the figure does not follow this pattern, as it is actually just a standard convolution layer. For this reason, we need to create a separate class for it, which I refer to as FirstConv in Codeblock 3 below.

    # Codeblock 3
    class FirstConv(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(in_channels=3, 
                                  out_channels=int(32*ALPHA),    #(1)
                                  kernel_size=3,    #(2)
                                  stride=2,         #(3)
                                  padding=1,        #(4)
                                  bias=False)       #(5)
            self.bn = nn.BatchNorm2d(num_features=int(32*ALPHA))
            self.relu = nn.ReLU()
        
        def forward(self, x):
            x = self.relu(self.bn(self.conv(x)))
            return x

    Remember that MobileNet follows the conv-BN-ReLU structure. Thus, we need to initialize these three layers within the __init__() method of this class. The convolution layer itself is set to accept 3 input channels and output 32 channels. Since we want this number of output channels to be adjustable, we need to multiply it by ALPHA at the line marked with #(1). Keep in mind that we need to cast the result to an integer after the multiplication, since a floating-point channel count makes no sense. Next, at lines #(2) and #(3) we set the kernel size to 3 and the stride to 2. With this configuration, the spatial dimension of the resulting tensor is going to be half that of the input. Additionally, using a 3×3 kernel like this implicitly requires us to set the padding to 1 to achieve padding = same (#(4)). In this case we are not going to utilize the bias term, which is the reason we set the bias parameter to False (#(5)). This is actually a standard practice when using the conv-BN-ReLU structure, since at the end of the day the batch normalization layer re-centers the output of the convolution around 0 anyway, cancelling out any bias applied by the convolution kernel.
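    If you are skeptical about that last point, here is a quick numerical check of my own (not part of the article's codeblocks). In training mode, batch normalization subtracts the per-channel batch mean, so a per-channel convolution bias disappears from the output:

    import torch
    import torch.nn as nn
    
    torch.manual_seed(0)
    conv_bias   = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
    conv_nobias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
    conv_nobias.weight.data.copy_(conv_bias.weight.data)  # same weights, no bias
    
    bn1, bn2 = nn.BatchNorm2d(8), nn.BatchNorm2d(8)  # two identical fresh BN layers
    
    x = torch.randn(2, 3, 16, 16)
    print(torch.allclose(bn1(conv_bias(x)), bn2(conv_nobias(x)), atol=1e-5))  # True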

    In order to find out whether the FirstConv class works properly, we are going to test it with Codeblock 4 below. Here we initialize the layer and pass a tensor simulating a single RGB image of size 224×224. You can see in the resulting output that our convolution layer successfully downsampled the spatial dimension to 112×112 while at the same time expanding the number of channels to 32.

    # Codeblock 4
    first_conv = FirstConv()
    x = torch.randn((1, 3, 224, 224))
    
    out = first_conv(x)
    out.shape
    # Codeblock 4 Output
    torch.Size([1, 32, 112, 112])

    Depthwise Separable Convolutions

    Now that the first convolution is done, we can work on the repeating depthwise-pointwise layers. Since this pattern is the core idea of depthwise separable convolution, in the following code I wrap the two types of conv layers in a class called DepthwiseSeparableConv.

    # Codeblock 5
    class DepthwiseSeparableConv(nn.Module):
        def __init__(self, in_channels, out_channels, downsample=False):  #(1)
            super().__init__()
            
            in_channels  = int(in_channels*ALPHA)    #(2)
            out_channels = int(out_channels*ALPHA)   #(3)
            
            if downsample:    #(4)
                stride = 2
            else:
                stride = 1
            
            self.dwconv = nn.Conv2d(in_channels=in_channels,
                                    out_channels=in_channels,     #(5)
                                    kernel_size=3,                #(6)
                                    stride=stride,                #(7)
                                    padding=1,
                                    groups=in_channels,           #(8)
                                    bias=False)
            self.bn0 = nn.BatchNorm2d(num_features=in_channels)   #(9)
            
            self.pwconv = nn.Conv2d(in_channels=in_channels,
                                    out_channels=out_channels,    #(10)
                                    kernel_size=1,                #(11)
                                    stride=1,                     #(12)
                                    padding=0,                    #(13)
                                    groups=1,                     #(14)
                                    bias=False)
            self.bn1 = nn.BatchNorm2d(num_features=out_channels)  #(15)
            
            self.relu = nn.ReLU()    #(16)
    
        def forward(self, x):
            print(f'original\t: {x.size()}')
            
            x = self.relu(self.bn0(self.dwconv(x)))
            print(f'after dw conv\t: {x.size()}')
            
            x = self.relu(self.bn1(self.pwconv(x)))
            print(f'after pw conv\t: {x.size()}')
            
            return x

    Unlike FirstConv, which doesn't take any input arguments in the initialization phase, here we set the DepthwiseSeparableConv class to take several inputs, as shown at line #(1) in Codeblock 5 above. I do this because we want the class to be reusable across all depthwise separable convolution layers throughout the entire network, where each of them behaves slightly differently from the others.

    We can see in Figure 4 that after the 3-channel image is expanded to 32 channels by the first layer, this channel count increases to 64, 128, and so on, all the way to 1024 in the subsequent stages. This is basically the reason I set this class to accept the number of input and output channels (in_channels and out_channels), so that we can initialize the layer with flexible channel configurations. It is also important to remember that we need to adjust these channel counts based on ALPHA, which can simply be done using the code at lines #(2) and #(3). Additionally, here I also create a flag called downsample as an input parameter, which by default is set to False. This flag determines whether the layer will reduce the spatial dimension. Again, if you go back to Figure 4, you'll notice that there are cases where we reduce the spatial dimension by half and other cases where the dimension is preserved. Whenever we want to perform downsampling, we need to set the stride to 2; otherwise, we set this parameter to 1 instead (#(4)).

    Still within Codeblock 5 above, the next thing we need to do is initialize the layers themselves. As we discussed earlier, the depthwise convolution is responsible for capturing spatial relationships between pixels, which is exactly the reason the kernel size is set to 3×3 (#(6)). In order for the input channels to be processed independently of each other, we simply set the groups and out_channels parameters to be the same as the number of input channels itself (#(8) and #(5)). It is worth noting that if we set out_channels to be higher than the number of input channels, say twice as large, then each channel would be processed by 2 kernels, as shown in the sketch below. Finally for the depthwise convolution layer, the stride parameter at line #(7) can be either 1 or 2, determined according to the downsample flag we discussed earlier.
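    That side note is easy to verify; in this hedged sketch of mine, doubling out_channels while keeping groups equal to in_channels gives every input channel two dedicated kernels:

    import torch.nn as nn
    
    # groups == in_channels and out_channels == 2 * in_channels:
    # each of the 4 input channels is processed by 2 separate 1x3x3 kernels.
    dw = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3,
                   padding=1, groups=4, bias=False)
    print(dw.weight.shape)  # torch.Size([8, 1, 3, 3])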

    Meanwhile, the pointwise convolution uses a 1×1 kernel (#(11)) since it is not meant to capture spatial information. This is actually the reason we set the padding to 0 (#(13)): there is no way this kernel size can reduce the spatial dimension by itself. The groups parameter, on the other hand, is set to 1 (#(14)) because we want this layer to capture information from all channels at once. Unlike the depthwise convolution layer, here we can employ as many kernels as needed, which corresponds to the number of channels in the resulting output tensor (#(10)). Meanwhile, the stride is fixed to 1 (#(12)) since we will never perform downsampling with this layer.

    Here we need to initialize two separate batch normalization layers, placed after the depthwise and pointwise convs (#(9) and #(15)). As for the ReLU activation function, we only need to initialize it once (#(16)), since it is just a mapping function without any trainable parameters. Because of this, we can reuse the same ReLU instance multiple times within the network.

    Now let's see if our DepthwiseSeparableConv class works properly by passing a dummy tensor through it. Here I have prepared two test cases for this class: the first is when we don't perform downsampling and the second is when we do. In Figure 10 below, the two tests I want to perform involve the layers highlighted in green and blue, respectively.

    Figure 10. The layers highlighted in green and blue are the ones we are going to simulate to test the DepthwiseSeparableConv class [1][2].

    To create the green part, we can simply use the DepthwiseSeparableConv class and set the number of input and output channels to 32 and 64, as seen in Codeblock 6 below (#(1–2)). Passing downsample = False is not strictly necessary since it is already the default configuration (#(3)), but I do it anyway for the sake of clarity. The shape of the dummy tensor x is also configured to have the size 32×112×112, which matches exactly the input shape of the layer (#(4)).

    # Codeblock 6
    depthwise_sep_conv = DepthwiseSeparableConv(in_channels=32,     #(1)
                                                out_channels=64,    #(2)
                                                downsample=False)   #(3)
    x = torch.randn((1, int(32*ALPHA), 112, 112))                   #(4)
    
    x = depthwise_sep_conv(x)

    If you run the above code, the following output should appear on your screen. Here you can see that the depthwise convolution layer returns a tensor of the exact same shape as the input (#(1)). The number of channels then doubles from 32 to 64 after the tensor is processed by the pointwise convolution (#(2)). This result proves that our DepthwiseSeparableConv class works properly for the non-downsampling case. We'll use this output tensor in the next test as the input for the blue layer.

    # Codeblock 6 Output
    original       : torch.Size([1, 32, 112, 112])
    after dw conv  : torch.Size([1, 32, 112, 112])    #(1)
    after pw conv  : torch.Size([1, 64, 112, 112])    #(2)

    The second test is quite similar to the first one, except that here we need to configure the model based on the number of input and output channels of the blue layer. Not only that, the downsample parameter also needs to be set to True, since we want the layer to reduce the spatial dimension by half. See Codeblock 7 below for the details.

    # Codeblock 7
    depthwise_sep_conv = DepthwiseSeparableConv(in_channels=64, 
                                                out_channels=128,
                                                downsample=True)
    
    x = depthwise_sep_conv(x)
    # Codeblock 7 Output
    original       : torch.Size([1, 64, 112, 112])
    after dw conv  : torch.Size([1, 64, 56, 56])    #(1)
    after pw conv  : torch.Size([1, 128, 56, 56])   #(2)

    We can see in the above output that the spatial downsampling works properly, as the depthwise convolution layer successfully converted the 112×112 image to 56×56 (#(1)). The channel axis is then expanded to 128 with the help of the pointwise convolution layer (#(2)), making it ready to be fed into the subsequent layer.

    Based on the two tests demonstrated above, it is confirmed that our DepthwiseSeparableConv class is correct and thus ready to be used to construct the entire MobileNetV1 architecture.


    The Complete MobileNetV1 Architecture

    I wrap everything inside a class which I refer to as MobileNetV1. Since this class is quite long, I break it down into Codeblocks 8a and 8b. If you want to run this code yourself, just make sure that these two codeblocks are written within the same notebook cell.

    Now let's start with the __init__() method of this class. The first thing to do here is initialize the FirstConv layer we created earlier (#(1)). The next layers we need to initialize are the core idea of MobileNet, i.e., the depthwise separable convolutions, where every one of these layers consists of depthwise and pointwise convs. In this implementation I decided to name these pairs starting from depthwise_sep_conv0 all the way to depthwise_sep_conv8. If you go back to Figure 4, you'll notice that the downsampling process alternates with the non-downsampling layers. This can simply be implemented by setting the downsample flag to True for layers number 1, 3, 5, and 7. The depthwise_sep_conv6 is a bit special since it is actually not a standalone layer. Rather, it is a group of depthwise separable convolutions of the exact same specification repeated 5 times.

    # Codeblock 8a
    class MobileNetV1(nn.Module):
        def __init__(self):
            super().__init__()
            
            self.first_conv = FirstConv()    #(1)
            
            self.depthwise_sep_conv0 = DepthwiseSeparableConv(in_channels=32, 
                                                              out_channels=64)
            
            self.depthwise_sep_conv1 = DepthwiseSeparableConv(in_channels=64, 
                                                              out_channels=128, 
                                                              downsample=True)
            
            self.depthwise_sep_conv2 = DepthwiseSeparableConv(in_channels=128, 
                                                              out_channels=128)
            
            self.depthwise_sep_conv3 = DepthwiseSeparableConv(in_channels=128, 
                                                              out_channels=256, 
                                                              downsample=True)
            
            self.depthwise_sep_conv4 = DepthwiseSeparableConv(in_channels=256, 
                                                              out_channels=256)
            
            self.depthwise_sep_conv5 = DepthwiseSeparableConv(in_channels=256, 
                                                              out_channels=512, 
                                                              downsample=True)
            
            self.depthwise_sep_conv6 = nn.ModuleList(
                [DepthwiseSeparableConv(in_channels=512, out_channels=512) for _ in range(5)]
            )
            
            self.depthwise_sep_conv7 = DepthwiseSeparableConv(in_channels=512, 
                                                              out_channels=1024, 
                                                              downsample=True)
            
            self.depthwise_sep_conv8 = DepthwiseSeparableConv(in_channels=1024,  #(2)
                                                              out_channels=1024)
            
            num_out_channels = self.depthwise_sep_conv8.pwconv.out_channels      #(3)
            
            self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))      #(4)
            self.fc = nn.Linear(in_features=num_out_channels,           #(5)
                                out_features=NUM_CLASSES)
            self.softmax = nn.Softmax(dim=1)                            #(6)

    Having reached the last DepthwiseSeparableConv layer (#(2)), what we need to do next is initialize three more layers: an average pooling layer (#(4)), a fully-connected layer (#(5)), and a softmax activation function (#(6)). One thing you need to keep in mind is that the number of output channels produced by depthwise_sep_conv8 is not always 1024, even though it appears to be fixed to that number. In fact, this output channel count will be different if we change ALPHA. In order to make our implementation adaptive to such changes, we grab the actual number of output channels using the code at line #(3), which is then used as the input size of the fully-connected layer (#(5)).

    Regarding the forward() method in Codeblock 8b, I think there is not much to explain, since what we basically do here is just pass a tensor from one layer to the next.

    # Codeblock 8b
        def forward(self, x):
            x = self.first_conv(x)
            print(f"after first_conv\t\t: {x.shape}")
            
            x = self.depthwise_sep_conv0(x)
            print(f"after depthwise_sep_conv0\t: {x.shape}")
            
            x = self.depthwise_sep_conv1(x)
            print(f"after depthwise_sep_conv1\t: {x.shape}")
            
            x = self.depthwise_sep_conv2(x)
            print(f"after depthwise_sep_conv2\t: {x.shape}")
            
            x = self.depthwise_sep_conv3(x)
            print(f"after depthwise_sep_conv3\t: {x.shape}")
            
            x = self.depthwise_sep_conv4(x)
            print(f"after depthwise_sep_conv4\t: {x.shape}")
            
            x = self.depthwise_sep_conv5(x)
            print(f"after depthwise_sep_conv5\t: {x.shape}")
            
            for i, layer in enumerate(self.depthwise_sep_conv6):
                x = layer(x)
                print(f"after depthwise_sep_conv6 #{i}\t: {x.shape}")
            
            x = self.depthwise_sep_conv7(x)
            print(f"after depthwise_sep_conv7\t: {x.shape}")
            
            x = self.depthwise_sep_conv8(x)
            print(f"after depthwise_sep_conv8\t: {x.shape}")
            
            x = self.avgpool(x)
            print(f"after avgpool\t\t\t: {x.shape}")
            
            x = torch.flatten(x, start_dim=1)
            print(f"after flatten\t\t\t: {x.shape}")
            
            x = self.fc(x)
            print(f"after fc\t\t\t: {x.shape}")
            
            x = self.softmax(x)
            print(f"after softmax\t\t\t: {x.shape}")
            
            return x

    Now let's see whether our MobileNetV1 works properly by running the following test code.

    # Codeblock 9
    mobilenetv1 = MobileNetV1()
    x = torch.randn((BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE))
    
    out = mobilenetv1(x)

    And below is what the output looks like. Here we can see that our dummy image tensor successfully went through the first_conv layer all the way to the final output layer. During the convolution phase, the spatial dimension decreases as we get into the deeper layers while at the same time the number of channels increases. Afterwards, we apply an average pooling layer, which works by taking the average value from each channel. We can say that at this point every single channel of size 7×7 is now represented by a single value, which is exactly the reason the spatial dimension dropped to 1×1 (#(1)). This tensor is then flattened (#(2)) so that we can process it further with the fully-connected layer (#(3)).

    # Codeblock 9 Output
    after first_conv             : torch.Size([1, 32, 112, 112])
    after depthwise_sep_conv0    : torch.Size([1, 64, 112, 112])
    after depthwise_sep_conv1    : torch.Size([1, 128, 56, 56])
    after depthwise_sep_conv2    : torch.Size([1, 128, 56, 56])
    after depthwise_sep_conv3    : torch.Size([1, 256, 28, 28])
    after depthwise_sep_conv4    : torch.Size([1, 256, 28, 28])
    after depthwise_sep_conv5    : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv6 #0 : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv6 #1 : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv6 #2 : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv6 #3 : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv6 #4 : torch.Size([1, 512, 14, 14])
    after depthwise_sep_conv7    : torch.Size([1, 1024, 7, 7])
    after depthwise_sep_conv8    : torch.Size([1, 1024, 7, 7])
    after avgpool                : torch.Size([1, 1024, 1, 1])    #(1)
    after flatten                : torch.Size([1, 1024])          #(2)
    after fc                     : torch.Size([1, 1000])          #(3)
    after softmax                : torch.Size([1, 1000])
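    As a small aside (my own check, not from the article), the adaptive average pooling used here is just a per-channel mean over the spatial dimensions:

    import torch
    import torch.nn as nn
    
    pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
    x = torch.randn(1, 1024, 7, 7)
    # Each 7x7 channel collapses to its mean value.
    print(torch.allclose(pool(x), x.mean(dim=(2, 3), keepdim=True)))  # True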

    If you want an even more detailed view of the architecture, we can use the summary() function from torchinfo that we imported earlier. If you scroll down the resulting output below, you'll see that this model contains roughly 4.2 million trainable parameters, a number that matches the one written in Figures 5, 6, 7, and 8. I also tried to initialize the same model with different ALPHA values, and I found that the numbers match the table in Figure 6. For this reason, I believe our MobileNetV1 implementation is correct.

    # Codeblock 10
    mobilenetv1 = MobileNetV1()
    summary(mobilenetv1, input_size=(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE))
    # Codeblock 10 Output
    ==========================================================================================
    Layer (type:depth-idx)                   Output Shape              Param #
    ==========================================================================================
    MobileNetV1                              [1, 1000]                 --
    ├─FirstConv: 1-1                         [1, 32, 112, 112]         --
    │    └─Conv2d: 2-1                       [1, 32, 112, 112]         864
    │    └─BatchNorm2d: 2-2                  [1, 32, 112, 112]         64
    │    └─ReLU: 2-3                         [1, 32, 112, 112]         --
    ├─DepthwiseSeparableConv: 1-2            [1, 64, 112, 112]         --
    │    └─Conv2d: 2-4                       [1, 32, 112, 112]         288
    │    └─BatchNorm2d: 2-5                  [1, 32, 112, 112]         64
    │    └─ReLU: 2-6                         [1, 32, 112, 112]         --
    │    └─Conv2d: 2-7                       [1, 64, 112, 112]         2,048
    │    └─BatchNorm2d: 2-8                  [1, 64, 112, 112]         128
    │    └─ReLU: 2-9                         [1, 64, 112, 112]         --
    ├─DepthwiseSeparableConv: 1-3            [1, 128, 56, 56]          --
    │    └─Conv2d: 2-10                      [1, 64, 56, 56]           576
    │    └─BatchNorm2d: 2-11                 [1, 64, 56, 56]           128
    │    └─ReLU: 2-12                        [1, 64, 56, 56]           --
    │    └─Conv2d: 2-13                      [1, 128, 56, 56]          8,192
    │    └─BatchNorm2d: 2-14                 [1, 128, 56, 56]          256
    │    └─ReLU: 2-15                        [1, 128, 56, 56]          --
    ├─DepthwiseSeparableConv: 1-4            [1, 128, 56, 56]          --
    │    └─Conv2d: 2-16                      [1, 128, 56, 56]          1,152
    │    └─BatchNorm2d: 2-17                 [1, 128, 56, 56]          256
    │    └─ReLU: 2-18                        [1, 128, 56, 56]          --
    │    └─Conv2d: 2-19                      [1, 128, 56, 56]          16,384
    │    └─BatchNorm2d: 2-20                 [1, 128, 56, 56]          256
    │    └─ReLU: 2-21                        [1, 128, 56, 56]          --
    ├─DepthwiseSeparableConv: 1-5            [1, 256, 28, 28]          --
    │    └─Conv2d: 2-22                      [1, 128, 28, 28]          1,152
    │    └─BatchNorm2d: 2-23                 [1, 128, 28, 28]          256
    │    └─ReLU: 2-24                        [1, 128, 28, 28]          --
    │    └─Conv2d: 2-25                      [1, 256, 28, 28]          32,768
    │    └─BatchNorm2d: 2-26                 [1, 256, 28, 28]          512
    │    └─ReLU: 2-27                        [1, 256, 28, 28]          --
    ├─DepthwiseSeparableConv: 1-6            [1, 256, 28, 28]          --
    │    └─Conv2d: 2-28                      [1, 256, 28, 28]          2,304
    │    └─BatchNorm2d: 2-29                 [1, 256, 28, 28]          512
    │    └─ReLU: 2-30                        [1, 256, 28, 28]          --
    │    └─Conv2d: 2-31                      [1, 256, 28, 28]          65,536
    │    └─BatchNorm2d: 2-32                 [1, 256, 28, 28]          512
    │    └─ReLU: 2-33                        [1, 256, 28, 28]          --
    ├─DepthwiseSeparableConv: 1-7            [1, 512, 14, 14]          --
    │    └─Conv2d: 2-34                      [1, 256, 14, 14]          2,304
    │    └─BatchNorm2d: 2-35                 [1, 256, 14, 14]          512
    │    └─ReLU: 2-36                        [1, 256, 14, 14]          --
    │    └─Conv2d: 2-37                      [1, 512, 14, 14]          131,072
    │    └─BatchNorm2d: 2-38                 [1, 512, 14, 14]          1,024
    │    └─ReLU: 2-39                        [1, 512, 14, 14]          --
    ├─ModuleList: 1-8                        --                        --
    │    └─DepthwiseSeparableConv: 2-40      [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-1                  [1, 512, 14, 14]          4,608
    │    │    └─BatchNorm2d: 3-2             [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-3                    [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-4                  [1, 512, 14, 14]          262,144
    │    │    └─BatchNorm2d: 3-5             [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-6                    [1, 512, 14, 14]          --
    │    └─DepthwiseSeparableConv: 2-41      [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-7                  [1, 512, 14, 14]          4,608
    │    │    └─BatchNorm2d: 3-8             [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-9                    [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-10                 [1, 512, 14, 14]          262,144
    │    │    └─BatchNorm2d: 3-11            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-12                   [1, 512, 14, 14]          --
    │    └─DepthwiseSeparableConv: 2-42      [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-13                 [1, 512, 14, 14]          4,608
    │    │    └─BatchNorm2d: 3-14            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-15                   [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-16                 [1, 512, 14, 14]          262,144
    │    │    └─BatchNorm2d: 3-17            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-18                   [1, 512, 14, 14]          --
    │    └─DepthwiseSeparableConv: 2-43      [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-19                 [1, 512, 14, 14]          4,608
    │    │    └─BatchNorm2d: 3-20            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-21                   [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-22                 [1, 512, 14, 14]          262,144
    │    │    └─BatchNorm2d: 3-23            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-24                   [1, 512, 14, 14]          --
    │    └─DepthwiseSeparableConv: 2-44      [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-25                 [1, 512, 14, 14]          4,608
    │    │    └─BatchNorm2d: 3-26            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-27                   [1, 512, 14, 14]          --
    │    │    └─Conv2d: 3-28                 [1, 512, 14, 14]          262,144
    │    │    └─BatchNorm2d: 3-29            [1, 512, 14, 14]          1,024
    │    │    └─ReLU: 3-30                   [1, 512, 14, 14]          --
    ├─DepthwiseSeparableConv: 1-9            [1, 1024, 7, 7]           --
    │    └─Conv2d: 2-45                      [1, 512, 7, 7]            4,608
    │    └─BatchNorm2d: 2-46                 [1, 512, 7, 7]            1,024
    │    └─ReLU: 2-47                        [1, 512, 7, 7]            --
    │    └─Conv2d: 2-48                      [1, 1024, 7, 7]           524,288
    │    └─BatchNorm2d: 2-49                 [1, 1024, 7, 7]           2,048
    │    └─ReLU: 2-50                        [1, 1024, 7, 7]           --
    ├─DepthwiseSeparableConv: 1-10           [1, 1024, 7, 7]           --
    │    └─Conv2d: 2-51                      [1, 1024, 7, 7]           9,216
    │    └─BatchNorm2d: 2-52                 [1, 1024, 7, 7]           2,048
    │    └─ReLU: 2-53                        [1, 1024, 7, 7]           --
    │    └─Conv2d: 2-54                      [1, 1024, 7, 7]           1,048,576
    │    └─BatchNorm2d: 2-55                 [1, 1024, 7, 7]           2,048
    │    └─ReLU: 2-56                        [1, 1024, 7, 7]           --
    ├─AdaptiveAvgPool2d: 1-11                [1, 1024, 1, 1]           --
    ├─Linear: 1-12                           [1, 1000]                 1,025,000
    ├─Softmax: 1-13                          [1, 1000]                 --
    ==========================================================================================
    Total params: 4,231,976
    Trainable params: 4,231,976
    Non-trainable params: 0
    Total mult-adds (Units.MEGABYTES): 568.76
    ==========================================================================================
    Input size (MB): 0.60
    Forward/backward pass size (MB): 80.69
    Params size (MB): 16.93
    Estimated Total Size (MB): 98.22
    ==========================================================================================
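    If you want to reproduce the ALPHA comparison I mentioned, a rough sketch could look like the following. Note that this assumes it runs at the top level of the same notebook, since the classes above read the global ALPHA at construction time:

    # Rebuild the model under different width multipliers and count parameters.
    for a in (1.0, 0.75, 0.5, 0.25):
        ALPHA = a  # rebind the global that FirstConv and DepthwiseSeparableConv read
        model = MobileNetV1()
        print(f"alpha={a}: {sum(p.numel() for p in model.parameters()):,} params")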

    Ending

    That was pretty much everything about MobileNetV1. I encourage you to play around with the above model. If you want to train it for image classification, you can adjust the number of neurons in the output layer according to the number of classes in your dataset. You can also explore different α and ρ values to find the ones that best suit your case in terms of accuracy and efficiency. Furthermore, since this implementation is done entirely from scratch, it is also possible to change things that are not explicitly mentioned in the paper, such as the number of repeats of the depthwise_sep_conv6 layer, or even using α and ρ greater than 1. There are plenty of things to explore in our MobileNetV1 implementation! You can also access the code used in this article in my GitHub repository [3].

    Feel free to comment if you spot any mistakes in my explanation or the code. Thanks for reading!


    References

    [1] Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv. https://arxiv.org/abs/1704.04861 [Accessed April 7, 2025].

    [2] Image originally created by the author.

    [3] MuhammadArdiPutra. The Tiny Giant — MobileNetV1. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20Tiny%20Giant%20-%20MobileNetV1.ipynb [Accessed April 7, 2025].


