When we talk about attention in computer vision, the first thing that probably comes to mind is the one used in the Vision Transformer (ViT) architecture. In fact, that is not the only attention mechanism we have for image data. There is another one called the Squeeze-and-Excitation Network (SENet). While the attention in ViT operates spatially, i.e., assigning weights to different patches of an image, the attention mechanism proposed in SENet operates channel-wise, i.e., assigning weights to different channels. In this article, we are going to discuss how the Squeeze-and-Excitation architecture works, how to implement it from scratch, and how to integrate it into the ResNeXt model.
The Squeeze and Excitation Module
SENet, first proposed in the paper titled "Squeeze-and-Excitation Networks" by Hu et al. [1], is not a standalone network like VGG, Inception, or ResNet. Instead, it is a building block to be placed on top of an existing network. In CNN-based models, we assume that pixels spatially close to each other are highly correlated, which is the reason we employ small kernels to capture these correlations. This assumption is essentially the inductive bias of CNNs. On the other hand, SENet introduces a new inductive bias, where the authors assume that every image channel contributes differently to predicting a particular class. By applying SE modules to a CNN, the model not only relies on spatial patterns but also captures the importance of each channel. To illustrate this, think of an image of fire, where the red channel would theoretically make a higher contribution to the final prediction than the blue and green channels.
The structure of the SE module itself is shown in Figure 1. As the name of the network suggests, there are two main steps performed in this module: squeeze and excitation. The squeeze part corresponds to the operation denoted as F_sq, whereas the excitation part comprises both F_ex and F_scale. The F_tr operation, on the other hand, is not part of the SE module. Rather, it represents a transformation function that originally belongs to the model the SE module is applied to. For example, if we were to place this SE module on ResNet, the F_tr operation refers to the stack of convolution layers within the bottleneck block.
Speaking more specifically about the F_sq operation, it essentially works by applying global average pooling, which is used to capture information from the entire spatial extent of each channel. By doing so, every channel of the input tensor ends up being represented by a single number, which is simply the average value of the corresponding channel. The authors refer to this operation as global information embedding. Mathematically, it can be written as the equation shown in Figure 2, where we sum all values across the height H and width W before finally dividing by the number of pixels in that channel (H×W).
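To make the squeeze step concrete, here is a minimal PyTorch sketch (the tensor shape is just an example, not taken from the paper):

import torch

u = torch.randn(1, 512, 28, 28)   # example feature map of shape (B, C, H, W)
z = u.mean(dim=(2, 3))            # average over H and W: one number per channel
print(z.shape)                    # torch.Size([1, 512])

This is equivalent to applying nn.AdaptiveAvgPool2d(output_size=(1,1)) and flattening the result, which is exactly what we will do in the implementation later.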

Meanwhile, the excitation and scaling operations together are referred to as adaptive recalibration, since what they essentially do is dynamically adjust the weighting of each channel in the input tensor according to its importance. In fact, the diagram in Figure 1 does not completely depict the entire SENet architecture. You can see in the figure that F_ex appears to be a single operation, yet it actually consists of two linear layers, each followed by an activation function. See Figure 3 below for the details.

The two linear layers are denoted as W_1 and W_2, whereas δ and σ represent the ReLU and sigmoid activation functions, respectively. So, based on this mathematical definition, what we need to do later in the implementation is pass the tensor z (the average-pooled tensor) through the first linear layer, followed by the ReLU activation function, the second linear layer, and finally the sigmoid activation function. Remember that the sigmoid function squashes input values into the range of 0 to 1. In this case, we interpret the resulting output as the weight of each channel, where a value close to 1 indicates that the corresponding channel contains important information, hence we allow the model to pay more attention to that channel. Otherwise, if the resulting number is close to 0, it means that the corresponding channel does not contribute much to the output.
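Below is a minimal sketch of the excitation step, continuing from the squeezed tensor z above (the names W1 and W2 simply mirror the notation in Figure 3):

import torch
import torch.nn as nn

C, r = 512, 16
W1 = nn.Linear(C, C//r, bias=False)        # dimensionality reduction
W2 = nn.Linear(C//r, C, bias=False)        # projection back to C channels
s = torch.sigmoid(W2(torch.relu(W1(z))))   # channel weights, each between 0 and 1
print(s.shape)                             # torch.Size([1, 512])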
In order to make use of these channel weights, we perform the F_scale operation, which is simply a multiplication between the original tensor u and the weight tensor s, as shown in Figure 4 below. By doing this, we essentially retain the values in the important channels while at the same time suppressing the values of the unimportant ones.
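Continuing the same sketch, the scaling step is then just a broadcasted multiplication, assuming the u and s tensors from the two snippets above:

scaled = u * s[:, :, None, None]   # broadcast the per-channel weights over H and W
print(scaled.shape)                # torch.Size([1, 512, 28, 28]), same shape as u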

By the way, sorry for getting a bit too mathy here, lol. But I believe this will help you understand the code later in the implementation section.
Where to Put the SE Module
Applying the SE module to a plain CNN model like VGG is easy, as we can simply place it right after each convolution layer. However, it is not as straightforward in the case of Inception or ResNet because of the parallel branches present in these two networks. To address this, the authors provide a guide on how to attach the SE module to these two models, as shown in Figure 5 below.

For the Inception model, instead of inserting an SE module right after each convolution layer, we pass the input tensor through the entire Inception block (including all the branches inside) and then attach the SE module afterwards. The same approach also works for ResNet, but keep in mind that the summation between the tensor in the skip connection and the main flow happens only after the main tensor has been processed by the SE module. The sketch below makes this ordering concrete.
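Here is a minimal sketch of the two placements described above. The three modules are only nn.Identity() placeholders standing in for the real Inception branches, the ResNet convolutions (F_tr), and the SE module we will build later, so the point is purely the ordering of the operations:

import torch
import torch.nn as nn

inception_block = nn.Identity()     # placeholder for the whole Inception block
residual_branch = nn.Identity()     # placeholder for the ResNet convolutions (F_tr)
se_module       = nn.Identity()     # placeholder for the SE module

def se_inception_block(x):
    out = inception_block(x)        # all parallel branches, then concatenation
    return se_module(out)           # SE is attached after the entire block

def se_resnet_block(x):
    out = residual_branch(x)        # main flow of the residual block
    out = se_module(out)            # recalibration happens first...
    return torch.relu(out + x)      # ...then the skip-connection summation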
As I mentioned earlier, the excitation stage essentially consists of two linear layers. If we take a closer look at the structure above, we can see that the output shape of the first linear layer is 1×1×C/r. The variable r is called the reduction ratio, which reduces the dimensionality of the weight tensor before it is eventually projected back to 1×1×C by the second linear layer. The dimensionality reduction done by the first layer acts as a bottleneck, which helps limit model complexity and improve generalization. The authors ran experiments with different r values and found that r = 16 gives the best balance between accuracy and complexity.
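As a rough back-of-the-envelope sketch (ignoring bias terms, and with C = 512 chosen only as an example), one SE module adds C·(C/r) + (C/r)·C = 2C²/r parameters, so a larger r makes the module cheaper:

C = 512
for r in (4, 8, 16, 32):
    print(f'r={r:>2}: {2 * C * (C // r):,} extra parameters per SE module')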

In addition to the standard way of implementing the SE module in ResNet, Figure 6 shows that there are actually several placements we can follow. According to the experimental results in Figure 7, the standard SE, SE-PRE, and SE-Identity blocks achieved similar results, while all of them outperformed SE-POST by a significant margin. This indicates that the placement of the SE module does affect model accuracy. Based on these findings, the authors argue that we will obtain good results as long as we apply the SE module before the element-wise summation. Later in the coding section, I am going to demonstrate how to implement the standard SE block.

More Experimental Results
There are actually a lot more experimental results discussed in the paper. One of them is a table showing the accuracy improvements obtained when the SE module is applied to existing CNN-based models. The table I am referring to is displayed in Figure 8 below.

The columns highlighted in blue show the error rates of each model, and the ones in red indicate the computational complexity measured in GFLOPs. The re-implementation column refers to the plain model that the authors implemented themselves, while the SENet column represents the same model equipped with SE modules. The table clearly shows that both top-1 and top-5 errors decrease when the SE module is applied. It is important to note that although adding the SE module increases the GFLOPs, this increase is fairly marginal compared to the reduction in error rate.
Next, we can reveal some interesting insights by printing out the values produced by the SE modules during inference. Take a look at the charts in Figure 9 below to better illustrate this. The x axis of these charts denotes the channel index, the y axis represents the weight assigned to each channel according to its importance, and the color of the lines indicates the class being predicted.

In shallower layers, the features captured by the SE module are class-agnostic, which basically means they capture generic information required to predict all classes. The charts labeled (a) and (b), which correspond to the SE modules from ResNet stages 2 and 3, show that there is not much difference in channel activity from one class to another, indicating that these two modules do not capture information related to any specific class. The situation is different for the SE modules in deeper layers, i.e., the ones in stage 4 (c) and stage 5 (d). We can see that these two modules adjust the channel weights differently depending on the class being predicted, which is essentially why the SE modules in deeper layers are said to be class-specific. However, the authors acknowledge that unusual behavior can occur in some of the SE modules, as happens in the 2nd block of stage 5 (e). Here the SE module does not show meaningful channel recalibration, indicating that it does not contribute as much as the ones we discussed earlier.
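If you want to reproduce this kind of analysis, one possible approach is to record the channel weights with a forward hook during inference. The sketch below assumes the SEModule and SEResNeXt classes we are going to build later in this article, and the chosen block is only an example:

import torch

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()   # output of the sigmoid = channel weights
    return hook

model = SEResNeXt()
model.resnext_conv3[0].semodule.sigmoid.register_forward_hook(make_hook('stage3_block0'))

model.eval()
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

print(activations['stage3_block0'].shape)     # torch.Size([1, 512])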
The Detailed Architecture
On this article we’re going to implement the SE-ResNeXt-50 (32×4d) mannequin, which in Determine 10 it corresponds to the one within the rightmost column. The ResNeXt mannequin itself is much like ResNet, besides that the group parameter of the second convolution layer inside every block is ready to 32. If you happen to’re conversant in ResNeXt, that is primarily the best but efficient method to implement the so-called cardinality. I like to recommend you learn my earlier article about ResNeXt if you’re not but conversant in it, which the hyperlink is supplied at reference quantity [3] on the finish of this text.
Taking a closer look at the architecture, what differentiates SE-ResNet-50 from ResNet-50 is only the presence of SE modules. The same also applies to SE-ResNeXt-50 (32×4d) compared to ResNeXt-50 (32×4d) (not displayed in the table). Notice in the figure below that the models with SE modules have an fc layer attached after the last convolution layer inside each block, where the corresponding two numbers indicate the sizes of the first and second fully-connected layers inside the SE module.

From Scratch Implementation
Remember that here we are about to integrate the SE module into ResNeXt, so we need to implement both of them from scratch. Technically speaking, it is possible to take the ResNeXt architecture directly from PyTorch and then manually attach the SE module to it. However, here I decided to use the ResNeXt implementation from my previous article instead, since I feel it is much easier to understand than the one from PyTorch. Note that I will focus on constructing the SE module and attaching it to the ResNeXt model rather than explaining ResNeXt itself, since I have already covered that in the previous article [3].
Now let’s begin the code by importing the required modules.
# Codeblock 1
import torch
import torch.nn as nn
Squeeze and Excitation Module
The following SE module implementation follows the diagram shown in Figure 5 (right). It is worth noting that the SEModule class below does not include the skip-connection (the curved arrow), since the entire SE module is applied after the initial branching but before the merging (summation).
The __init__() method of this class accepts two parameters: num_channels and r, as shown at line #(1) in Codeblock 2a. We definitely want this SE module to be usable throughout the entire network, so the num_channels parameter needs to be adjustable because the number of output channels varies across ResNeXt blocks at different stages, as shown back in Figure 10. Meanwhile, even though we typically use the same reduction ratio r in the SE modules across the whole network, it is technically possible to use a different r for each stage, which could be an interesting thing to experiment with. This is essentially why I also made the r parameter adjustable.
# Codeblock 2a
class SEModule(nn.Module):
    def __init__(self, num_channels, r):    #(1)
        super().__init__()

        self.global_pooling = nn.AdaptiveAvgPool2d(output_size=(1,1))    #(2)
        self.fc0 = nn.Linear(in_features=num_channels,    #(3)
                             out_features=num_channels//r,
                             bias=False)
        self.relu = nn.ReLU()    #(4)
        self.fc1 = nn.Linear(in_features=num_channels//r,    #(5)
                             out_features=num_channels,
                             bias=False)
        self.sigmoid = nn.Sigmoid()    #(6)
There are five layers we need to initialize inside the __init__() method. I write them down according to the sequence given in Figure 5, i.e., the global average pooling layer (#(2)), a linear layer (#(3)), the ReLU activation function (#(4)), another linear layer (#(5)), and the sigmoid activation function (#(6)). Here you can see that the first linear layer is responsible for the dimensionality reduction, shrinking the number of channels from num_channels to num_channels//r, which is then expanded back to num_channels by the second linear layer. Note that we set the bias term of both linear layers to False, which essentially means we only utilize the weight tensors. The absence of bias terms in the two layers forces the SE module to learn the correlations between channels rather than just adding fixed adjustments.
Still inside the SEModule class, let's now move on to the forward() method to define the flow of the network. You can see at line #(1) in Codeblock 2b that we start from a single input x, which in the case of ResNeXt is the tensor produced by the third convolution layer within the same ResNeXt block. As shown in Figure 5, what we need to do next is branch out the network. Here we directly process the branch with the global_pooling layer, and I name the resulting tensor squeezed (#(2)). The original input tensor x itself is left as is, since we are not going to perform any operation on it until the scaling phase. Next, we drop the spatial dimensions of the squeezed tensor using torch.flatten() (#(3)). This is done because we want to process it further with the linear layers at lines #(4) and #(5), which expect a tensor of shape (batch, channels). The spatial dimensions are then introduced again at line #(6), allowing us to perform the multiplication between x (the original tensor) and excited (the channel weights) at line #(7). This whole process produces a recalibrated version of x, which we refer to as scaled. Here I print out the tensor size after each step so that you can better understand the flow of this SE module.
# Codeblock 2b
    def forward(self, x):    #(1)
        print(f'original\t\t: {x.size()}')
        squeezed = self.global_pooling(x)    #(2)
        print(f'after avgpool\t\t: {squeezed.size()}')
        squeezed = torch.flatten(squeezed, 1)    #(3)
        print(f'after flatten\t\t: {squeezed.size()}')
        excited = self.relu(self.fc0(squeezed))    #(4)
        print(f'after fc0-relu\t\t: {excited.size()}')
        excited = self.sigmoid(self.fc1(excited))    #(5)
        print(f'after fc1-sigmoid\t: {excited.size()}')
        excited = excited[:, :, None, None]    #(6)
        print(f'after reshape\t\t: {excited.size()}')
        scaled = x * excited    #(7)
        print(f'after scaling\t\t: {scaled.size()}')
        return scaled
Now we’re going to see if we now have carried out the community appropriately by passing a dummy tensor by way of it. In Codeblock 3 beneath, I initialize an SE module and configure it to simply accept a picture tensor of 512 channels and has a discount ratio of 16 (#(1)
). If you happen to check out the SE-ResNeXt structure in Determine 10, this SE module mainly corresponds to the one within the third stage (which the output dimension is 28×28). Thus, at line #(2)
we have to alter the form of the dummy tensor accordingly. We then feed this tensor into the community utilizing the code at line #(3)
.
# Codeblock 3
semodule = SEModule(num_channels=512, r=16) #(1)
x = torch.randn(1, 512, 28, 28) #(2)
out = semodule(x) #(3)
And below is what the print functions give us.
# Codeblock 3 Output
original          : torch.Size([1, 512, 28, 28])    #(1)
after avgpool     : torch.Size([1, 512, 1, 1])      #(2)
after flatten     : torch.Size([1, 512])            #(3)
after fc0-relu    : torch.Size([1, 32])             #(4)
after fc1-sigmoid : torch.Size([1, 512])            #(5)
after reshape     : torch.Size([1, 512, 1, 1])      #(6)
after scaling     : torch.Size([1, 512, 28, 28])    #(7)
You can see that the original tensor shape matches our dummy tensor exactly, i.e., 1×512×28×28 (#(1)). By the way, we can ignore the 1 in the 0th axis since it simply denotes the batch dimension; in this case I assume we only have a single image in the batch. After being pooled, the spatial dimension collapses to 1×1, since every channel is now represented by a single number (#(2)). The purpose of the flatten operation I explained earlier is to drop the two singleton axes (#(3)) so that the subsequent linear layers can operate on a tensor of shape (batch, channels). Here you can see that the first linear layer reduces the tensor dimension to 32 thanks to the reduction ratio we previously set to 16 (#(4)). The length of this tensor is then expanded back to 512 by the second linear layer (#(5)). Next, we unsqueeze the tensor so that we get our 1×1 spatial dimensions back (#(6)), allowing us to multiply it with the input tensor (#(7)). Based on this detailed flow, you can see that an SE module preserves the original tensor dimensions, proving that this module can be attached to any CNN-based model without disrupting the original flow of the network.
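As a quick illustration of that last point, below is a hypothetical VGG-style usage where the SE module is simply dropped in after a convolution layer (the channel numbers are arbitrary, and the module will also print its internal tensor sizes):

plain_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    SEModule(num_channels=128, r=16),   # output shape is left untouched
)
out = plain_block(torch.randn(1, 64, 56, 56))
print(out.size())                       # torch.Size([1, 128, 56, 56])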
ResNeXt
Now that we understand how to implement the SE module from scratch, let me show you how to attach it to a ResNeXt model. Before doing so, we need to initialize the parameters required to build the ResNeXt architecture. In Codeblock 4 below, the first four variables are determined according to the ResNeXt-50 (32×4d) variant, whereas the last one (R) represents the reduction ratio for the SE modules.
# Codeblock 4
CARDINALITY = 32
NUM_CHANNELS = [3, 64, 256, 512, 1024, 2048]
NUM_BLOCKS = [3, 4, 6, 3]
NUM_CLASSES = 1000
R = 16
The Block class defined in Codeblocks 5a and 5b is the ResNeXt block from my previous article. There is actually quite a lot going on inside the __init__() method, but the general idea is that we initialize three convolution layers called conv0 (#(1)), conv1 (#(2)), and conv2 (#(3)) before initializing the SE module at line #(4). We will later configure these layers according to the SE-ResNeXt architecture shown back in Figure 10.
# Codeblock 5a
class Block(nn.Module):
    def __init__(self,
                 in_channels,
                 add_channel=False,
                 channel_multiplier=2,
                 downsample=False):
        super().__init__()

        self.add_channel = add_channel
        self.channel_multiplier = channel_multiplier
        self.downsample = downsample

        if self.add_channel:
            out_channels = in_channels*self.channel_multiplier
        else:
            out_channels = in_channels

        mid_channels = out_channels//2

        if self.downsample:
            stride = 2
        else:
            stride = 1

        if self.add_channel or self.downsample:
            self.projection = nn.Conv2d(in_channels=in_channels,
                                        out_channels=out_channels,
                                        kernel_size=1,
                                        stride=stride,
                                        padding=0,
                                        bias=False)
            nn.init.kaiming_normal_(self.projection.weight, nonlinearity='relu')
            self.bn_proj = nn.BatchNorm2d(num_features=out_channels)

        self.conv0 = nn.Conv2d(in_channels=in_channels,    #(1)
                               out_channels=mid_channels,
                               kernel_size=1,
                               stride=1,
                               padding=0,
                               bias=False)
        nn.init.kaiming_normal_(self.conv0.weight, nonlinearity='relu')
        self.bn0 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv1 = nn.Conv2d(in_channels=mid_channels,    #(2)
                               out_channels=mid_channels,
                               kernel_size=3,
                               stride=stride,
                               padding=1,
                               bias=False,
                               groups=CARDINALITY)
        nn.init.kaiming_normal_(self.conv1.weight, nonlinearity='relu')
        self.bn1 = nn.BatchNorm2d(num_features=mid_channels)

        self.conv2 = nn.Conv2d(in_channels=mid_channels,    #(3)
                               out_channels=out_channels,
                               kernel_size=1,
                               stride=1,
                               padding=0,
                               bias=False)
        nn.init.kaiming_normal_(self.conv2.weight, nonlinearity='relu')
        self.bn2 = nn.BatchNorm2d(num_features=out_channels)

        self.relu = nn.ReLU()
        self.semodule = SEModule(num_channels=out_channels, r=R)    #(4)
The forward() method itself is also generally the same as in the original ResNeXt model, except that here we need to place the SE module right before the element-wise summation, as shown at line #(1) in Codeblock 5b below. Remember that this implementation follows the standard SE block structure in Figure 6 (b).
# Codeblock 5b
    def forward(self, x):
        print(f'original\t\t: {x.size()}')

        if self.add_channel or self.downsample:
            residual = self.bn_proj(self.projection(x))
            print(f'after projection\t: {residual.size()}')
        else:
            residual = x
            print(f'no projection\t\t: {residual.size()}')

        x = self.conv0(x)
        x = self.bn0(x)
        x = self.relu(x)
        print(f'after conv0-bn0-relu\t: {x.size()}')

        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        print(f'after conv1-bn1-relu\t: {x.size()}')

        x = self.conv2(x)
        x = self.bn2(x)
        print(f'after conv2-bn2\t\t: {x.size()}')

        x = self.semodule(x)    #(1)
        print(f'after semodule\t\t: {x.size()}')

        x = x + residual
        x = self.relu(x)
        print(f'after summation\t\t: {x.size()}')

        return x
With the above implementation, every time we instantiate a Block object we will have a ResNeXt block that is already equipped with an SE module. Now we are going to test the class above to see if we have implemented it correctly. Here I am going to simulate a ResNeXt block within the third stage. The add_channel and downsample parameters are set to False since we want to preserve both the number of channels and the spatial dimensions of the input tensor.
# Codeblock 6
block = Block(in_channels=512, add_channel=False, downsample=False)
x = torch.randn(1, 512, 28, 28)
out = block(x)
Below is what the output looks like. Here you can see that our first convolution layer successfully reduced the number of channels from 512 to 256 (#(1)), which is then expanded back to its original size by the third convolution layer (#(2)). Afterwards, the tensor goes through the SE block, whose output size is the same as its input, just like what we saw earlier in Codeblock 3 (#(3)). Once the processing with the SE module is done, we can finally perform the element-wise summation between the tensor from the main branch and the one from the skip-connection (#(4)).
# Codeblock 6 Output
original             : torch.Size([1, 512, 28, 28])
no projection        : torch.Size([1, 512, 28, 28])
after conv0-bn0-relu : torch.Size([1, 256, 28, 28])    #(1)
after conv1-bn1-relu : torch.Size([1, 256, 28, 28])
after conv2-bn2      : torch.Size([1, 512, 28, 28])    #(2)
after semodule       : torch.Size([1, 512, 28, 28])    #(3)
after summation      : torch.Size([1, 512, 28, 28])    #(4)
And below is how I implement the entire architecture. What we essentially need to do is just stack multiple SE-ResNeXt blocks according to the architecture in Figure 10. In fact, the SEResNeXt class in Codeblock 7 is exactly the same as the ResNeXt class in my previous article [3] (I literally copy-pasted it), since what makes SE-ResNeXt different from the original ResNeXt is only the presence of the SE module within the Block class we discussed earlier.
# Codeblock 7
class SEResNeXt(nn.Module):
    def __init__(self):
        super().__init__()

        # conv1 stage
        self.resnext_conv1 = nn.Conv2d(in_channels=NUM_CHANNELS[0],
                                       out_channels=NUM_CHANNELS[1],
                                       kernel_size=7,
                                       stride=2,
                                       padding=3,
                                       bias=False)
        nn.init.kaiming_normal_(self.resnext_conv1.weight,
                                nonlinearity='relu')
        self.resnext_bn1 = nn.BatchNorm2d(num_features=NUM_CHANNELS[1])
        self.relu = nn.ReLU()
        self.resnext_maxpool1 = nn.MaxPool2d(kernel_size=3,
                                             stride=2,
                                             padding=1)

        # conv2 stage
        self.resnext_conv2 = nn.ModuleList([
            Block(in_channels=NUM_CHANNELS[1],
                  add_channel=True,
                  channel_multiplier=4,
                  downsample=False)
        ])
        for _ in range(NUM_BLOCKS[0]-1):
            self.resnext_conv2.append(Block(in_channels=NUM_CHANNELS[2]))

        # conv3 stage
        self.resnext_conv3 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[2],
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[1]-1):
            self.resnext_conv3.append(Block(in_channels=NUM_CHANNELS[3]))

        # conv4 stage
        self.resnext_conv4 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[3],
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[2]-1):
            self.resnext_conv4.append(Block(in_channels=NUM_CHANNELS[4]))

        # conv5 stage
        self.resnext_conv5 = nn.ModuleList([Block(in_channels=NUM_CHANNELS[4],
                                                  add_channel=True,
                                                  downsample=True)])
        for _ in range(NUM_BLOCKS[3]-1):
            self.resnext_conv5.append(Block(in_channels=NUM_CHANNELS[5]))

        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1,1))
        self.fc = nn.Linear(in_features=NUM_CHANNELS[5],
                            out_features=NUM_CLASSES)

    def forward(self, x):
        print(f'original\t\t: {x.size()}')

        x = self.relu(self.resnext_bn1(self.resnext_conv1(x)))
        print(f'after resnext_conv1\t: {x.size()}')

        x = self.resnext_maxpool1(x)
        print(f'after resnext_maxpool1\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv2):
            x = block(x)
            print(f'after resnext_conv2 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv3):
            x = block(x)
            print(f'after resnext_conv3 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv4):
            x = block(x)
            print(f'after resnext_conv4 #{i}\t: {x.size()}')

        for i, block in enumerate(self.resnext_conv5):
            x = block(x)
            print(f'after resnext_conv5 #{i}\t: {x.size()}')

        x = self.avgpool(x)
        print(f'after avgpool\t\t: {x.size()}')

        x = torch.flatten(x, start_dim=1)
        print(f'after flatten\t\t: {x.size()}')

        x = self.fc(x)
        print(f'after fc\t\t: {x.size()}')

        return x
Now that the entire SE-ResNeXt-50 (32×4d) architecture is complete, we are going to test it by passing a tensor of size 1×3×224×224 through the network, simulating a single RGB image of size 224×224. You can see in the output of Codeblock 8 below that the model appears to work properly, as the tensor successfully passed through all layers of the seresnext model without raising any error. Thus, I believe this model is now ready to be trained. By the way, don't forget to change the number of neurons in the output layer according to the number of classes in your dataset if you actually want to train this model; see the short sketch after the output below.
# Codeblock 8
seresnext = SEResNeXt()
x = torch.randn(1, 3, 224, 224)
out = seresnext(x)
# Codeblock 8 Output
original                : torch.Size([1, 3, 224, 224])
after resnext_conv1     : torch.Size([1, 64, 112, 112])
after resnext_maxpool1  : torch.Size([1, 64, 56, 56])
after resnext_conv2 #0  : torch.Size([1, 256, 56, 56])
after resnext_conv2 #1  : torch.Size([1, 256, 56, 56])
after resnext_conv2 #2  : torch.Size([1, 256, 56, 56])
after resnext_conv3 #0  : torch.Size([1, 512, 28, 28])
after resnext_conv3 #1  : torch.Size([1, 512, 28, 28])
after resnext_conv3 #2  : torch.Size([1, 512, 28, 28])
after resnext_conv3 #3  : torch.Size([1, 512, 28, 28])
after resnext_conv4 #0  : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #1  : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #2  : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #3  : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #4  : torch.Size([1, 1024, 14, 14])
after resnext_conv4 #5  : torch.Size([1, 1024, 14, 14])
after resnext_conv5 #0  : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #1  : torch.Size([1, 2048, 7, 7])
after resnext_conv5 #2  : torch.Size([1, 2048, 7, 7])
after avgpool           : torch.Size([1, 2048, 1, 1])
after flatten           : torch.Size([1, 2048])
after fc                : torch.Size([1, 1000])
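As a follow-up to the note above, here is a minimal sketch of how the classification head could be swapped for your own dataset; the 10 classes are just an illustrative number:

NUM_CLASSES_CUSTOM = 10   # hypothetical number of classes in your dataset
seresnext.fc = nn.Linear(in_features=NUM_CHANNELS[5],
                         out_features=NUM_CLASSES_CUSTOM)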
Additionally, we can print out the number of parameters this model has using the following code. Here you can see that the codeblock returns 27,543,848. This is slightly higher than the parameter count of the original ResNeXt counterpart, which only has 25,028,904 parameters, as mentioned in my previous article as well as in the official PyTorch documentation [4]. Such an increase in model size definitely makes sense, since the ResNeXt blocks throughout the entire network now contain extra layers due to the presence of the SE modules.
# Codeblock 9
def count_parameters(model):
    # sum the element counts of all parameter tensors in the model
    return sum([params.numel() for params in model.parameters()])

count_parameters(seresnext)
# Codeblock 9 Output
27543848
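If you want to verify the comparison yourself, a possible sketch (assuming torchvision is installed) is to count the parameters of the plain torchvision ResNeXt-50 (32×4d) and take the difference:

from torchvision.models import resnext50_32x4d

plain = resnext50_32x4d()                                     # no pretrained weights needed
print(count_parameters(plain))                                # 25,028,904
print(count_parameters(seresnext) - count_parameters(plain))  # the extra parameters from the SE modules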
Ending
And that’s just about every thing in regards to the Squeeze and Excitation module. I do encourage you to discover from right here by coaching this mannequin by yourself dataset in order that you will note whether or not the findings introduced within the paper additionally apply to your case. Not solely that, I feel it might even be attention-grabbing if you happen to attempt to implement SE module on different neural community architectures like VGG or Inception by your self.
I hope you learned something new today. Thanks for reading!
By the way, you can also find the code used in this article in my GitHub repo [5].
[1] Jie Hu et al. Squeeze-and-Excitation Networks. arXiv. https://arxiv.org/abs/1709.01507 [Accessed March 17, 2025].
[2] Images originally created by the author.
[3] Taking ResNet to the Next Level. Towards Data Science. https://towardsdatascience.com/taking-resnet-to-the-next-level/ [Accessed July 22, 2025].
[4] Resnext50_32x4d. PyTorch. https://pytorch.org/vision/main/models/generated/torchvision.models.resnext50_32x4d.html#torchvision.models.resnext50_32x4d [Accessed March 17, 2025].
[5] MuhammadArdiPutra. The Channel-Wise Attention - Squeeze and Excitation. GitHub. https://github.com/MuhammadArdiPutra/medium_articles/blob/main/The%20Channel-Wise%20Attention%20-%20Squeeze%20and%20Excitation.ipynb [Accessed April 7, 2025].