Deep learning is shaping our world as we speak. In fact, it has been quietly revolutionizing software since the early 2010s. In 2025, PyTorch is at the forefront of this revolution, standing as one of the most important libraries for training neural networks.
Whether you're working with computer vision, building large language models (LLMs), training a reinforcement learning agent, or experimenting with graph neural networks – your path is going to pass through PyTorch once you enter deep learning city.
All images in this article were produced by the author.
This guide provides a whirlwind tour of PyTorch's methodology and design principles. Over the next hour, we will cut through the noise and get straight to the heart of how neural networks are actually trained.
This article is about PyTorch's foundational concepts and how to compose and train models, from simple linear regression all the way to a modern transformer block.
More important than the specific code examples presented here, the goal of this article is to teach the main ideas, project-level architectures, and abstractions you need to work with PyTorch.
In other words, think "the PyTorch way".
Before we get that far, we need to understand the basics. PyTorch is built on two core abstractions: tensors and automatic differentiation. Master these two (how tensors store data, and how gradients are used to train neural networks) and the rest of PyTorch will feel natural. Let's discuss tensors first.
1. Fundamentals of Tensors
A tensor is a multidimensional array with a dtype, a device, and optional gradient tracking. If you know NumPy arrays, think of tensors as NumPy arrays with a few major advantages:
- GPU support: Tensors can perform massively parallel operations on the GPU. Matrix multiplications, additions, and even conditional operations are all supported.
- Computation graph: Instead of imagining a tensor as an isolated block of data, think of it as a node in a computation graph (more on this below).
- Automatic differentiation: PyTorch automatically calculates partial derivatives of every differentiable operation it performs. We'll discuss what this actually means, and why it is a big deal for training neural networks, very shortly.
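To make these points concrete, here is a minimal sketch (the values are arbitrary) of creating a tensor, moving it between devices, and opting into gradient tracking:
import torch

# A 2x3 tensor of 32-bit floats, living on the CPU by default
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=torch.float32)
print(a.shape, a.dtype, a.device)  # torch.Size([2, 3]) torch.float32 cpu

# Tensors move between devices with .to()
device = "cuda" if torch.cuda.is_available() else "cpu"
a = a.to(device)

# requires_grad=True turns the tensor into a tracked node of the computation graph
b = torch.rand(2, 3, device=device, requires_grad=True)
c = (a * b).sum()  # c "remembers" the operations that produced it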
2. Automatic Differentiation (Autograd)
Neural networks in PyTorch assemble a dynamic computation graph and use it to compute gradients automatically. Let's look at a simple example to see how this works.
Let us begin with a clean, scalar example so that shapes and values are easy to reason about. The following code computes z = x^2 + y^3 for scalar x and y, then calls backward to obtain dz/dx and dz/dy.
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
# Forward pass: compute z = x^2 + y^3
z = x**2 + y**3
# Backward pass: compute gradients
z.backward()
dz_dx = x.grad # partial derivative of z w.r.t. x
dz_dy = y.grad # partial derivative of z w.r.t. y
What's going on:
- We created two tensors, x and y, with requires_grad=True. This tells autograd to track their operations.
- The forward computation constructs a small graph for z.
- z.backward() triggers reverse-mode autodiff: PyTorch computes the gradients and places them in x.grad and y.grad.
After running the block above, x.grad holds 4.0 and y.grad holds 27.0.
If you did some mental math, the partial derivatives of that equation calculated analytically are dz/dx = 2x = 4 and dz/dy = 3y^2 = 27, which matches what autograd computed (spoiler: it works!).
Chain Rule
The chain rule in calculus is a fundamental formula for differentiating composite functions, which are essentially functions inside other functions. In simpler terms, you work from the outside in, taking the derivative of each "layer" of the function and multiplying them together.
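In symbols, for a composite function f(g(x)), the chain rule reads:
d/dx [ f(g(x)) ] = f'(g(x)) * g'(x)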
Let's take a simple example of how the chain rule works in PyTorch. Say you have the following three-step equation:
Eqn 1: y = x^2
Eqn 2: z = y + 1
Eqn 3: w = z^2
Basically, w depends on z, z depends on y, and y depends on x. A basic chain of compositionality. Now say you want to find the derivative of w with respect to x.
The chain rule states that to find dw/dx, we calculate the gradients up the chain of dependencies and multiply them. So: dw/dx = dw/dz * dz/dy * dy/dx
Let’s see how PyTorch does this:
# requires_grad=True tells PyTorch to compute the gradients for this tensor
x = torch.tensor(2.0, requires_grad=True)
# Define the forward pass
y = x**2
z = y + 1
w = z**2
# Calculate the gradient
w.backward()
# print the gradient
print(x.grad) # 40
And that's it! It just works.
What's even more special is that instead of defining x as a scalar like we did above, we can also define it as a multi-dimensional tensor.
Here's what happens when we change the first line from initializing a scalar with torch.tensor(2.0) to a 1-D tensor with torch.tensor([-1.0, 2.0]).
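Here is a minimal sketch of that vectorized version (note that w is no longer a scalar, so we reduce it with .sum() before calling backward()):
x = torch.tensor([-1.0, 2.0], requires_grad=True)  # a 1-D tensor instead of a scalar

y = x**2
z = y + 1
w = z**2

# w is now a vector, so reduce it to a scalar before calling backward()
w.sum().backward()

print(x.grad)  # tensor([-8., 40.])  -> dw/dx = 4x(x^2 + 1), evaluated element-wise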
This is what makes PyTorch so cool. You can compute gradients for multiple elements simultaneously (in parallel), just like that.
When working on deep learning projects, our inputs are usually multi-dimensional, so PyTorch does a lot of heavy lifting in the background by parallelizing the gradient computation!
The PyTorch formula
As seen in the earlier example, the PyTorch game plan is pretty simple:
- Define the "forward pass" of your equation, i.e., how your dependent variable is derived from your independent variables.
- PyTorch automatically computes the backward propagation (provided your equations are differentiable).
3. Training models
Now that we understand the basics of automatic differentiation, let's see how linear regression works in PyTorch. The code below constructs a small housing-style dataset with two features (area and age), normalizes the columns to the range [-1, 1], and sets us up for some good old-fashioned linear regression.
import pandas as pd

df = pd.DataFrame(
    {
        "area": [120, 180, 150, 210, 105],
        "age": [5, 2, 1, 2, 1],
        "price": [30, 90, 100, 180, 85]
    }
)
df = normalize(df)
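The normalize helper is not shown above, so here is one plausible sketch of it (an assumption on my part) that min-max scales every column to the range [-1, 1]:
def normalize(df):
    # Hypothetical helper: min-max scale each column to [-1, 1]
    return 2 * (df - df.min()) / (df.max() - df.min()) - 1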
To do anything with PyTorch, we must first move the data into tensors! Notice how the data tensors X and Y don't require gradients because they are constants (i.e., they don't change during training).
The weights W and B are trainable, though. We will update them to fit our dataset. To make them trainable via backpropagation, we need to set requires_grad=True in their declarations.
Take a look at the code below:
# Note that these are constants; we are not going to update them
X = torch.tensor(df[["area", "age"]].values, dtype=torch.float32)
Y = torch.tensor(df[["price"]].values, dtype=torch.float32)
# These have requires_grad=True, so they are trainable weights
W = torch.rand(size=(2, 1), requires_grad=True)
B = torch.rand(1, requires_grad=True)
Next, let's generate a prediction! The forward pass uses the idiomatic matrix multiplication and addition, i.e. X @ W + B.
# Generate a prediction
pred = X @ W + B
The @ operator performs matrix multiplication between X and W. The X @ W + B model applies a "linear transformation" to X: X has shape (5, 2), W has shape (2, 1), so X @ W has shape (5, 1), and B broadcasts across all five rows. Our goal is to tune the trainable weights W and B so that the prediction gets closer to our target ground truth.
Next, we calculate the error as the mean squared error loss. It measures the distance between our current prediction and the ground truth. If we call loss.backward(), we also get the gradients of the trainable variables in the graph (i.e., W and B).
loss = ((Y - pred) ** 2).mean() # Mean squared error loss
loss.backward()
dW = W.grad # Tells us "how much W should change to reduce the loss"
dB = B.grad # and "how much B should change to reduce the loss"
dW and dB are the gradients of the loss with respect to W and B. We can apply "gradient descent" to nudge these trainable parameters in the direction that reduces the loss (i.e., against the gradient).
lr = 0.2 # Learning rate: how big a step we take when updating the weights
with torch.no_grad():
    W -= lr * dW # Gradient descent update (in-place, so W keeps requires_grad=True)
    B -= lr * dB # Gradient descent update for the bias
Understanding linear regression, loss calculation, and gradient descent are some of the pillars of machine learning, and by extension, deep learning. While updating the weights manually by subtracting the gradients is possible, it is impractical for deep neural networks with many layers of weights. If only there were a way to automatically update the weights without all this bookkeeping!
Side note
The above process of taking small steps in the optimization space to iteratively learn the weights is called gradient descent. Note that there are better ways to learn the optimal W and B for small datasets, like the normal equation, which gives an analytical solution that requires no steps or iteration. It is, however, computationally expensive for large datasets. For large datasets, the standard approach is to divide the data into minibatches and apply gradient descent on each batch separately; this technique is called stochastic gradient descent (SGD).
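For reference, the normal equation gives the closed-form least-squares solution (assuming X carries an extra column of ones for the bias term):
W = (X^T X)^(-1) X^T Y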
Optimizers
PyTorch optimizers are algorithms (like SGD, Adam, or RMSprop) that adjust the model's weights and biases based on the computed gradients to minimize the loss function.
Let's see how the above linear regression code looks if we replace the manual weight updates with a PyTorch optimizer.
from torch.optim import SGD
...
W = torch.rand(size=(2, 1), requires_grad=True)
B = torch.rand(1, requires_grad=True)
optimizer = SGD(params=[W, B], lr=0.1)
for step in range(10):
    pred = X @ W + B                # Forward pass
    loss = ((Y - pred) ** 2).mean() # Calculate loss
    loss.backward()                 # Calculate gradients
    optimizer.step()                # Update W and B according to the gradients
    optimizer.zero_grad()           # Reset all gradients
The core loop for training models in PyTorch looks like this:
- Forward pass to compute pred.
- Calculate loss by measuring the error between the prediction (pred) and the ground truth (Y).
- Backward pass with loss.backward() to populate W.grad and B.grad.
- Step with optimizer.step() to update the parameters.
- Zero the gradients with optimizer.zero_grad() to avoid accumulation.
SGD is a solid baseline for linear regression. As you scale up or face noisier gradients, adaptive optimizers can help. This is where PyTorch's suite of open-source optimizers comes into play, including adaptive optimizers like Adam that use techniques such as momentum and per-parameter learning rates to achieve faster and more stable convergence on these tricky tasks. Here is a flashcard comparing various popular ones:
[Figure: flashcard comparing popular PyTorch optimizers.]
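Swapping in an adaptive optimizer is a one-line change. For example, reusing the W and B from above (a sketch; the learning rate here is arbitrary):
from torch.optim import Adam

optimizer = Adam(params=[W, B], lr=0.01) # drop-in replacement for SGD in the training loop above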
It's not just optimizers; Torch also provides several different loss functions! Here are some examples:
[Figure: flashcard of common PyTorch loss functions.]
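As a quick sketch of how a few of them are used (the tensors here are dummy data):
import torch
import torch.nn as nn

mse = nn.MSELoss()           # regression
ce = nn.CrossEntropyLoss()   # multi-class classification (expects raw logits)
bce = nn.BCEWithLogitsLoss() # binary classification (expects raw logits)

logits = torch.randn(4, 3)           # 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2]) # ground-truth class indices
print(ce(logits, targets))           # a scalar loss tensor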
4. Layers and Modules
Just like we don't need to write our own optimizers, we don't have to declare raw tensors and matrix multiplication logic on our own (for the most part). PyTorch modules have us covered.
A PyTorch Module is the fundamental building block for all neural networks in PyTorch, acting as a container for layers, learnable parameters, and the logic for how data flows through them. For example, for that linear layer we wrote earlier, where we manually declared the weights and biases, we can instead use these lines of code:
linear_model = nn.Linear(in_size, out_size) # Torch takes care of initializing the weights
prediction = linear_model(input) # Forward pass
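Putting the pieces together, the earlier training loop can be rewritten around a module. Here is a minimal sketch that reuses the X and Y tensors from before:
import torch.nn as nn
from torch.optim import SGD

model = nn.Linear(2, 1) # 2 input features (area, age) -> 1 output (price)
optimizer = SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for step in range(10):
    pred = model(X)         # forward pass
    loss = loss_fn(pred, Y) # compute the loss
    loss.backward()         # backward pass: populate gradients
    optimizer.step()        # update the weights
    optimizer.zero_grad()   # reset gradients for the next step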
We learned how to make linear models (yay!), but what we really need to learn is how to train bigger and deeper neural networks. The simplest kind of neural network is the multi-layer perceptron (MLP). An MLP is basically multiple linear layers with non-linear functions in between.
Creating MLPs is pretty simple in Torch. nn.Sequential is a standard PyTorch module that passes the input sequentially through multiple layers. Here is the code:
# A 2-layer MLP
mlp_2_layers = nn.Sequential(
    nn.Linear(in_size, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, out_size)
)

# A 3-layer MLP
mlp_3_layers = nn.Sequential(
    nn.Linear(in_size, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, out_size)
)
Multi-layer perceptrons can learn compositional and non-linear functions! Here is an example of a zig-zag function and how a 2-layer MLP with ReLU learns it.
[Figure: a 2-layer MLP with ReLU fitting a zig-zag function.]
5. Writing custom networks
Torch has a vast array of advanced layers and modules that have inspired entire research papers. You can think of these as Lego blocks that you can fit together to compose any neural network.
Want a convolutional layer for images? Use nn.Conv2d.
A GRU layer to process sequential tokens? Use nn.GRU.
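For instance (the sizes here are arbitrary):
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1) # e.g. for RGB images
gru = nn.GRU(input_size=128, hidden_size=64, batch_first=True)             # e.g. for sequences of token embeddings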
But most often in research, you'll want to write a custom neural network architecture from scratch. The recipe for this process is as follows:
- Subclass nn.Module.
- In the __init__ constructor, initialize all your layers and weights.
- Define a forward() method where you write the logic of the forward pass.
Here is an example where we implement a block from the classic ResNet architecture:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNetBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResNetBlock, self).__init__()
        self.conv1 = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=3,
            stride=stride,
            padding=1,
            bias=False,
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        residual = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample:
            residual = self.downsample(x)
        out += residual
        out = F.relu(out)
        return out
That's it! You just initialize your layers and define the forward-pass computation graph, and Torch handles the backward pass on its own.
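A quick smoke test of the block (the shapes here are arbitrary):
block = ResNetBlock(in_channels=64, out_channels=64)
x = torch.randn(1, 64, 32, 32) # (batch, channels, height, width)
out = block(x)
print(out.shape)               # torch.Size([1, 64, 32, 32])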
Of course, you can use your custom layers and modules as components of a much larger network! For example, here is how to write a single transformer block and stack several of them into an encoder.
class AttentionLayer(nn.Module):
    def __init__(self, input_dim, attention_dim=64):
        super(AttentionLayer, self).__init__()
        # Linear layers for attention computation
        self.query = nn.Linear(input_dim, attention_dim)
        self.key = nn.Linear(input_dim, attention_dim)
        self.value = nn.Linear(input_dim, attention_dim)
        # Scaling factor
        self.scale = torch.sqrt(torch.FloatTensor([attention_dim]))

    def forward(self, x):
        # x shape: (batch_size, sequence_length, input_dim)
        batch_size, seq_len, input_dim = x.size()
        # Compute Q, K, V
        Q = self.query(x)  # (batch_size, seq_len, attention_dim)
        K = self.key(x)    # (batch_size, seq_len, attention_dim)
        V = self.value(x)  # (batch_size, seq_len, attention_dim)
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale  # Scaled dot-product attention
        attention_weights = F.softmax(attention_scores, dim=-1)  # Convert attention scores to probabilities
        attended_output = torch.matmul(attention_weights, V)     # Apply the attention weights to the values
        return attended_output, attention_weights

class TransformerBlock(nn.Module):
    """
    A single transformer block composed of self-attention and a feed-forward network.
    """
    def __init__(self, embed_dim, ffn_hidden_dim):
        """
        Args:
            embed_dim (int): The dimensionality of the model's embeddings.
            ffn_hidden_dim (int): The dimensionality of the hidden layer in the FFN.
        """
        super(TransformerBlock, self).__init__()
        self.attention = AttentionLayer(embed_dim, embed_dim)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_hidden_dim),
            nn.ReLU(),
            nn.Linear(ffn_hidden_dim, embed_dim)
        )

    def forward(self, x):
        """
        Forward pass for the transformer block.
        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, sequence_length, embed_dim).
        Returns:
            torch.Tensor: The output tensor of the transformer block.
        """
        # Self-attention part
        attended, _ = self.attention(x)
        # Add & Norm (residual connection)
        x = self.norm1(attended + x)
        # Feed-forward part
        ffn_out = self.ffn(x)
        # Add & Norm (residual connection)
        x = self.norm2(ffn_out + x)
        return x

class TransformerEncoder(nn.Module):
    """
    A transformer encoder that stacks multiple TransformerBlocks.
    """
    def __init__(self, num_layers, embed_dim, ffn_hidden_dim, seq_len, output_dim):
        """
        Args:
            num_layers (int): The number of transformer blocks to stack.
            embed_dim (int): The dimensionality of the model's embeddings.
            ffn_hidden_dim (int): The dimensionality of the hidden layer in the FFN.
            seq_len (int): The length of the input sequences.
            output_dim (int): The dimensionality of the final output (e.g., number of classes).
        """
        super(TransformerEncoder, self).__init__()
        # Create a list of transformer blocks
        self.layers = nn.ModuleList(
            [TransformerBlock(embed_dim, ffn_hidden_dim) for _ in range(num_layers)]
        )
        # Final classification head
        self.classifier = nn.Linear(embed_dim * seq_len, output_dim)

    def forward(self, x):
        """
        Forward pass for the full transformer encoder.
        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, sequence_length, embed_dim).
        Returns:
            torch.Tensor: The final output logits from the classifier.
        """
        # Pass the input through all transformer blocks
        for layer in self.layers:
            x = layer(x)
        # Flatten the output for the classifier
        x = x.view(x.size(0), -1)
        # Final classification
        output = self.classifier(x)
        return output
Notice how the first module, AttentionLayer, computes scaled dot-product attention. TransformerBlock applies layer norms and a feedforward network on top of it. And finally, the TransformerEncoder module applies multiple transformer blocks in sequence! Just like that, we have a BERT-style model consisting of multiple stacks of bidirectional attention layers, along with various refinements such as layer norms and residual connections.
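As a quick sanity check of the full stack (the dimensions here are arbitrary):
encoder = TransformerEncoder(num_layers=2, embed_dim=32, ffn_hidden_dim=64, seq_len=10, output_dim=5)
x = torch.randn(8, 10, 32) # (batch_size, sequence_length, embed_dim)
logits = encoder(x)
print(logits.shape)        # torch.Size([8, 5])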
If you are a beginner and this part overwhelms you, that is very much expected! The cool thing with PyTorch is that you get to choose the level of complexity you want to work with, depending on your skill level.
When you are starting out, you may want to stick to the hundreds of ready-made modules PyTorch offers out of the box. You'll slowly find the need to branch out and customize them for your own use case. And as you write a few custom ones on your own, you'll grow increasingly confident and proficient.
The goal of this section was to show you the capabilities and endless customization you can achieve by combining modules. Remember: you write the forward pass, and as long as the full graph is differentiable, Torch will always be able to do the auto-differentiation for you!
Next steps
The features and concepts covered in this article were handpicked to offer a whirlwind tour of some of Torch's most important capabilities. I have a YouTube video that explains all of these concepts, along with additional ones like model deployment, dataloaders, distributions, and training techniques.
That's it for this article! Here are some links where you can learn more about my work. Thanks for reading!
Support me on Patreon: https://www.patreon.com/NeuralBreakdownwithAVB
My YouTube channel:
https://www.youtube.com/@avb_fj
Follow me on Twitter:
https://x.com/neural_avb
Read my articles:
https://towardsdatascience.com/author/neural-avb/