    AI in Multiple GPUs: Gradient Accumulation & Data Parallelism

    By ProfitlyAI · February 23, 2026 · 10 Mins Read


    This post is part of a series about distributed AI across multiple GPUs:

    Introduction

    Distributed Data Parallelism (DDP) is the first parallelization technique we'll look at. It's the baseline approach that's almost always used in distributed training settings, and it's commonly combined with other parallelization strategies.

    A Quick Neural Network Refresher

    Training a neural network means running a forward pass, calculating the loss, backpropagating to get the gradient of the loss with respect to each weight, and finally updating the weights (what we call an optimization step). In PyTorch, it typically looks like this:

    import torch
    
    def training_loop(
        model: torch.nn.Module,
        dataloader: torch.utils.data.DataLoader,
        optimizer: torch.optim.Optimizer,
        loss_fn: callable,
    ):
        for i, batch in enumerate(dataloader):
            inputs, targets = batch
            output = model(inputs)  # Forward pass
            loss = loss_fn(output, targets)  # Compute loss
            loss.backward()  # Backward pass (compute gradients)
            optimizer.step()  # Update weights
            optimizer.zero_grad()  # Clear gradients for the next step

    Performing the optimization step on large amounts of training data usually gives more accurate gradient estimates, leading to smoother training and potentially faster convergence. So ideally we'd take each step after computing gradients over the entire training dataset. In practice, that's rarely feasible in Deep Learning scenarios, as it would take too long to compute. Instead, we work with small chunks: mini-batches and micro-batches.

    • Batch: the entire training set used for one optimization step.
    • Mini-batch: a small subset of the training data used for one optimization step.
    • Micro-batch: a subset of the mini-batch; we combine multiple micro-batches for one optimization step.

    This is where Gradient Accumulation and Data Parallelism come into play. Although we don't use the entire dataset for each step, we can use these techniques to significantly increase our mini-batch size.
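To make the batch-size claim concrete, here's a small pure-Python simulation (the noise model and numbers are illustrative, not from the post): per-example gradients are noisy estimates of the true gradient, and averaging more of them per step shrinks the noise roughly as 1/sqrt(batch_size).

```python
import random
import statistics

random.seed(0)
TRUE_GRAD = 1.0  # the "true" gradient we are trying to estimate

def mini_batch_gradient(batch_size: int) -> float:
    """Average of `batch_size` noisy per-example gradients (unit Gaussian noise)."""
    return statistics.fmean(
        TRUE_GRAD + random.gauss(0, 1) for _ in range(batch_size)
    )

# Spread of the estimate shrinks as the batch grows (~1/sqrt(batch_size))
for bs in (4, 64, 1024):
    spread = statistics.stdev(mini_batch_gradient(bs) for _ in range(200))
    print(f"batch_size={bs:5d}  std of gradient estimate = {spread:.3f}")
```

Larger batches buy smoother gradient estimates, which is exactly what Gradient Accumulation and DDP let us afford.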

    Gradient Accumulation

    Right here’s the way it works: decide a big mini-batch that gained’t slot in GPU reminiscence, however then cut up it into micro-batches that do match. For every micro-batch, run ahead and backward passes, including (accumulating) the computed gradients. As soon as all micro-batches are processed, carry out a single optimization step utilizing the averaged gradients.

    Notice that Gradient Accumulation isn't a parallelization technique and doesn't require multiple GPUs.

    Image by author: Gradient Accumulation animation

    Implementing Gradient Accumulation from scratch is straightforward. Here's what it looks like in a simple training loop:

    import torch
    
    def training_loop(
        model: torch.nn.Module,
        dataloader: torch.utils.data.DataLoader,
        optimizer: torch.optim.Optimizer,
        loss_fn: callable,
        grad_accum_steps: int,
    ):
        for i, batch in enumerate(dataloader):
            inputs, targets = batch
            output = model(inputs)
            # Scale the loss so the summed gradients average out over the micro-batches
            loss = loss_fn(output, targets) / grad_accum_steps
            loss.backward()  # Gradients get accumulated (summed)
    
            # Only update weights after `grad_accum_steps` micro-batches
            if (i + 1) % grad_accum_steps == 0:  # i+1 to avoid a step in the first iteration when i=0
                optimizer.step()
                optimizer.zero_grad()
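As a quick sanity check that accumulation really reproduces the large-batch gradient, here's a tiny pure-Python example (the toy linear model and numbers are mine, not from the post): for the loss mean((w·x − t)²), averaging the gradients of four micro-batches of 2 equals the gradient over the full mini-batch of 8.

```python
# Toy setup: model y = w * x, per-batch loss = mean((w*x - t)^2),
# so dL/dw = mean(2 * x * (w*x - t)).
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ts = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

def grad(xb, tb, w):
    """Gradient of the mean squared error over one (micro-)batch."""
    return sum(2 * x * (w * x - t) for x, t in zip(xb, tb)) / len(xb)

# One optimization step on the full mini-batch of 8:
full = grad(xs, ts, w)

# Same step via gradient accumulation: 4 micro-batches of 2, summed then averaged:
accum = sum(grad(xs[i:i + 2], ts[i:i + 2], w) for i in range(0, 8, 2)) / 4

print(full, accum)  # identical (for equally sized micro-batches)
```

The equality holds exactly when all micro-batches have the same size; with ragged last batches the simple average is only approximate.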

    Discover we’re sequentially performing a number of ahead and backward passes earlier than every optimization step, which requires longer coaching instances. It could be good if we might velocity this up by processing a number of micro-batches in parallel… that’s precisely what DDP does!

    Distributed Information Parallelism (DDP)

    For a fairly small number of GPUs (up to ~8), DDP scales almost linearly, which is optimal. That means that if you double the number of GPUs, you can almost halve the training time (we already discussed Linear Scaling previously).

    With DDP, multiple GPUs work together to process a larger effective mini-batch, handling each micro-batch in parallel. The workflow looks like this:

    1. Split the mini-batch across GPUs.
    2. Each GPU runs its own forward and backward passes to compute gradients for its own data shard (micro-batch).
    3. Use an All-Reduce operation (we previously learned about it in Collective operations) to average gradients across all GPUs.
    4. Each GPU applies the same weight updates, keeping models in perfect sync.

    This lets us train with much larger effective mini-batch sizes, leading to more stable training and potentially faster convergence.
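Steps 3–4 can be sketched with a single-process simulation (the rank gradients are made-up values for illustration): each "rank" holds the gradient from its own micro-batch, the All-Reduce sums element-wise, and dividing by the world size leaves every rank with the same averaged gradient.

```python
world_size = 4

# Each rank's gradient for two parameters (illustrative values)
per_rank_grads = [
    [0.5, -1.0],  # rank 0
    [1.0, -0.5],  # rank 1
    [0.0, -1.5],  # rank 2
    [0.5, -1.0],  # rank 3
]

# All-Reduce with SUM: every rank ends up holding the element-wise sum...
reduced = [sum(col) for col in zip(*per_rank_grads)]

# ...then each rank divides by world_size to get the average gradient
averaged = [g / world_size for g in reduced]
print(averaged)  # [0.5, -1.0]
```

Because every rank applies this same averaged gradient, the model replicas stay bit-identical after each optimizer step.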

    Image by author: Distributed Data Parallel animation

    Implementing DDP from scratch in PyTorch

    Let’s do this step-by-step. In this first iteration, we’re only syncing the gradients.

    import torch
    
    
    class DDPModelWrapper:
        def __init__(self, model: torch.nn.Module):
            self.model = model
    
        def __call__(self, *args, **kwargs):
            return self.model(*args, **kwargs)
    
        def sync_gradients(self):
            # Iterate over parameter matrices in the model
            for param in self.model.parameters():  
                # Some parameters might be frozen and don't have gradients
                if param.grad is not None:
                    # We sum and then divide since torch.distributed doesn't have an average operation
                    torch.distributed.all_reduce(param.grad.data, op=torch.distributed.ReduceOp.SUM)
                    # Assuming each GPU received an equally sized mini-batch, we can average
                    # the gradients dividing by the number of GPUs (aka world size)
                    # By default the loss function already averages over the mini-batch size
                    param.grad.data /= torch.distributed.get_world_size()

    Before we start training, we obviously need our model to be the same across all GPUs, otherwise we would be training different models! Let's improve our implementation by checking that all weights are identical during instantiation (if you don't know what ranks are, check the first blog post of the series).

    import torch
    
    
    class DDPModelWrapper:
        def __init__(self, model: torch.nn.Module):
            self.model = model
            for param in self.model.parameters():
                # We create a new tensor so it can receive the broadcast
                rank_0_param = param.data.clone()
                # Initially rank_0_param contains the values for the current rank
                torch.distributed.broadcast(rank_0_param, src=0)
                # After the broadcast, rank_0_param is overwritten with the parameters from rank 0
                if not torch.equal(param.data, rank_0_param):  # Now we compare rank_x with rank_0
                    raise ValueError("Model parameters are not the same across all processes.")
    
        def __call__(self, *args, **kwargs):
            return self.model(*args, **kwargs)
    
        def sync_gradients(self):
            for param in self.model.parameters():  
                if param.grad is not None:  
                    torch.distributed.all_reduce(param.grad.data, op=torch.distributed.ReduceOp.SUM)
                    param.grad.data /= torch.distributed.get_world_size()

    Combining DDP with GA

    You can combine DDP with GA to achieve even larger effective batch sizes. This is particularly useful when your model is so large that only a few samples fit per GPU.

    The key benefit is reduced communication overhead: instead of syncing gradients after every batch, you only sync once per grad_accum_steps batches. This means:

    • Global effective batch size = num_gpus × micro_batch_size × grad_accum_steps
    • Fewer synchronization points = less time spent on inter-GPU communication
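Plugging hypothetical numbers into the formula above (the values are made up for illustration, not from the post):

```python
num_gpus = 8           # world size
micro_batch_size = 16  # samples per GPU per forward/backward pass
grad_accum_steps = 4   # micro-batches accumulated before each optimizer step

effective_batch = num_gpus * micro_batch_size * grad_accum_steps
print(effective_batch)  # 512

# Communication also drops by a factor of grad_accum_steps: one gradient
# sync per optimizer step instead of one per micro-batch.
micro_batches = 1024
syncs_without_ga = micro_batches                    # sync after every micro-batch
syncs_with_ga = micro_batches // grad_accum_steps   # sync only on optimizer steps
print(syncs_without_ga, syncs_with_ga)  # 1024 256
```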

    A training loop using our DDPModelWrapper with Gradient Accumulation looks like this:

    def training_loop(
        ddp_model: DDPModelWrapper,
        dataloader: torch.utils.data.DataLoader,
        optimizer: torch.optim.Optimizer,
        loss_fn: callable,
        grad_accum_steps: int,
    ):
        for i, batch in enumerate(dataloader):
            inputs, targets = batch
            output = ddp_model(inputs)
            # Scale the loss so the accumulated gradients average out
            loss = loss_fn(output, targets) / grad_accum_steps
            loss.backward()
    
            if (i + 1) % grad_accum_steps == 0:
                # Must sync gradients across GPUs *BEFORE* the optimization step
                ddp_model.sync_gradients()
                optimizer.step()
                optimizer.zero_grad()

    Pro-tips and advanced usage

    • Use data prefetching. You can speed up training by loading the next batch of data while the current one is being processed. PyTorch's DataLoader provides a prefetch_factor argument that controls how many batches to prefetch in the background. Properly leveraging prefetching with CUDA can be a bit tricky, so we'll leave it for a future post.
    • Don't max out GPU memory. Counter-intuitively, leaving some free memory can lead to faster training throughput. If you leave at least ~15% of GPU memory free, the GPU can better manage memory by avoiding fragmentation.
    • PyTorch DDP overlaps communication with computation. By default, DDP communicates gradients as they're computed during backpropagation rather than waiting for the full backward pass to finish. Here's how:
      • PyTorch organizes model gradients into buckets of bucket_cap_mb megabytes. During the backward pass, PyTorch marks gradients as ready for reduction as they're computed. Once all gradients in a bucket are ready, DDP kicks off an asynchronous all-reduce to average those gradients across all ranks. The loss.backward() call returns only after all all-reduce operations have completed, so immediately calling optimizer.step() is safe.
      • The bucket_cap_mb parameter creates a tradeoff: smaller values trigger more frequent all-reduce operations, but each communication kernel launch incurs some overhead that can hurt performance. Larger values reduce communication frequency but also reduce overlap; at the extreme, if buckets are too large, you're waiting for the entire backward pass to finish before communicating. The optimal value depends on your model architecture and hardware, so profile with different values to find what works best.
    Source: PyTorch Tutorial
    • Right here’s an entire PyTorch implementation of DDP:
    """
    Launch with:
      torchrun --nproc_per_node=NUM_GPUS ddp.py
    """
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler
    from torch import optim
    
    
    class ToyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, 256),
            )
    
        def forward(self, x):
            return self.net(x)
    
    
    def train():
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        torch.cuda.set_device(rank)
        device = torch.device(f"cuda:{rank}")
    
        # Create dummy dataset
        x_data = torch.randn(1000, 1024)
        y_data = torch.randn(1000, 256)
        dataset = TensorDataset(x_data, y_data)
    
        # DistributedSampler ensures each rank gets different data
        sampler = DistributedSampler(dataset, shuffle=True)
        dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)
    
        model = ToyModel().to(device)
    
        # gradient_as_bucket_view: avoids an extra grad tensor copy per bucket.
        ddp_model = DDP(
            model,
            device_ids=[rank],
            bucket_cap_mb=25,
            gradient_as_bucket_view=True,
        )
    
        optimizer = optim.AdamW(ddp_model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()
    
        for epoch in range(2):
            sampler.set_epoch(epoch)  # Ensures different shuffling each epoch
    
            for batch_idx, (x, y) in enumerate(dataloader):
                x, y = x.to(device), y.to(device)
    
                optimizer.zero_grad()
                output = ddp_model(x)
                loss = loss_fn(output, y)
    
                # Backward automatically overlaps with all-reduce per bucket.
                # By the time this returns, all all-reduce ops have completed.
                loss.backward()
                optimizer.step()
    
                if rank == 0 and batch_idx % 5 == 0:
                    print(f"epoch {epoch}  batch {batch_idx}  loss={loss.item():.4f}")
    
        dist.destroy_process_group()
    
    
    if __name__ == "__main__":
        train()
    • Right here’s an entire PyTorch implementation combining DDP with GA:
    """
    Launch with:
      torchrun --nproc_per_node=NUM_GPUS ddp_ga.py
    """
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler
    from torch import optim
    from contextlib import nullcontext
    
    
    class ToyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, 256),
            )
    
        def forward(self, x):
            return self.net(x)
    
    
    def train():
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        torch.cuda.set_device(rank)
        device = torch.device(f"cuda:{rank}")
    
        # Create dummy dataset
        x_data = torch.randn(1000, 1024)
        y_data = torch.randn(1000, 256)
        dataset = TensorDataset(x_data, y_data)
    
        # DistributedSampler ensures each rank gets different data
        sampler = DistributedSampler(dataset, shuffle=True)
        dataloader = DataLoader(dataset, batch_size=16, sampler=sampler)
    
        model = ToyModel().to(device)
    
        ddp_model = DDP(
            model,
            device_ids=[rank],
            bucket_cap_mb=25,
            gradient_as_bucket_view=True,
        )
    
        optimizer = optim.AdamW(ddp_model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()
    
        ACCUM_STEPS = 4
    
        for epoch in range(2):
            sampler.set_epoch(epoch)  # Ensures different shuffling each epoch
    
            optimizer.zero_grad()
            for batch_idx, (x, y) in enumerate(dataloader):
                x, y = x.to(device), y.to(device)
    
                is_last_micro_step = (batch_idx + 1) % ACCUM_STEPS == 0
    
                # no_sync() suppresses all-reduce on accumulation steps.
                # On the last micro-step we exit no_sync() so DDP fires
                # the all-reduce overlapped with that backward pass.
                ctx = ddp_model.no_sync() if not is_last_micro_step else nullcontext()
    
                with ctx:
                    output = ddp_model(x)
                    loss = loss_fn(output, y) / ACCUM_STEPS
                    loss.backward()
    
                if is_last_micro_step:
                    optimizer.step()
                    optimizer.zero_grad()
    
                    if rank == 0:
                        print(f"epoch {epoch}  batch {batch_idx}  loss={loss.item() * ACCUM_STEPS:.4f}")
    
        dist.destroy_process_group()
    
    
    if __name__ == "__main__":
        train()

    Conclusion

    Follow me on X for more free AI content @l_cesconetto
    
    Congratulations on making it to the end! In this post you learned about:
    
    • The importance of large batch sizes
    • How Gradient Accumulation works and its limitations
    • The DDP workflow and its benefits
    • How to implement GA and DDP from scratch in PyTorch
    • How to combine GA and DDP
    
    In the next article, we'll explore ZeRO (Zero Redundancy Optimizer), a more advanced technique that builds upon DDP to further optimize VRAM usage.

