    Optimizing PyTorch Model Inference on AWS Graviton



Running AI/ML models can be an extremely costly endeavor. Many of our posts have focused on a wide variety of tips, tricks, and techniques for analyzing and optimizing the runtime performance of AI/ML workloads. Our argument has been twofold:

1. Performance analysis and optimization should be an integral part of every AI/ML development project, and,
2. Achieving meaningful performance boosts and cost reductions does not require a high degree of specialization. Any AI/ML developer can do it. Every AI/ML developer should do it.

In a previous post, we addressed the challenge of optimizing an ML inference workload on an Intel® Xeon® processor. We began by reviewing a number of scenarios in which a CPU might be the best choice for AI/ML inference, even in an era of multiple dedicated AI inference chips. We then introduced a toy image-classification PyTorch model and demonstrated a wide range of techniques for boosting its runtime performance on an Amazon EC2 c7i.xlarge instance, powered by 4th Generation Intel Xeon Scalable processors. In this post, we extend the discussion to AWS's homegrown Arm-based Graviton CPUs. We will revisit many of the optimizations discussed in our earlier posts, some of which require adaptation to the Arm processor, and assess their impact on the same toy model. Given the profound differences between the Arm and Intel processors, the path to the best-performing configuration may look quite different.

    AWS Graviton

AWS Graviton is a family of processors based on Arm Neoverse CPUs, customized and built by AWS for optimal price-performance and power efficiency. Their dedicated engines for vector processing (NEON and SVE/SVE2) and matrix multiplication (MMLA), and their support for bfloat16 operations (as of Graviton3), make them a compelling candidate for running compute-intensive workloads such as AI/ML inference. To facilitate high-performance AI/ML on Graviton, the entire software stack has been optimized for its use:

• Low-Level Compute Kernels from the Arm Compute Library (ACL) are highly optimized to leverage the Graviton hardware accelerators (e.g., SVE and MMLA).
• ML Middleware Libraries such as oneDNN and OpenBLAS route deep learning and linear algebra operations to the specialized ACL kernels.
• AI/ML Frameworks like PyTorch and TensorFlow are compiled and configured to use these optimized backends (a quick way to verify this from Python is sketched after this list).
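
A quick way to sanity-check that your PyTorch build is wired to these backends is to inspect its build configuration from Python. The following is a minimal sketch; the exact contents of the configuration string vary between builds, so look for oneDNN/ACL-related entries:

import torch

# confirm that the oneDNN (MKLDNN) backend is available; on Graviton
# builds it is the layer that routes operators to the ACL kernels
print(torch.backends.mkldnn.is_available())

# print the full build configuration and look for oneDNN/ACL entries
print(torch.__config__.show())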

In this post, we will use an Amazon EC2 c8g.xlarge instance (4 vCPUs), powered by the AWS Graviton4 processor, and an AWS ARM64 PyTorch Deep Learning AMI (DLAMI).

The intention of this post is to demonstrate techniques for improving runtime performance on an AWS Graviton instance. Importantly, our intention is not to draw a comparison between AWS Graviton and alternative chips, nor to advocate for the use of one chip over another. The best choice of processor depends on a whole host of considerations beyond the scope of this post. One of the important considerations is the maximum runtime performance of your model on each chip; in other words, how much "bang" do we get for our buck? Thus, making an informed decision about the best processor is one of the motivations for optimizing runtime performance on each.

Another motivation for optimizing our model's performance for multiple inference devices is to increase its portability. The AI/ML playing field is extremely dynamic, and resilience to changing circumstances is key to success. It is not uncommon for compute instances of certain types to suddenly become unavailable or scarce. Conversely, an increase in the capacity of AWS Graviton instances might imply their availability at steep discounts, e.g., in the Amazon EC2 Spot Instance market, presenting cost-saving opportunities that you would not want to miss.

    Disclaimers

The code blocks we will share, the optimization steps we will discuss, and the results we will report are intended as examples of the benefits you may see from ML performance optimization on an AWS Graviton instance. They may differ considerably from the results you see with your own model and runtime environment. Please do not rely on the accuracy or optimality of the contents of this post, and please do not interpret the mention of any library, framework, or platform as an endorsement of its use.

    Inference Optimization on AWS Graviton

As in our previous post, we will demonstrate the optimization steps on a toy image-classification model:

import torch, torchvision
import time


def get_model(channels_last=False, compile=False):
    model = torchvision.models.resnet50()

    if channels_last:
        model = model.to(memory_format=torch.channels_last)

    model = model.eval()

    if compile:
        model = torch.compile(model)

    return model

def get_input(batch_size, channels_last=False):
    batch = torch.randn(batch_size, 3, 224, 224)
    if channels_last:
        batch = batch.to(memory_format=torch.channels_last)
    return batch

def get_inference_fn(model, enable_amp=False):
    def infer_fn(batch):
        with torch.inference_mode(), torch.amp.autocast(
                'cpu',
                dtype=torch.bfloat16,
                enabled=enable_amp
        ):
            output = model(batch)
        return output
    return infer_fn

def benchmark(infer_fn, batch):
    # warm-up
    for _ in range(20):
        _ = infer_fn(batch)

    iters = 100

    start = time.time()
    for _ in range(iters):
        _ = infer_fn(batch)
    end = time.time()

    return (end - start) / iters


batch_size = 1
model = get_model()
batch = get_input(batch_size)
infer_fn = get_inference_fn(model)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

The initial throughput is 12 samples per second (SPS).

Upgrade to the Most Recent PyTorch Release

While the version of PyTorch in our DLAMI is 2.8, the most recent version of PyTorch at the time of this writing is 2.9. Given the rapid pace of development in the field of AI/ML, it is highly recommended to use the most up-to-date library packages. As our first step, we upgrade to PyTorch 2.9, which includes key updates to its Arm backend.

pip3 install -U torch torchvision --index-url https://download.pytorch.org/whl/cpu

In the case of our model in its initial configuration, upgrading the PyTorch version does not have any effect. Nonetheless, this step is important for getting the most out of the optimization techniques that we will assess.

    Batched Inference

To reduce launch overheads and improve the utilization of the hardware accelerators, we group samples together and apply batched inference. The table below demonstrates how the model throughput varies as a function of batch size:

Inference Throughput for Varying Batch Sizes (by Author)
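
A sweep along these lines, reusing the helper functions defined above, can be used to measure the throughput at each batch size (a minimal sketch; the list of batch sizes is illustrative):

# reuses get_model, get_input, get_inference_fn, and benchmark
# from the script above
model = get_model()
infer_fn = get_inference_fn(model)

for batch_size in [1, 2, 4, 8, 16]:
    batch = get_input(batch_size)
    avg_time = benchmark(infer_fn, batch)
    print(f"batch size {batch_size}: "
          f"{batch_size / avg_time:.2f} samples per second")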

Memory Optimizations

We apply a number of techniques from our previous post for optimizing memory allocation and usage. These include the channels-last memory format, automatic mixed precision with the bfloat16 data type (supported from Graviton3 onward), the TCMalloc allocation library, and transparent huge page allocation. Please see our previous post for details. We also enable the fast math mode of the ACL GEMM kernels and caching of the kernel primitives, two optimizations that appear in the official guidelines for running PyTorch inference on Graviton.

The command-line instructions required to enable these optimizations are shown below:

# install TCMalloc
sudo apt-get install google-perftools

# instruct the program to use TCMalloc
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4

# enable transparent huge page memory allocation
export THP_MEM_ALLOC_ENABLE=1

# enable the fast math mode of the GEMM kernels
export DNNL_DEFAULT_FPMATH_MODE=BF16

# set the LRU cache capacity for caching the kernel primitives
export LRU_CACHE_CAPACITY=1024
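
On the Python side, the channels-last format and bfloat16 autocast are enabled through the flags already built into our helper functions, for example:

# enable channels-last and bfloat16 automatic mixed precision
# using the flags defined in the helper functions above
batch_size = 8
model = get_model(channels_last=True)
batch = get_input(batch_size, channels_last=True)
infer_fn = get_inference_fn(model, enable_amp=True)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")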

The following table captures the impact of the memory optimizations, applied successively:

ResNet-50 Memory Optimization Results (by Author)

In the case of our toy model, the channels-last and bfloat16 mixed-precision optimizations had the greatest impact. After applying all of the memory optimizations, the average throughput is 53.03 SPS.

Model Compilation

Support for PyTorch compilation on AWS Graviton is an area of focused effort for the AWS Graviton team. Nonetheless, in the case of our toy model, it results in a slight reduction in throughput, from 53.03 SPS to 52.23 SPS.
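
For reference, compilation is applied via the compile flag of our get_model helper, on top of the memory optimizations above (a minimal sketch using the default inductor backend):

# compile the model in addition to the memory optimizations;
# torch.compile is invoked inside get_model when compile=True
batch_size = 8
model = get_model(channels_last=True, compile=True)
batch = get_input(batch_size, channels_last=True)
infer_fn = get_inference_fn(model, enable_amp=True)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")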

Multi-Worker Inference

While typically applied in settings with many more than four vCPUs, we demonstrate the implementation of multi-worker inference by modifying our script to support core pinning:

if __name__ == '__main__':
    # pin CPUs according to worker rank
    import os, psutil
    rank = int(os.environ.get('RANK', '0'))
    world_size = int(os.environ.get('WORLD_SIZE', '1'))
    cores = list(range(psutil.cpu_count(logical=True)))
    num_cores = len(cores)
    cores_per_process = num_cores // world_size
    start_index = rank * cores_per_process
    end_index = (rank + 1) * cores_per_process
    pid = os.getpid()
    p = psutil.Process(pid)
    p.cpu_affinity(cores[start_index:end_index])

    batch_size = 8
    model = get_model(channels_last=True)
    batch = get_input(batch_size, channels_last=True)
    infer_fn = get_inference_fn(model, enable_amp=True)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

We note that, contrary to other AWS EC2 CPU instance types, each Graviton vCPU maps directly to a single physical CPU core. We use the torchrun utility to start up four workers, each running on a single CPU core:

export OMP_NUM_THREADS=1 # set one OpenMP thread per worker
torchrun --nproc_per_node=4 main.py

This results in a throughput of 55.15 SPS, a 4% improvement over our previous best result.

    INT8 Quantization for Arm

Another area of active development and continuous improvement on Arm is INT8 quantization. INT8 quantization tools are often heavily tied to the target instance type. In our previous post we demonstrated PyTorch 2 Export Quantization with the X86 backend via Inductor, using the TorchAO (0.12.1) library. Fortunately, recent versions of TorchAO include a dedicated quantizer for Arm. The updated quantization sequence is shown below. As in our previous post, we are interested only in the potential performance impact. In practice, INT8 quantization can have a significant impact on model quality and may necessitate a more sophisticated quantization strategy.

from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
import torchao.quantization.pt2e.quantizer.arm_inductor_quantizer as aiq

def quantize_model(model):
    x = torch.randn(4, 3, 224, 224).contiguous(
                            memory_format=torch.channels_last)
    example_inputs = (x,)
    batch_dim = torch.export.Dim("batch")
    with torch.no_grad():
        exported_model = torch.export.export(
            model,
            example_inputs,
            dynamic_shapes=((batch_dim,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC),
                            )
        ).module()
    quantizer = aiq.ArmInductorQuantizer()
    quantizer.set_global(aiq.get_default_arm_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)
    converted_model = convert_pt2e(prepared_model)
    optimized_model = torch.compile(converted_model)
    return optimized_model


batch_size = 8
model = get_model(channels_last=True)
model = quantize_model(model)
batch = get_input(batch_size, channels_last=True)
infer_fn = get_inference_fn(model, enable_amp=True)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

The resulting throughput is 56.77 SPS, a 7.1% improvement over the bfloat16 solution.

AOT Compilation Using ONNX and OpenVINO

In our previous post, we explored ahead-of-time (AOT) model compilation techniques using the Open Neural Network Exchange (ONNX) and OpenVINO. Both libraries include dedicated support for running on AWS Graviton (e.g., see here and here). The experiments in this section require the following library installations:

pip install onnxruntime onnxscript openvino nncf

The following code block demonstrates model compilation and execution on Arm using ONNX:

def export_to_onnx(model, onnx_path="resnet50.onnx"):
    dummy_input = torch.randn(4, 3, 224, 224)
    batch = torch.export.Dim("batch")
    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        input_names=["input"],
        output_names=["output"],
        dynamic_shapes=((batch,
                         torch.export.Dim.STATIC,
                         torch.export.Dim.STATIC,
                         torch.export.Dim.STATIC),
                        ),
        dynamo=True
    )
    return onnx_path

def onnx_infer_fn(onnx_path):
    import onnxruntime as ort

    # configure the session options before creating the session so that
    # the bfloat16 fast-math GEMM kernels are actually enabled
    sess_options = ort.SessionOptions()
    sess_options.add_session_config_entry(
               "mlas.enable_gemm_fastmath_arm64_bfloat16", "1")
    sess = ort.InferenceSession(
        onnx_path,
        sess_options,
        providers=["CPUExecutionProvider"]
    )
    input_name = sess.get_inputs()[0].name

    def infer_fn(batch):
        result = sess.run(None, {input_name: batch})
        return result
    return infer_fn

batch_size = 8
model = get_model()
onnx_path = export_to_onnx(model)
batch = get_input(batch_size).numpy()
infer_fn = onnx_infer_fn(onnx_path)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

It should be noted that ONNX Runtime supports a dedicated ACL execution provider for running on Arm, but this requires a custom ONNX Runtime build (as of the time of this writing), which is beyond the scope of this post.

Alternatively, we can compile the model using OpenVINO. The code block below demonstrates its use, including an option for INT8 quantization using NNCF:

import openvino as ov
import nncf

def openvino_infer_fn(compiled_model):
    def infer_fn(batch):
        result = compiled_model([batch])[0]
        return result
    return infer_fn

class RandomDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 10000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224)

quantize_model = False
batch_size = 8
model = get_model()
calibration_loader = torch.utils.data.DataLoader(RandomDataset())
calibration_dataset = nncf.Dataset(calibration_loader)

if quantize_model:
    # quantize the PyTorch model
    model = nncf.quantize(model, calibration_dataset)

ovm = ov.convert_model(model, example_input=torch.randn(1, 3, 224, 224))
ovm = ov.compile_model(ovm)
batch = get_input(batch_size).numpy()
infer_fn = openvino_infer_fn(ovm)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

In the case of our toy model, OpenVINO compilation results in a further boost in throughput, to 63.48 SPS, but the NNCF quantization disappoints, yielding just 55.18 SPS.

Results

The results of our experiments are summarized in the table below:

ResNet-50 Inference Optimization Results (by Author)

As in our previous post, we reran our experiments on a second model, a Vision Transformer (ViT) from the timm library, to demonstrate how the impact of the runtime optimizations we discussed can vary based on the details of the model. The results are captured below:

ViT Inference Optimization Results (by Author)
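
For completeness, the ViT model can be constructed along the following lines and dropped into the same benchmark; the specific ViT variant used in our experiments is not spelled out in this post, so vit_base_patch16_224 is only an illustrative choice:

import timm
import torch

def get_vit_model(channels_last=False, compile=False):
    # vit_base_patch16_224 is an illustrative choice of ViT variant
    model = timm.create_model('vit_base_patch16_224', pretrained=False)
    if channels_last:
        model = model.to(memory_format=torch.channels_last)
    model = model.eval()
    if compile:
        model = torch.compile(model)
    return model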

Summary

In this post, we reviewed a number of relatively simple optimization techniques and applied them to two toy PyTorch models. As the results demonstrated, the impact of each optimization step can vary dramatically depending on the details of the model, and the journey toward peak performance can take many different paths. The steps we presented here were just an appetizer; there are undoubtedly many more optimizations that can unlock even greater performance.

Along the way, we noted the many AI/ML libraries that have introduced deep support for the Graviton architecture, and the seemingly continuous community effort of ongoing optimization. The performance gains we achieved, combined with this apparent commitment, demonstrate that AWS Graviton is firmly in the "big leagues" when it comes to running compute-intensive AI/ML workloads.


