
    Optimizing PyTorch Model Inference on CPU



    As the use of AI models grows, so does the importance of optimizing their runtime efficiency. While the degree to which AI models will outperform human intelligence remains a heated topic of debate, their need for powerful and expensive compute resources is unquestionable, and even notorious.

    In previous posts, we covered the topic of AI model optimization, primarily in the context of model training, and demonstrated how it can have a decisive impact on the cost and speed of AI model development. In this post, we turn our attention to AI model inference, where model optimization has an additional goal: to minimize the latency of inference requests and improve the experience of the model's clients.

    In this post, we assume that the platform on which model inference is performed is a 4th Gen Intel® Xeon® Scalable CPU processor, more specifically, an Amazon EC2 c7i.xlarge instance (with 4 Intel Xeon vCPUs) running a dedicated Deep Learning Ubuntu (22.04) AMI and a CPU build of PyTorch 2.8.0. Of course, the choice of a model deployment platform is one of many important decisions made when designing an AI solution, along with the choice of model architecture, development framework, training accelerator, data format, deployment strategy, and so on, each of which must be made with consideration of the associated costs and runtime speed. The choice of a CPU for running model inference may seem surprising in an era in which the number of dedicated AI inference accelerators keeps growing. However, as we will see, there are occasions when the best (and cheapest) option may very well be a good old-fashioned CPU.

    We will introduce a toy image-classification model and proceed to demonstrate some of the optimization opportunities for AI model inference on an Intel® Xeon® CPU. Deploying an AI model typically involves a full inference server solution, but for the sake of simplicity we will limit our discussion to just the model's core execution. For a primer on model inference serving, please see our earlier post: The Case for Centralized AI Model Inference Serving.

    Our intention in this post is to demonstrate that: 1) a few simple optimization techniques can result in meaningful performance gains, and 2) achieving such results does not require specialized expertise in performance analyzers (such as Intel® VTune™ Profiler) or in the inner workings of the low-level compute kernels. Importantly, the process of AI model optimization can vary considerably based on the model architecture and runtime environment. Optimizing for training differs from optimizing for inference. Optimizing a transformer model differs from optimizing a CNN model. Optimizing a 22-billion-parameter model differs from optimizing a 100-million-parameter model. Optimizing a model to run on a GPU differs from optimizing it for a CPU. Even different generations of the same CPU family may have different compute elements and, consequently, call for different optimization strategies. While the high-level steps for optimizing a given model on a given instance are fairly standard, the specific course the work takes and the end result can vary considerably based on the project at hand.

    The code snippets we share are intended for demonstrative purposes. Please do not rely on their accuracy or their optimality, and do not interpret our mention of any tool or technique as an endorsement of its use. Ultimately, the best design choices for your use case will depend greatly on the details of your project and, given the extent of the potential impact on performance, should be evaluated with the appropriate time and attention.

    Why CPU?

    With the ever-increasing number of hardware options for executing AI/ML model inference, our choice of a CPU may seem surprising. In this section, we describe some scenarios in which a CPU may be the preferred platform for inference.

    1. Accessibility: The use of dedicated AI accelerators, such as GPUs, typically requires dedicated deployment and maintenance or, alternatively, access to such instances on a cloud service platform. CPUs, on the other hand, are everywhere. Designing a solution to run on a CPU provides much greater flexibility and increases the opportunities for deployment.
    2. Availability: Even if your algorithm can access an AI accelerator, there is the question of availability. AI accelerators are in extremely high demand, and even if/when you are able to acquire one, whether on-prem or in the cloud, you may choose to prioritize it for tasks that are even more resource intensive, such as AI model training.
    3. Reduced Latency: There are many situations in which your AI model is just one component in a pipeline of software algorithms running on a standard CPU. While the AI model may run considerably faster on an AI accelerator, once you account for the time required to send an inference request over the network, it is quite possible that running it on the same CPU will be faster.
    4. Underuse of the Accelerator: AI accelerators are often quite expensive. To justify their cost, your goal should be to keep them fully occupied, minimizing their idle time. In some cases, the inference load will not justify the cost of an expensive AI accelerator.
    5. Model Architecture: These days, we tend to automatically assume that AI models will perform considerably better on AI accelerators than on CPUs. And while this is usually the case, your model may include layers that perform better on a CPU. For example, sequential algorithms such as Non-Maximum Suppression (NMS) and the Hungarian matching algorithm tend to perform better on a CPU than on a GPU and are often offloaded onto the CPU even when a GPU is available (e.g., see here). If your model contains many such layers, running it on a CPU might not be such a bad option.

    Why Intel Xeon?

    Intel® Xeon® Scalable CPU processors come with built-in accelerators for the matrix and convolution operators that are common in typical AI/ML workloads. These include AVX-512 (introduced in Gen1), the VNNI extension (Gen2), and AMX (Gen4). The AMX engine, in particular, consists of specialized hardware instructions for executing AI models using bfloat16 and int8 precision data types. The acceleration engines are tightly integrated with Intel's optimized software stack, which includes oneDNN, OpenVINO, and the Intel Extension for PyTorch (IPEX). These libraries take advantage of the dedicated Intel® Xeon® hardware capabilities to optimize model execution with minimal code changes.
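    As a quick sanity check, you can query the vector ISA that PyTorch detects and look for the AMX flags exposed by the Linux kernel. This is a minimal sketch (not from the original experiment) that assumes a recent PyTorch 2.x build and a Linux host:

    import torch

    # Highest vector ISA detected by PyTorch on this CPU (e.g., "AVX512").
    print(torch.backends.cpu.get_cpu_capability())

    # On Linux, AMX support appears as amx_tile/amx_bf16/amx_int8 CPU flags.
    with open("/proc/cpuinfo") as f:
        cpu_flags = f.read().split()
    print(sorted({flag for flag in cpu_flags if flag.startswith("amx")}))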

    Regardless of the arguments made in this section, the choice of inference vehicle should be made after considering all available options and after assessing the opportunities for optimization on each. In the next sections, we introduce a toy experiment and explore some of the optimization opportunities on CPU.

    Inference Experiment

    In this section, we define a toy AI model inference experiment comprising a ResNet-50 image classification model, a randomly generated input batch, and a simple benchmarking utility that we use to report the average number of input samples processed per second (SPS).

    import torch, torchvision
    import time


    def get_model():
        model = torchvision.models.resnet50()
        model = model.eval()
        return model


    def get_input(batch_size):
        batch = torch.randn(batch_size, 3, 224, 224)
        return batch


    def get_inference_fn(model):
        def infer_fn(batch):
            with torch.inference_mode():
                output = model(batch)
            return output
        return infer_fn


    def benchmark(infer_fn, batch):
        # warm-up
        for _ in range(10):
            _ = infer_fn(batch)

        iters = 100

        start = time.time()
        for _ in range(iters):
            _ = infer_fn(batch)
        end = time.time()

        return (end - start) / iters


    batch_size = 1
    model = get_model()
    batch = get_input(batch_size)
    infer_fn = get_inference_fn(model)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    The baseline performance of our toy model is 22.76 samples per second (SPS).

    Model Inference Optimization

    In this section, we apply a number of optimizations to our toy experiment and assess their impact on runtime performance. Our focus is on optimization techniques that can be applied with relative ease. While it is quite likely that additional performance gains could be achieved, these may require far greater specialization and a more significant time investment.

    Our focus is on optimizations that do not change the model architecture; techniques such as model distillation and model pruning are out of the scope of this post. Also out of scope are methods for optimizing specific model components, e.g., by implementing custom PyTorch operators.

    In a previous post we discussed AI model optimization on Intel Xeon CPUs in the context of training workloads. In this section we revisit some of the techniques mentioned there, this time in the context of AI model inference. We supplement these with optimization techniques that are unique to inference settings, including model compilation for inference, INT8 quantization, and multi-worker inference.

    The order in which we present the optimization methods is not binding. In fact, some of the techniques are interdependent; for example, increasing the number of inference workers may affect the optimal choice of batch size.

    Optimization 1: Batched Inference

    A common strategy for increasing resource utilization while reducing the average inference response time is to group input samples into batches. In real-world scenarios, we would want to cap the batch size so that we meet the service-level response time requirements, but for the purposes of our experiment we ignore this constraint. Experimenting with different batch sizes, we find that a batch size of 8 results in a throughput of 26.28 SPS, 15% higher than the baseline result.

    Note that in the case where the shapes of the input samples vary, batching requires additional handling (e.g., see here).
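    For reference, the batched run reuses the helper functions defined above unchanged; only the batch size passed to the driver code differs (batch_size = 8 is simply the value that performed best in our setup):

    batch_size = 8  # best-performing value among those we tried
    model = get_model()
    batch = get_input(batch_size)
    infer_fn = get_inference_fn(model)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")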

    Optimization 2: Channels-Last Memory Format

    By default in PyTorch, 4D tensors are stored in NCHW format, i.e., the four dimensions represent the batch size, channels, height, and width, respectively. However, the channels-last, or NHWC, format (i.e., batch size, height, width, and channels) exhibits better performance on CPU. Adjusting our inference script to apply the channels-last optimization is a simple matter of setting the memory format of both the model and the input to torch.channels_last, as shown below:

    def get_model(channels_last=False):
        model = torchvision.models.resnet50()
        if channels_last:
            model = model.to(memory_format=torch.channels_last)
        model = model.eval()
        return model


    def get_input(batch_size, channels_last=False):
        batch = torch.randn(batch_size, 3, 224, 224)
        if channels_last:
            batch = batch.to(memory_format=torch.channels_last)
        return batch


    batch_size = 8
    model = get_model(channels_last=True)
    batch = get_input(batch_size, channels_last=True)
    infer_fn = get_inference_fn(model)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    Applying the channels-last memory optimization results in a further increase of 25% in throughput.

    The impact of this optimization is most noticeable on models that have many convolutional layers. It is not expected to make a noticeable difference for other model architectures (e.g., transformer models).

    Please see the PyTorch documentation for more details on the memory-format optimization, and the Intel documentation for details on how this is implemented internally in oneDNN.
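    A quick way to confirm which memory format a tensor actually carries is PyTorch's contiguity check. This is a minimal sketch using the helpers defined above:

    batch = get_input(8, channels_last=True)
    # The logical shape is unchanged; only the underlying strides differ.
    print(batch.shape)                                             # torch.Size([8, 3, 224, 224])
    print(batch.is_contiguous(memory_format=torch.channels_last))  # True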

    Optimization 3: Automatic Mixed Precision

    Modern Intel® Xeon® Scalable processors (from Gen3) include native support for the bfloat16 data type, a 16-bit floating-point alternative to the standard float32. We can take advantage of this by applying PyTorch's automatic mixed precision package, torch.amp, as demonstrated below:

    def get_inference_fn(model, enable_amp=False):
        def infer_fn(batch):
            with torch.inference_mode(), torch.amp.autocast(
                    'cpu',
                    dtype=torch.bfloat16,
                    enabled=enable_amp
            ):
                output = model(batch)
            return output
        return infer_fn


    batch_size = 8
    model = get_model(channels_last=True)
    batch = get_input(batch_size, channels_last=True)
    infer_fn = get_inference_fn(model, enable_amp=True)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    The result of applying mixed precision is a throughput of 86.95 samples per second, 2.6 times the previous experiment and 3.8 times the baseline result.

    Note that the use of a reduced-precision floating-point type can affect numerical accuracy, and its effect on model quality must be evaluated.
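    One simple way to get a first impression of the numerical deviation introduced by bfloat16 is to compare the autocast output against the float32 output on the same batch. This is a rough sketch only; a proper evaluation should use your real validation metric and dataset:

    model = get_model(channels_last=True)
    batch = get_input(8, channels_last=True)

    with torch.inference_mode():
        ref_out = model(batch)
        with torch.amp.autocast('cpu', dtype=torch.bfloat16):
            amp_out = model(batch)

    # Maximum absolute deviation of the logits relative to float32.
    print((ref_out - amp_out.float()).abs().max().item())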

    Optimization 4: Memory Allocation Optimization

    Typical AI/ML workloads require the allocation of and access to large blocks of memory. A number of optimization techniques are aimed at tuning the way memory is allocated and used during model execution. One common step is to replace the default system allocator (ptmalloc) with an alternative memory allocation library, such as Jemalloc or TCMalloc, which have been shown to perform better on common AI/ML workloads (e.g., see here). To install TCMalloc, run:

    sudo apt-get install google-perftools

    We direct its use via the LD_PRELOAD environment variable:

    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 python main.py

    This optimization results in another significant performance boost: 117.54 SPS, 35% higher than our previous experiment.
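    For completeness, the analogous setup with Jemalloc (which we did not benchmark here) follows the same pattern. Note that the package and library names below are assumptions for Ubuntu 22.04 and may differ on other distributions:

    # install Jemalloc (package/path names may vary by distribution)
    sudo apt-get install libjemalloc2
    # preload it instead of TCMalloc
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python main.py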

    Optimization 5: Enable Huge Page Allocations

    By default, the Linux kernel allocates memory in blocks of 4 KB, known as pages. The mapping between virtual and physical memory addresses is managed by the CPU's Memory Management Unit (MMU), which uses a small hardware cache called the Translation Lookaside Buffer (TLB). The TLB is limited in the number of entries it can hold. When many small pages are in use (as with large neural network models), the number of TLB cache misses can climb quickly, increasing latency and slowing down the program. A common way to address this is to use "huge pages": blocks of 2 MB (or 1 GB) per page. This reduces the number of TLB entries required, improving memory access efficiency and reducing allocation latency. In our experiment, we enable huge page allocations by setting the following environment variable:

    export THP_MEM_ALLOC_ENABLE=1

    In the case of our model, the impact is negligible. However, this can be an important optimization for many AI/ML workloads.
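    You can check whether transparent huge pages are available on the host by inspecting the kernel setting (standard on most Linux distributions):

    cat /sys/kernel/mm/transparent_hugepage/enabled
    # e.g., "always [madvise] never"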

    Optimization 6: IPEX

    Intel® Extension for PyTorch (IPEX) is a library extension for PyTorch carrying the latest performance optimizations for Intel hardware. To install it, we run:

    pip install intel_extension_for_pytorch

    In the code block below, we demonstrate the basic use of the ipex.optimize API.

    import intel_extension_for_pytorch as ipex

    def get_model(channels_last=False, ipex_optimize=False):
        model = torchvision.models.resnet50()

        if channels_last:
            model = model.to(memory_format=torch.channels_last)

        model = model.eval()

        if ipex_optimize:
            model = ipex.optimize(model, dtype=torch.bfloat16)

        return model

    The resulting throughput is 159.31 SPS, for another 36% performance boost.

    Please see the official documentation for more details on the many optimizations that IPEX has to offer.

    Optimization 7: Model Compilation

    Another popular PyTorch optimization is torch.compile. Introduced in PyTorch 2.0, this just-in-time (JIT) compilation feature performs kernel fusion and other optimizations. In a previous post we covered PyTorch compilation in great detail, including many of its features, controls, and limitations. Here we demonstrate its basic use:

    def get_model(channels_last=False, ipex_optimize=False, compile=False):
        model = torchvision.models.resnet50()

        if channels_last:
            model = model.to(memory_format=torch.channels_last)

        model = model.eval()

        if ipex_optimize:
            model = ipex.optimize(model, dtype=torch.bfloat16)

        if compile:
            model = torch.compile(model)

        return model

    Applying torch.compile to the IPEX-optimized model results in a throughput of 144.5 SPS, which is lower than our previous experiment. In the case of our model, IPEX and torch.compile do not coexist well. When applying just torch.compile, the throughput is 133.36 SPS.

    The general takeaway from this experiment is that, for a given model, any two optimization techniques may interfere with one another. This necessitates evaluating the impact of multiple configurations on the runtime performance of a given model in order to find the best one.
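    A simple (if brute-force) way to do this is to sweep over the optimization toggles we have defined and benchmark each combination. The sketch below reuses the helpers from the previous sections and is intended only to illustrate the idea:

    import itertools

    batch_size = 8
    results = {}
    for channels_last, use_ipex, use_compile in itertools.product([False, True], repeat=3):
        model = get_model(channels_last=channels_last,
                          ipex_optimize=use_ipex,
                          compile=use_compile)
        batch = get_input(batch_size, channels_last=channels_last)
        infer_fn = get_inference_fn(model, enable_amp=True)
        avg_time = benchmark(infer_fn, batch)
        results[(channels_last, use_ipex, use_compile)] = batch_size / avg_time

    # report the configurations from fastest to slowest
    for config, sps in sorted(results.items(), key=lambda kv: -kv[1]):
        print(config, f"{sps:.2f} SPS")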

    Optimization 8: Auto-Tune Environment Setup With torch.backends.xeon.run_cpu

    There are a number of environment settings that control thread and memory management and can be used to further fine-tune the runtime performance of an AI/ML workload. Rather than setting these manually, PyTorch offers the torch.backends.xeon.run_cpu script, which does this automatically. In preparation for using this script, we install Intel's threading and multiprocessing libraries, oneTBB and Intel OpenMP. We also add a symbolic link to our TCMalloc installation.

    # install TBB
    sudo apt install -y libtbb12
    # install OpenMP
    pip install intel-openmp
    # link to TCMalloc
    sudo ln -sf /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 /usr/lib/libtcmalloc.so
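    With these in place, the simplest way to launch our benchmark script (main.py) through the auto-tuning launcher is:

    python -m torch.backends.xeon.run_cpu main.py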

    In the case of our toy model, using torch.backends.xeon.run_cpu increases the throughput to 162.15 SPS, a slight improvement over our previous best of 159.31 SPS.

    Please see the PyTorch documentation for more options of torch.backends.xeon.run_cpu and for details on the environment variables it applies.

    Optimization 9: Multi-worker Inference

    Another popular technique for increasing resource utilization and scale is to load multiple instances of the AI model and run them in parallel in separate processes. Although this approach is more commonly applied on machines with many CPUs (separated into multiple NUMA nodes), not on our small 4-vCPU instance, we include it here for the sake of demonstration. In the command below we run two instances of our model in parallel:

    python -m torch.backends.xeon.run_cpu --ninstances 2 main.py

    This results in a throughput of 169.4 SPS, a modest but meaningful 4% increase.

    Optimization 10: INT8 Quantization

    INT8 quantization is another common technique for accelerating AI model inference. In INT8 quantization, the floating-point data types of the model weights and activations are replaced by 8-bit integers. Intel's Xeon processors include dedicated accelerators for processing INT8 operations (e.g., see here). INT8 quantization can result in a meaningful increase in speed and a lower memory footprint. Importantly, the reduced bit-precision can have a significant impact on the quality of the model output. There are many different approaches to INT8 quantization, some of which involve calibration or retraining, and there is a wide variety of tools and libraries for applying it. A full discussion of quantization is beyond the scope of this post.

    Since in this post we are interested only in the potential performance impact, we demonstrate one quantization scheme using TorchAO, without consideration of the impact on model quality. In the code block below, we implement PyTorch 2 Export Quantization with the X86 backend via Inductor. Please see the documentation for the full details:

    from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
    import torchao.quantization.pt2e.quantizer.x86_inductor_quantizer as xiq

    def quantize_model(model):
        x = torch.randn(4, 3, 224, 224).contiguous(
                                memory_format=torch.channels_last)
        example_inputs = (x,)
        batch_dim = torch.export.Dim("batch")
        with torch.no_grad():
            exported_model = torch.export.export(
                model,
                example_inputs,
                dynamic_shapes=((batch_dim,
                                 torch.export.Dim.STATIC,
                                 torch.export.Dim.STATIC,
                                 torch.export.Dim.STATIC),
                                )
            ).module()
        quantizer = xiq.X86InductorQuantizer()
        quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
        prepared_model = prepare_pt2e(exported_model, quantizer)
        prepared_model(*example_inputs)  # calibration pass
        converted_model = convert_pt2e(prepared_model)
        optimized_model = torch.compile(converted_model)
        return optimized_model


    batch_size = 8
    model = get_model(channels_last=True)
    model = quantize_model(model)
    batch = get_input(batch_size, channels_last=True)
    infer_fn = get_inference_fn(model, enable_amp=True)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    This results in a throughput of 172.67 SPS.

    Please see here for more details on quantization in PyTorch.

    Optimization 11: Graph Compilation and Execution With ONNX

    There are a number of third-party libraries that specialize in compiling PyTorch models into graph representations and optimizing them for runtime performance on target inference devices. One of the most popular of these is Open Neural Network Exchange (ONNX). ONNX performs ahead-of-time compilation of AI/ML models and executes them using a dedicated runtime library.

    While ONNX compilation support is included in PyTorch, we require the following library for executing an ONNX model:

    pip install onnxruntime

    In the code block below, we demonstrate ONNX compilation and model execution:

    def export_to_onnx(model, onnx_path="resnet50.onnx"):
        dummy_input = torch.randn(4, 3, 224, 224)
        batch = torch.export.Dim("batch")
        torch.onnx.export(
            model,
            dummy_input,
            onnx_path,
            input_names=["input"],
            output_names=["output"],
            dynamic_shapes=((batch,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC),
                            ),
            dynamo=True
        )
        return onnx_path


    def onnx_infer_fn(onnx_path):
        import onnxruntime as ort

        sess = ort.InferenceSession(
            onnx_path,
            providers=["CPUExecutionProvider"]
        )
        input_name = sess.get_inputs()[0].name

        def infer_fn(batch):
            result = sess.run(None, {input_name: batch})
            return result
        return infer_fn


    batch_size = 8
    model = get_model()
    onnx_path = export_to_onnx(model)
    batch = get_input(batch_size).numpy()
    infer_fn = onnx_infer_fn(onnx_path)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    The resulting throughput is 44.92 SPS, far lower than in our previous experiments. In the case of our toy model, the ONNX runtime does not provide a benefit.

    Optimization 12: Graph Compilation and Execution with OpenVINO

    Another open-source toolkit aimed at deploying highly performant AI solutions is OpenVINO. OpenVINO is highly optimized for model execution on Intel hardware, e.g., by fully leveraging the Intel AMX instructions. A common way to apply OpenVINO to a PyTorch model is to first convert it to ONNX:

    from openvino import Core

    def compile_openvino_model(onnx_path):
        core = Core()
        model = core.read_model(onnx_path)
        compiled = core.compile_model(model, "CPU")
        return compiled


    def openvino_infer_fn(compiled_model):
        def infer_fn(batch):
            result = compiled_model([batch])[0]
            return result
        return infer_fn


    batch_size = 8
    model = get_model()
    onnx_path = export_to_onnx(model)
    ovm = compile_openvino_model(onnx_path)
    batch = get_input(batch_size).numpy()
    infer_fn = openvino_infer_fn(ovm)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    The result of this optimization is a throughput of 297.33 SPS, nearly twice as fast as our previous best experiment.

    Please see the official documentation for more details on OpenVINO.

    Optimization 13: INT8 Quantization in OpenVINO with NNCF

    As our final optimization, we revisit INT8 quantization, this time in the framework of OpenVINO compilation. As before, there are a number of methods for performing quantization that aim to minimize the impact on model quality. Here we demonstrate the basic flow using the NNCF library, as documented here.

    class RandomDataset(torch.utils.data.Dataset):

        def __len__(self):
            return 10000

        def __getitem__(self, idx):
            return torch.randn(3, 224, 224)


    def nncf_quantize(onnx_path):
        import nncf

        core = Core()
        onnx_model = core.read_model(onnx_path)
        calibration_loader = torch.utils.data.DataLoader(RandomDataset())
        input_name = onnx_model.inputs[0].get_any_name()
        transform_fn = lambda data_item: {input_name: data_item.numpy()}
        calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
        quantized_model = nncf.quantize(onnx_model, calibration_dataset)
        return core.compile_model(quantized_model, "CPU")


    batch_size = 8
    model = get_model()
    onnx_path = export_to_onnx(model)
    q_model = nncf_quantize(onnx_path)
    batch = get_input(batch_size).numpy()
    infer_fn = openvino_infer_fn(q_model)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    This results in a throughput of 482.46(!!) SPS, another drastic improvement and over 18 times faster than our baseline experiment.

    Results

    We summarize the results of our experiments in the table below:

    ResNet50 Inference Experiment (by Author)

    In the case of our toy model, the optimization steps we demonstrated resulted in enormous performance gains. Importantly, the impact of each optimization can vary considerably based on the details of the model. You may find that some of these techniques do not apply to your model, or do not result in improved performance. For example, when we reapply the same sequence of optimizations to a Vision Transformer (ViT) model, the resulting performance boost is 8.41X, still significant, but lower than the 18.36X of our experiment. Please see the appendix to this post for details.

    Our focus has been on runtime performance, but it is important that you also evaluate the impact of each optimization on other metrics that matter to you, most importantly model quality.
    There are, undoubtedly, many more optimization techniques that can be applied; we have merely scratched the surface. Hopefully, the techniques demonstrated here will serve as a useful starting point for optimizing your own inference workloads.

    Summary

    This post continues our series on the important topic of AI/ML model runtime performance analysis and optimization. Our focus in this post has been on model inference on Intel® Xeon® CPU processors. Given the ubiquity of CPUs, the ability to execute models on them in a reliable and performant manner can be extremely compelling. As we have shown, by applying a number of relatively simple techniques, we can achieve considerable gains in model performance, with profound implications for inference costs and inference latency.

    Please don't hesitate to reach out with comments, questions, or corrections.

    Appendix: Vision Transformer Optimization

    To demonstrate how the impact of the runtime optimizations we discussed depends on the details of the AI/ML model, we reran our experiment on a Vision Transformer (ViT) model from the popular timm library:

    from timm.models.vision_transformer import VisionTransformer

    def get_model(channels_last=False, ipex_optimize=False, compile=False):
        model = VisionTransformer()

        if channels_last:
            model = model.to(memory_format=torch.channels_last)

        model = model.eval()

        if ipex_optimize:
            model = ipex.optimize(model, dtype=torch.bfloat16)

        if compile:
            model = torch.compile(model)

        return model

    One modification in this experiment was to apply OpenVINO compilation directly to the PyTorch model rather than to an intermediate ONNX model. This was due to the fact that OpenVINO compilation failed on the ViT ONNX model. The revised NNCF quantization and OpenVINO compilation sequence is shown below:

    import openvino as ov
    import nncf


    batch_size = 8
    model = get_model()
    calibration_loader = torch.utils.data.DataLoader(RandomDataset())
    calibration_dataset = nncf.Dataset(calibration_loader)

    # quantize the PyTorch model
    model = nncf.quantize(model, calibration_dataset)
    ovm = ov.convert_model(model, example_input=torch.randn(1, 3, 224, 224))
    ovm = ov.compile_model(ovm)
    batch = get_input(batch_size).numpy()
    infer_fn = openvino_infer_fn(ovm)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    The table below summarizes the results of the optimizations discussed in this post when applied to the ViT model:

    Vision Transformer Inference Experiment (by Author)


