
    How to Benchmark Classical Machine Learning Workloads on Google Cloud

By ProfitlyAI | August 25, 2025 | 9 min read


Machine Learning Still Matters

In an era of GPU supremacy, why do real-world business cases rely so much on classical machine learning and CPU-based training? The answer is that the data most critical to real-world business applications is still overwhelmingly tabular, structured, and relational: think fraud detection, insurance risk scoring, churn prediction, and operational telemetry. Empirical results (e.g., Grinsztajn et al., Why do tree-based models still outperform deep learning on typical tabular data? (2022), NeurIPS 2022 Track on Datasets and Benchmarks) show that in these domains random forests, gradient boosting, and logistic regression outperform neural nets in both accuracy and reliability. They also offer explainability, which is crucial in regulated industries like banking and healthcare.

GPUs often lose their edge here due to data transfer latency (PCIe overhead) and the poor scaling of some tree-based algorithms. As a result, CPU-based training remains the most cost-effective choice for small-to-medium structured data workloads on cloud platforms.

In this article, I'll walk you through the steps for benchmarking traditional machine learning algorithms on Google Cloud Platform (GCP) CPU offerings, including the Intel® Xeon® 6 that recently became generally available. (Full disclosure: I'm affiliated with Intel as a Senior AI Software Solutions Engineer.)

By systematically comparing runtime, scalability, and cost across algorithms, we can make evidence-based decisions about which approaches deliver the best trade-off between accuracy, speed, and operational cost.

    Machine Configuration on Google Cloud

Go to console.cloud.google.com, set up your billing, and head to "Compute Engine." Then click "Create instance" to configure your virtual machine (VM). The figure below shows the C4 VM series powered by Intel® Xeon® 6 (code-named Granite Rapids) and 5th Gen Intel® Xeon® (code-named Emerald Rapids) CPUs.

Setting up a virtual machine on Google Cloud

Hyperthreading can introduce performance variability because two threads compete for the same core's execution resources. For consistent benchmark results, setting "vCPUs to core ratio" to 1 eliminates that variable; more on this in the next section.

vCPUs to core ratio and visible core counts can be set under "Advanced configurations"

Before creating the VM, increase the boot disk size from the left-hand panel; 200 GB will be more than enough to install the packages needed for this blog.

Increasing the boot disk size for the virtual machine on Google Cloud

Non-Uniform Memory Access (NUMA) Awareness

Memory access is non-uniform on multi-core, multi-socket CPUs. This means the latency and bandwidth of memory operations depend on which CPU core is accessing which region of memory. If you don't control for NUMA, you're benchmarking the scheduler, not the CPU, and the results can appear inconsistent. Memory affinity is exactly what eliminates that problem by controlling which CPU cores access which memory regions. The Linux scheduler is aware of the NUMA topology of the platform and attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used, rather than randomly assigning work across the system. However, without explicit affinity controls, you can't guarantee consistent placement for reliable benchmarking.
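Besides the numactl approach used below, Python itself can query and set CPU affinity on Linux. This is a minimal sketch (not part of the original tutorial) using the standard library's sched_getaffinity/sched_setaffinity, which exist only on Linux; note it controls CPU placement, not memory-node binding, so numactl is still needed for --membind.

```python
import os

def current_affinity():
    """Return the set of CPU cores this process may run on, or None
    if the platform does not expose affinity (e.g., macOS/Windows)."""
    if hasattr(os, "sched_getaffinity"):
        return os.sched_getaffinity(0)
    return None

def pin_to_cores(cores):
    """Restrict this process (and its future children) to the given
    cores. Returns the new affinity set, or None if unsupported."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(cores))
        return os.sched_getaffinity(0)
    return None

print("Current affinity:", current_affinity())
```

Pinning from inside the process is convenient for quick experiments, but numactl remains the more reproducible choice because it also binds memory allocation.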

Let's do a hands-on NUMA experiment with XGBoost and a synthetic dataset that's large enough to stress memory.

First, provision a VM that spans multiple NUMA nodes, SSH into the instance, and install the dependencies.

sudo apt update && sudo apt install -y python3-venv numactl

Then create and activate a Python virtual environment to install scikit-learn, numpy, and xgboost. Save the script below as xgb_bench.py.

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from time import time

# 10M samples, 100 features
X, y = make_classification(n_samples=10_000_000, n_features=100, random_state=42)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "max_depth": 8,
    "nthread": 0,  # use all available threads
}

start = time()
xgb.train(params, dtrain, num_boost_round=100)
print("Elapsed:", time() - start, "seconds")

Next, run this script in three modes (baseline / numa0 / interleave). Repeat each experiment at least five times and report the mean and standard deviation. (This calls for another simple script!)

# Run without NUMA binding
python3 xgb_bench.py
# Run with NUMA binding to a single node
numactl --cpunodebind=0 --membind=0 python3 xgb_bench.py

When assigning tasks to specific physical cores, use the --physcpubind or -C option rather than --cpunodebind.

# Run with interleaved memory across nodes
numactl --interleave=all python3 xgb_bench.py
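The repeat-and-summarize step described above can be sketched as the following driver script (my addition, not from the original tutorial). It assumes xgb_bench.py sits in the current directory and numactl is installed, and skips the runs when either is missing.

```python
import os
import shutil
import statistics
import subprocess
import time

# The three modes described above; an empty prefix means no binding.
MODES = {
    "baseline": [],
    "numa0": ["numactl", "--cpunodebind=0", "--membind=0"],
    "interleave": ["numactl", "--interleave=all"],
}

def summarize(samples):
    """Mean and sample standard deviation of a list of timings."""
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, std

def run_mode(prefix, repeats=5):
    """Run xgb_bench.py `repeats` times under the given numactl prefix
    and return (mean, std) of the wall-clock times."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(prefix + ["python3", "xgb_bench.py"], check=True)
        times.append(time.perf_counter() - start)
    return summarize(times)

if shutil.which("numactl") and os.path.exists("xgb_bench.py"):
    for name, prefix in MODES.items():
        mean, std = run_mode(prefix)
        print(f"{name}: mean={mean:.2f}s std={std:.2f}s")
```

Timing the whole process (rather than just the train call) also captures data-generation and DMatrix construction; for a stricter comparison you could parse the "Elapsed" line the benchmark prints instead.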

Which experiment had the smallest mean? How about the standard deviation? When interpreting these numbers, consider that

    • Lower standard deviation for numa0 indicates more stable locality.
    • A lower mean for numa0 vs. baseline suggests cross-node traffic was hurting you, and
    • If interleave narrows the gap vs. baseline, your workload is bandwidth-sensitive and benefits from spreading pages, at a potential cost to latency.

If none of these apply to a benchmark, the workload may be compute-bound (e.g., shallow trees, small dataset), or the VM might expose a single NUMA node.
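Before drawing conclusions, it is worth checking how many NUMA nodes the VM actually exposes. `numactl --hardware` reports this directly; the snippet below (an illustration I am adding, not from the original) does the same from Python by reading the Linux sysfs topology, falling back to one node where /sys is unavailable.

```python
import glob

def numa_node_count():
    """Count NUMA nodes exposed by the Linux kernel via sysfs.
    Assume a single node when the path is missing (non-Linux host
    or a restricted container)."""
    nodes = glob.glob("/sys/devices/system/node/node[0-9]*")
    return max(len(nodes), 1)

print("NUMA nodes visible to this VM:", numa_node_count())
```

If this prints 1, the numa0 and interleave runs should behave like the baseline, and any differences you measured are noise.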

Choosing the Right Benchmarks

When benchmarking classical machine learning algorithms on CPUs, you can build your own testing framework, leverage existing benchmark suites, or use a hybrid approach if appropriate.

Existing test suites such as scikit-learn_bench and Phoronix Test Suite (PTS) are useful when you need standardized, reproducible results that others can validate and compare against. They work particularly well if you're evaluating well-established algorithms like random forest, SVM, or XGBoost, where standard datasets provide meaningful insights. Custom benchmarks excel at revealing implementation-specific performance characteristics. For instance, they can measure how different sparse matrix formats affect SVM training times, or how feature preprocessing pipelines impact overall throughput on your specific CPU architecture. The datasets you use directly influence what your benchmark reveals. Feel free to consult the official scikit-learn benchmarks for inspiration. Here's also a sample set that you can use to create a custom test.

    Dataset              Size       Task                         Source
    Higgs                11M rows   Binary classification        UCI ML Repo
    Airline Delay        Variable   Multi-class classification   BTS
    California Housing   20K rows   Regression                   sklearn.datasets.fetch_california_housing
    Synthetic            Variable   Scaling tests                sklearn.datasets.make_classification

Synthetic scaling datasets are especially useful for exposing differences in cache, memory bandwidth, and I/O.
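As a minimal illustration of such a scaling test (my sketch, not from the original), the loop below times an ordinary least-squares solve at increasing sample counts; the same sweep pattern applies to make_classification plus a real estimator. Roughly linear growth suggests a compute-bound fit, while super-linear jumps often indicate cache or memory-bandwidth cliffs.

```python
import time
import numpy as np

def scaling_sweep(sizes, n_features=50, seed=42):
    """Time a least-squares fit at each sample count in `sizes`
    and return a list of (n_samples, elapsed_seconds) pairs."""
    rng = np.random.default_rng(seed)
    results = []
    for n in sizes:
        X = rng.standard_normal((n, n_features))
        y = rng.standard_normal(n)
        start = time.perf_counter()
        np.linalg.lstsq(X, y, rcond=None)  # stand-in for model.fit(X, y)
        results.append((n, time.perf_counter() - start))
    return results

for n, elapsed in scaling_sweep([10_000, 50_000, 100_000]):
    print(f"n={n:>7}: {elapsed:.3f}s")
```

Running the same sweep under different numactl bindings (as in the earlier experiment) separates algorithmic scaling from memory-placement effects.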

In the rest of this blog, we illustrate how to run experiments using the open-source scikit-learn_bench, which currently supports the scikit-learn, cuML, and XGBoost frameworks.

Installation and Benchmarking

Once the GCP VM is initialized, you can SSH into the instance and execute the commands below in your terminal.

sudo apt update && sudo apt upgrade -y
sudo apt install -y git wget numactl

To install Conda on a GCP VM, you'll need to account for the CPU architecture. If you're unsure about the architecture of your VM, you can run

    uname -m

before proceeding to

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh

# Use the installer for Linux aarch64 if your VM is based on an Arm architecture.
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh -O ~/miniconda.sh

Next, you need to execute the script and accept the terms of service (ToS).

bash ~/miniconda.sh
source ~/.bashrc

Finally, clone the latest scikit-learn_bench from GitHub, create a virtual environment, and install the required Python libraries.

    git clone https://github.com/IntelPython/scikit-learn_bench.git
    cd scikit-learn_bench
    conda env create -n sklearn_bench -f envs/conda-env-sklearn.yml
    conda activate sklearn_bench

At this point, you should be able to run a benchmark using the sklbench module and a specific configuration:

    python -m sklbench --config configs/xgboost_example.json

By default, sklbench benchmarks both the standard scikit-learn implementations and their optimized counterparts provided by sklearnex (Intel's accelerated extension), or other supported frameworks like cuML or XGBoost, and logs results along with hardware and software metadata into result.json. You can customize the output file with --result-file, and include --report to produce an Excel report (report.xlsx). For a list of all supported options, see the documentation.

As discussed earlier, you can use numactl to pin a process and its child processes to specific CPU cores. Here's how to run sklbench with numactl, binding it to selected cores:

cores="0-3"
runid="$(date +%Y%m%d%H%M%S)"

numactl --physcpubind=$cores python3 -m sklbench \
  --config configs/common \
  --filters algorithm:library=sklearnex algorithm:device=cpu algorithm:estimator=RandomForestClassifier \
  --result-file result-${runid}.json

Interpreting Results and Best Practices

The report generator allows you to combine the result files of multiple runs.

python -m sklbench.report --result-files <result 1> <result 2>

The true metric for cloud decision-making is the cost per task, namely,

Cost per task = runtime in hours x hourly price.
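The formula is trivial, but it is worth encoding once so benchmark timings (in seconds) and per-hour VM prices combine consistently. A minimal helper; the price in the example is illustrative only, not a real GCP rate.

```python
def cost_per_task(runtime_seconds, hourly_price_usd):
    """Cost per task = runtime in hours x hourly price."""
    return (runtime_seconds / 3600.0) * hourly_price_usd

# Illustrative numbers only; look up current GCP pricing for real rates.
print(cost_per_task(1800, 2.00))  # 0.5 h on a $2.00/h VM -> 1.0
```

Comparing VM shapes on this metric, rather than raw runtime, is what reveals when a cheaper, slower instance actually wins.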

Real-world deployments rarely behave like single benchmark runs. To accurately model the cost per task, it's helpful to account for CPU boost behavior, cloud infrastructure variability, and memory topology, as they can all influence performance in ways that aren't captured by a one-off measurement. To better reflect actual runtime characteristics, I recommend starting with warm-up iterations to stabilize CPU frequency scaling. Then run each experiment multiple times to account for system noise and transient effects. Reporting the mean and standard deviation helps surface consistent trends, while using medians can be more robust when variance is high, especially in cloud environments where noisy neighbors or resource contention can skew averages. For reproducibility, it's important to pin package versions and use consistent VM image snapshots. Including the NUMA configuration in your results helps others understand memory locality effects, which can significantly impact performance. Tools like scikit-learn_bench automate many of these steps, making it easier to produce benchmarks that are both representative and repeatable.

If you found this article useful, please consider sharing it with your network. For more AI development how-to content, visit Intel® AI Development Resources.

    Acknowledgments

The author thanks Neal Dixon, Miriam Gonzales, Chris Liebert, and Rachel Novak for providing feedback on an earlier draft of this work.

