Machine Learning Still Matters
In an era of GPU supremacy, why do real-world business cases rely so heavily on classical machine learning and CPU-based training? The answer is that the data most critical to real-world business applications is still overwhelmingly tabular, structured, and relational: think fraud detection, insurance risk scoring, churn prediction, and operational telemetry. Empirical results (e.g., Grinsztajn et al., Why do tree-based models still outperform deep learning on typical tabular data? (2022), NeurIPS 2022 Track on Datasets and Benchmarks) show that in these domains random forest, gradient boosting, and logistic regression outperform neural nets in both accuracy and reliability. They also offer explainability, which is essential in regulated industries like banking and healthcare.
GPUs often lose their edge here due to data transfer latency (PCIe overhead) and the poor scaling of some tree-based algorithms. As a result, CPU-based training remains the most cost-effective choice for small-to-medium structured data workloads on cloud platforms.
In this article, I'll walk you through the steps for benchmarking traditional machine learning algorithms on Google Cloud Platform (GCP) CPU options, including the Intel® Xeon® 6 that recently became generally available. (Full disclosure: I'm affiliated with Intel as a Senior AI Software Solutions Engineer.)
By systematically evaluating runtime, scalability, and cost across algorithms, we can make evidence-based decisions about which approaches deliver the best trade-off between accuracy, speed, and operational cost.
Machine Configuration on Google Cloud
Go to console.cloud.google.com, set up your billing, and head to "Compute Engine." Then click "Create instance" to configure your virtual machine (VM). The figure below shows the C4 VM series powered by Intel® Xeon® 6 (code-named Granite Rapids) and 5th Gen Intel® Xeon® (code-named Emerald Rapids) CPUs.
Hyperthreading can introduce performance variability because two threads compete for the same core's execution resources. For consistent benchmark results, setting "vCPUs to core ratio" to 1 eliminates that variable; more on this in the next section.

Before creating the VM, increase the boot disk size from the left-hand panel; 200 GB will be more than enough to install the packages needed for this blog.

Non-Uniform Memory Access (NUMA) Awareness
Memory access is non-uniform on multi-core, multi-socket CPUs. This means the latency and bandwidth of memory operations depend on which CPU core is accessing which region of memory. If you don't control for NUMA, you're benchmarking the scheduler, not the CPU, and the results can appear inconsistent. Memory affinity eliminates exactly that problem by controlling which CPU cores access which memory regions. The Linux scheduler is aware of the platform's NUMA topology and attempts to improve performance by scheduling threads on processors in the same node as the memory being used, rather than randomly assigning work across the system. However, without explicit affinity controls, you can't guarantee consistent placement for reliable benchmarking.
Let's do a hands-on NUMA experiment with XGBoost and a synthetic dataset that's large enough to stress memory.
First, provision a VM that spans multiple NUMA nodes, SSH into the instance, and install the dependencies.
sudo apt update && sudo apt install -y python3-venv numactl
Then create and activate a Python virtual environment to install scikit-learn, numpy, and xgboost. Save the script below as xgb_bench.py.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from time import time

# 10M samples, 100 features
X, y = make_classification(n_samples=10_000_000, n_features=100, random_state=42)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "max_depth": 8,
    "nthread": 0,  # use all available threads
}

start = time()
xgb.train(params, dtrain, num_boost_round=100)
print("Elapsed:", time() - start, "seconds")
Next, run this script in three modes (baseline / numa0 / interleave). Repeat each experiment at least five times and report the mean and standard deviation. (This calls for another simple script!)
# Run without NUMA binding
python3 xgb_bench.py
# Run with NUMA binding to a single node
numactl --cpunodebind=0 --membind=0 python3 xgb_bench.py
When assigning tasks to specific physical cores, use the --physcpubind or -C option rather than --cpunodebind.
# Run with interleaved memory across nodes
numactl --interleave=all python3 xgb_bench.py
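To automate the repeats, here's a minimal driver sketch. It assumes xgb_bench.py prints a line of the form `Elapsed: <seconds> seconds`, as in the script above; the repeat count and the numactl invocations simply mirror the three modes:

```python
import re
import statistics
import subprocess

def summarize(times):
    """Return (mean, stdev) of a list of elapsed times in seconds."""
    return statistics.mean(times), statistics.stdev(times)

def run_mode(cmd, repeats=5):
    """Run a benchmark command repeatedly and collect the elapsed times."""
    times = []
    for _ in range(repeats):
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
        match = re.search(r"Elapsed:\s*([\d.]+)\s*seconds", out)
        times.append(float(match.group(1)))
    return summarize(times)

# The three modes from above; run_mode(...) returns (mean, stdev).
MODES = {
    "baseline": "python3 xgb_bench.py",
    "numa0": "numactl --cpunodebind=0 --membind=0 python3 xgb_bench.py",
    "interleave": "numactl --interleave=all python3 xgb_bench.py",
}
```

Call run_mode(MODES["numa0"]) and friends, then compare the summaries across modes.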
Which experiment had the smallest mean? How about the standard deviation? To interpret these numbers, consider that:
- A lower standard deviation for numa0 indicates more stable locality.
- A lower mean for numa0 vs. baseline suggests cross-node traffic was hurting you.
- If interleave narrows the gap vs. baseline, your workload is bandwidth-sensitive and benefits from spreading pages, at a potential cost to latency.
If none of these apply to a benchmark, the workload may be compute-bound (e.g., shallow trees, small dataset), or the VM might expose a single NUMA node.
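To confirm how many NUMA nodes your VM actually exposes, you can run numactl --hardware, or count the node directories under sysfs; this small sketch assumes the standard Linux sysfs layout:

```python
import glob
import os

def numa_node_count(sysfs_root="/sys/devices/system/node"):
    """Count NUMA nodes by listing node directories under sysfs."""
    return len(glob.glob(os.path.join(sysfs_root, "node[0-9]*")))

if __name__ == "__main__":
    # On a single-node VM this reports 1, and NUMA binding changes nothing.
    print("NUMA nodes:", numa_node_count())
```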
Choosing the Right Benchmarks
When benchmarking classical machine learning algorithms on CPUs, you can build your own testing framework, leverage existing benchmark suites, or use a hybrid approach if appropriate.
Existing test suites such as scikit-learn_bench and Phoronix Test Suite (PTS) are useful when you need standardized, reproducible results that others can validate and compare against. They work particularly well if you're evaluating well-established algorithms like random forest, SVM, or XGBoost, where standard datasets provide meaningful insights. Custom benchmarks excel at revealing implementation-specific performance characteristics. For instance, they can measure how different sparse matrix formats affect SVM training times, or how feature preprocessing pipelines impact overall throughput on your specific CPU architecture. The datasets you use directly influence what your benchmark reveals. Feel free to consult the official scikit-learn benchmarks for inspiration. Here's also a sample set that you can use to create a custom test.
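If you go the custom route, a minimal timing harness needs little more than the standard library. In this sketch, the warm-up and repeat counts are illustrative defaults (not values taken from any suite), and any training or preprocessing call can be dropped in as fn:

```python
import statistics
import time

def bench(fn, *args, warmup=2, repeats=5):
    """Time fn(*args): run warm-up iterations first, then report (mean, stdev)."""
    for _ in range(warmup):  # stabilize caches and CPU frequency scaling
        fn(*args)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

# Example with a stdlib workload; swap in an estimator's fit() call instead.
mean_s, std_s = bench(sorted, list(range(100_000, 0, -1)))
print(f"sorted: mean={mean_s * 1e3:.2f} ms, std={std_s * 1e3:.2f} ms")
```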
| Dataset | Size | Task | Source |
| --- | --- | --- | --- |
| Higgs | 11M rows | Binary classification | UCI ML Repo |
| Airline Delay | Variable | Multi-class classification | BTS |
| California Housing | 20K rows | Regression | sklearn.datasets.fetch_california_housing |
| Synthetic | Variable | Scaling tests | sklearn.datasets.make_classification |
Synthetic scaling datasets are especially useful for exposing differences in cache, memory bandwidth, and I/O.
In the rest of this blog, we illustrate how to run experiments using the open source scikit-learn_bench, which currently supports the scikit-learn, cuML, and XGBoost frameworks.
Installation and Benchmarking
Once the GCP VM is initialized, you can SSH into the instance and execute the commands below in your terminal.
sudo apt update && sudo apt upgrade -y
sudo apt install -y git wget numactl
To install Conda on a GCP VM, you'll need to account for the CPU architecture. If you're unsure about your VM's architecture, you can run uname -m before proceeding to
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
# Use the installer for Linux aarch64 if your VM is based on Arm architecture.
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh -O ~/miniconda.sh
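If you prefer to script the choice of installer, platform.machine() in Python reports the same string as uname -m; the mapping below covers just the two installers mentioned here:

```python
import platform

def installer_for(arch):
    """Return the matching Miniconda installer name, or None if unsupported."""
    return {
        "x86_64": "Miniconda3-latest-Linux-x86_64.sh",
        "aarch64": "Miniconda3-latest-Linux-aarch64.sh",
    }.get(arch)

# platform.machine() gives the same value as `uname -m` on Linux.
print(platform.machine(), "->", installer_for(platform.machine()))
```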
Next, you need to execute the script and accept the terms of service (ToS).
bash ~/miniconda.sh
source ~/.bashrc
Finally, clone the latest scikit-learn_bench from GitHub, create a virtual environment, and install the required Python libraries.
git clone https://github.com/IntelPython/scikit-learn_bench.git
cd scikit-learn_bench
conda env create -n sklearn_bench -f envs/conda-env-sklearn.yml
conda activate sklearn_bench
At this point, you should be able to run a benchmark using the sklbench module and a specific configuration:
python -m sklbench --config configs/xgboost_example.json
By default, sklbench benchmarks both the standard scikit-learn implementations and their optimized counterparts provided by sklearnex (Intel's accelerated extension), or other supported frameworks like cuML or XGBoost, and logs results along with hardware and software metadata into result.json. You can customize the output file with --result-file, and include --report to produce an Excel report (report.xlsx). For a list of all supported options, see the documentation.
As discussed earlier, you can use numactl to pin a process and its child processes to specific CPU cores. Here's how to run sklbench with numactl, binding it to selected cores:
cores="0-3"
export runid="$(date +%Y%m%d%H%M%S)"
numactl --physcpubind=$cores python3 -m sklbench \
    --config configs/common \
    --filters algorithm:library=sklearnex algorithm:device=cpu algorithm:estimator=RandomForestClassifier \
    --result-file result-${runid}.json
Interpreting Results and Best Practices
The report generator allows you to combine the result files of multiple runs.
python -m sklbench.report --result-files <result 1> <result 2>
The true metric for cloud decision-making is the cost per task, namely,
Cost per task = runtime in hours x hourly price.
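For example, with hypothetical runtimes and hourly prices (not actual GCP quotes):

```python
def cost_per_task(runtime_seconds, hourly_price_usd):
    """Cost per task = (runtime in hours) x (hourly price)."""
    return (runtime_seconds / 3600.0) * hourly_price_usd

# A pricier VM that finishes faster can still cost less per task.
print(f"${cost_per_task(1800, 2.00):.2f}")  # 0.5 h at $2.00/h -> $1.00
print(f"${cost_per_task(3600, 1.20):.2f}")  # 1.0 h at $1.20/h -> $1.20
```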
Real-world deployments rarely behave like single benchmark runs. To accurately model the cost per task, it's helpful to account for CPU boost behavior, cloud infrastructure variability, and memory topology, as they can all influence performance in ways that aren't captured by a one-off measurement. To better reflect actual runtime characteristics, I recommend starting with warm-up iterations to stabilize CPU frequency scaling. Then run each experiment multiple times to account for system noise and transient effects. Reporting the mean and standard deviation helps surface consistent trends, while using medians can be more robust when variance is high, especially in cloud environments where noisy neighbors or resource contention can skew averages. For reproducibility, it's important to pin package versions and use consistent VM image snapshots. Including the NUMA configuration in your results helps others understand memory locality effects, which can significantly impact performance. Tools like scikit-learn_bench automate many of these steps, making it easier to produce benchmarks that are both representative and repeatable.
If you found this article useful, please consider sharing it with your network. For more AI development how-to content, visit Intel® AI Development Resources.
Acknowledgments
The author thanks Neal Dixon, Miriam Gonzales, Chris Liebert, and Rachel Novak for providing feedback on an earlier draft of this work.
Resources