
    How To Build a Benchmark for Your Models

    By ProfitlyAI | May 15, 2025 | 9 Mins Read


    I’ve been a data science consultant for the past three years, and I’ve had the opportunity to work on multiple projects across various industries. Yet I noticed one common denominator among most of the clients I worked with:

    They rarely have a clear idea of the project objective.

    This is one of the main obstacles data scientists face, especially now that Gen AI is taking over every domain.

    But let’s suppose that after some back and forth, the objective becomes clear. We managed to pin down a specific question to answer. For example:

    I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn”

    Well, now what? Easy, let’s start building some models!

    Wrong!

    If having a clear objective is rare, having a reliable benchmark is even rarer.

    In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client.

    In this blog post, I’ll explain:

    • What a benchmark is,
    • Why it is important to have a benchmark,
    • How I would build one using an example scenario and
    • Some potential drawbacks to keep in mind

    What is a benchmark?

    A benchmark is a standardized way to evaluate the performance of a model. It provides a reference point against which new models can be compared.

    A benchmark needs two key components to be considered complete:

    1. A set of metrics to evaluate the performance
    2. A set of simple models to use as baselines

    The concept at its core is simple: every time I develop a new model, I compare it against both previous versions and the baseline models. This ensures improvements are real and tracked.

    It is essential to understand that this baseline shouldn’t be model-specific or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case.

    If I encounter a new dataset with the same business objective, this benchmark should be a reliable reference point.


    Why building a benchmark is important

    Now that we’ve defined what a benchmark is, let’s dive into why I believe it’s worth spending an extra project week on the development of a strong benchmark.

    1. Without a Benchmark you’re aiming for perfection: If you are working without a clear reference point, any result will lose meaning. “My model has a MAE of 30.000.” Is that good? IDK! Maybe with a simple mean you’d get a MAE of 25.000. By comparing your model to a baseline, you can measure both performance and improvement.
    2. Improves Communication with Clients: Clients and business teams might not immediately understand the standard output of a model. However, by engaging them with simple baselines from the start, it becomes easier to demonstrate improvements later. In many cases benchmarks could come directly from the business in different shapes or forms.
    3. Helps in Model Selection: A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering.
    4. Model Drift Detection and Monitoring: Models can degrade over time. By having a benchmark you might be able to intercept drifts early by comparing new model outputs against past benchmarks and baselines.
    5. Consistency Between Different Datasets: Datasets evolve. By having a fixed set of metrics and models you ensure that performance comparisons remain valid over time.

    With a clear benchmark, every step in the model development will provide immediate feedback, making the whole process more intentional and data-driven.
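
To make the first point concrete, here is a minimal sketch of judging a model's MAE against a "predict the training mean" baseline. The numbers and the names `mae`, `baseline_pred` and `model_pred` are made up for illustration and do not come from the post's dataset.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length arrays."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Illustrative target values only (not from the churn dataset)
y_train = np.array([100.0, 120.0, 90.0, 110.0])
y_test = np.array([105.0, 95.0, 115.0])

# Baseline: always predict the mean of the training target
baseline_pred = np.full(len(y_test), y_train.mean())

# Hypothetical output of the model we want to judge
model_pred = np.array([100.0, 100.0, 110.0])

baseline_mae = mae(y_test, baseline_pred)
model_mae = mae(y_test, model_pred)

# A positive improvement means the model beats the naive baseline
improvement = baseline_mae - model_mae
print(f"baseline MAE={baseline_mae:.2f}, model MAE={model_mae:.2f}, improvement={improvement:.2f}")
```

The model's MAE only becomes meaningful once it sits next to the baseline's MAE: the same number could be a clear win or a clear loss depending on what the naive approach achieves.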


    How I would build a benchmark

    I hope I’ve convinced you of the importance of having a benchmark. Now, let’s actually build one.

    Let’s start from the business question we presented at the very beginning of this blog post:

    I want to classify my customers into two groups according to their probability to churn: “high likelihood to churn” and “low likelihood to churn”

    For simplicity, I’ll assume no additional business constraints, but in real-world scenarios, constraints often exist.

    For this example, I’m using this dataset (CC0: Public Domain). The data contains some attributes from a company’s customer base (e.g., age, sex, number of products, …) along with their churn status.

    Now that we have something to work on, let’s build the benchmark:

    1. Defining the metrics

    We are dealing with a churn use case; in particular, this is a binary classification problem. Thus the main metrics that we could use are:

    • Precision: Percentage of correctly predicted churners among all predicted churners
    • Recall: Percentage of actual churners correctly identified
    • F1 score: Balances precision and recall
    • True Positives, False Positives, True Negatives and False Negatives

    These are some of the “simple” metrics that could be used to evaluate the output of a model.
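
As a rough sketch of how these metrics relate, the four confusion counts can be written as small functions and precision, recall and F1 derived from them. The function names `tp`, `tn`, `fp` and `fn` mirror the names passed to the benchmark later in the post, but their definitions are not shown there, so the bodies below are an assumption (as are the helper `_count` and the toy labels).

```python
import numpy as np

def _count(y_true, y_pred, true_label, pred_label):
    """Count samples with the given (actual, predicted) label pair."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return int(np.sum((y_true == true_label) & (y_pred == pred_label)))

def tp(y_true, y_pred):
    return _count(y_true, y_pred, 1, 1)  # churners flagged as churners

def tn(y_true, y_pred):
    return _count(y_true, y_pred, 0, 0)  # non-churners left alone

def fp(y_true, y_pred):
    return _count(y_true, y_pred, 0, 1)  # non-churners flagged as churners

def fn(y_true, y_pred):
    return _count(y_true, y_pred, 1, 0)  # churners that were missed

def precision(y_true, y_pred):
    predicted_pos = tp(y_true, y_pred) + fp(y_true, y_pred)
    return tp(y_true, y_pred) / predicted_pos if predicted_pos else 0.0

def recall(y_true, y_pred):
    actual_pos = tp(y_true, y_pred) + fn(y_true, y_pred)
    return tp(y_true, y_pred) / actual_pos if actual_pos else 0.0

def f1(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Toy labels: 1 = churn, 0 = no churn
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
counts = (tp(y_true, y_pred), tn(y_true, y_pred), fp(y_true, y_pred), fn(y_true, y_pred))
print(counts, precision(y_true, y_pred), recall(y_true, y_pred), f1(y_true, y_pred))
```

All functions take `y_true` and `y_pred` as keyword-compatible arguments, so they can be called the same way as scikit-learn's metric functions.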

    However, this is not an exhaustive list, and standard metrics aren’t always enough. In many use cases, it might be useful to build custom metrics.

    Let’s assume that in our business case the customers labeled as “high likelihood to churn” are offered a discount. This creates:

    • A cost ($250) when offering the discount to a non-churning customer
    • A profit ($1000) when retaining a churning customer

    Following this definition, we can build a custom metric that will be crucial in our scenario:

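    import numpy as np

    # Defining the business case-specific reference metric
    def financial_gain(y_true, y_pred):
        # Each false positive costs a $250 discount given to a non-churner
        loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250
        # Each true positive is a retained churner worth $1000
        gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000
        return gain_from_tp - loss_from_fp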

    When you are building business-driven metrics, these are usually the most relevant. Such metrics could take any shape or form: financial goals, minimum requirements, percentage of coverage and more.
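
A small worked example may help show why such a metric can disagree with the standard ones. The sketch below re-states the financial gain metric with the $250 and $1000 figures from above and applies it to two made-up prediction vectors; the constant names and the `cautious` and `aggressive` arrays are illustrative assumptions.

```python
import numpy as np

COST_PER_FALSE_POSITIVE = 250   # discount wasted on a non-churner
GAIN_PER_TRUE_POSITIVE = 1000   # value of a retained churner

def financial_gain(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    loss_from_fp = np.sum((y_pred == 1) & (y_true == 0)) * COST_PER_FALSE_POSITIVE
    gain_from_tp = np.sum((y_pred == 1) & (y_true == 1)) * GAIN_PER_TRUE_POSITIVE
    return int(gain_from_tp - loss_from_fp)

# Toy ground truth: three churners, two non-churners
y_true = np.array([1, 0, 1, 0, 1])

# Flags a single customer: 1 true positive, 0 false positives
cautious = np.array([1, 0, 0, 0, 0])

# Flags everyone: 3 true positives, 2 false positives
aggressive = np.array([1, 1, 1, 1, 1])

print(financial_gain(y_true, cautious))    # 1 * 1000 - 0 * 250 = 1000
print(financial_gain(y_true, aggressive))  # 3 * 1000 - 2 * 250 = 2500
```

Here the cautious predictions have perfect precision but earn less than flagging everyone, because under these assumed costs a missed churner is far more expensive than a wasted discount.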

    2. Defining the benchmarks

    Now that we’ve defined our metrics, we can define a set of baseline models to be used as a reference.

    In this phase, you should define a list of simple-to-implement models in their simplest possible setup. There is no reason at this stage to spend time and resources on optimizing these models. My mindset is:

    If I had 15 minutes, how would I implement this model?

    In later phases, you can add more baseline models as the project proceeds.

    In this case, I will use the following models:

    • Random Model: Assigns labels randomly
    • Majority Model: Always predicts the most frequent class
    • Simple XGB
    • Simple KNN

    import numpy as np
    import xgboost as xgb
    from sklearn.neighbors import KNeighborsClassifier

    class BinaryMean():
        # Random model: draws labels with probability equal to the training churn rate
        @staticmethod
        def run_benchmark(df_train, df_test):
            np.random.seed(21)
            churn_rate = df_train['y'].mean()
            return np.random.choice(a=[1, 0], size=len(df_test), p=[churn_rate, 1 - churn_rate])

    class SimpleXbg():
        # XGBoost with default hyperparameters, trained on the numeric columns only
        @staticmethod
        def run_benchmark(df_train, df_test):
            model = xgb.XGBClassifier()
            model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
            return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))

    class MajorityClass():
        # Always predicts the most frequent class seen in training
        @staticmethod
        def run_benchmark(df_train, df_test):
            majority_class = df_train['y'].mode()[0]
            return np.full(len(df_test), majority_class)

    class SimpleKNN():
        # k-nearest neighbours with default settings, trained on the numeric columns only
        @staticmethod
        def run_benchmark(df_train, df_test):
            model = KNeighborsClassifier()
            model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
            return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))

    Again, as in the case of the metrics, we can build custom benchmarks.

    Let’s assume that in our business case the marketing team contacts every client who is:

    • Over 50 y/o and
    • Not active anymore

    Following this rule we can build this model:

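    # Defining the business case-specific benchmark
    class BusinessBenchmark():
        @staticmethod
        def run_benchmark(df_train, df_test):
            # Rule-based model: df_train is unused, nothing is learned from data
            df = df_test.copy()
            df.loc[:,'y_hat'] = 0
            # Flag inactive customers aged 50 or more (the age threshold is inclusive here)
            df.loc[(df['IsActiveMember'] == 0) & (df['Age'] >= 50), 'y_hat'] = 1
            return df['y_hat']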
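
To see what this rule produces, here is a small sketch that applies the same condition to a made-up table. The column names `Age` and `IsActiveMember` come from the code above; the function `business_rule_predictions` and the four example customers are hypothetical.

```python
import pandas as pd

def business_rule_predictions(df):
    """Return 1 for inactive customers aged 50 or more, else 0."""
    flagged = (df["IsActiveMember"] == 0) & (df["Age"] >= 50)
    return flagged.astype(int)

# Four invented customers covering both sides of each condition
customers = pd.DataFrame({
    "Age": [62, 50, 49, 71],
    "IsActiveMember": [0, 0, 0, 1],
})

preds = business_rule_predictions(customers)
print(preds.tolist())
```

Only the first two customers satisfy both conditions. The third is inactive but under the age threshold, and the fourth is old enough but still active.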

    Running the benchmark

    To run the benchmark I will use the following class. The entry point is the method compare_pred_with_benchmark() that, given a prediction, runs all the models and calculates all the metrics.

    import numpy as np

    class ChurnBinaryBenchmark():
        def __init__(
            self,
            metrics=None,
            benchmark_models=None,
        ):
            # None defaults avoid sharing one mutable list across instances
            self.metrics = metrics or []
            self.benchmark_models = benchmark_models or []

        def compare_pred_with_benchmark(
            self,
            df_train,
            df_test,
            my_predictions,
        ):
            # Metrics for the candidate predictions
            output_metrics = {
                'Prediction': self._calculate_metrics(df_test['y'], my_predictions)
            }
            dct_benchmarks = {}

            # Run every baseline model and score it with the same metrics
            for model in self.benchmark_models:
                dct_benchmarks[model.__name__] = model.run_benchmark(df_train=df_train, df_test=df_test)
                output_metrics[f'Benchmark - {model.__name__}'] = self._calculate_metrics(df_test['y'], dct_benchmarks[model.__name__])

            return output_metrics

        def _calculate_metrics(self, y_true, y_pred):
            # Each metric is called with keyword arguments y_true and y_pred
            return {getattr(func, '__name__', 'Unknown'): func(y_true=y_true, y_pred=y_pred) for func in self.metrics}

    Now all we need is a prediction. For this example, I did some quick feature engineering and some hyperparameter tuning.

    The last step is just to run the benchmark:

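    import pandas as pd
    from sklearn.metrics import f1_score, precision_score, recall_score

    # tp, tn, fp and fn are assumed to be helper functions returning confusion-matrix
    # counts; df_train, df_test and preds come from the previous steps.
    binary_benchmark = ChurnBinaryBenchmark(
        metrics=[f1_score, precision_score, recall_score, tp, tn, fp, fn, financial_gain],
        benchmark_models=[BinaryMean, SimpleXbg, MajorityClass, SimpleKNN, BusinessBenchmark]
    )

    res = binary_benchmark.compare_pred_with_benchmark(
        df_train=df_train,
        df_test=df_test,
        my_predictions=preds,
    )

    # Rows are metrics, columns are the prediction and each benchmark model
    pd.DataFrame(res)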

    Benchmark metrics comparison | Image by Author

    This generates a comparison table of all models across all metrics. Using this table, it is possible to draw concrete conclusions about the model’s predictions and make informed decisions on the next steps of the process.
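
One possible way to read such a table is to compare the prediction against the best baseline on each metric. The sketch below assumes a result dictionary shaped like the one returned by compare_pred_with_benchmark(); every number in it is invented for illustration and none are results from the post.

```python
import pandas as pd

# Invented values, shaped like the output of compare_pred_with_benchmark()
res = {
    "Prediction": {"f1_score": 0.60, "financial_gain": 120000},
    "Benchmark - MajorityClass": {"f1_score": 0.00, "financial_gain": 0},
    "Benchmark - BusinessBenchmark": {"f1_score": 0.35, "financial_gain": 90000},
}

# Outer keys become columns (models), inner keys become rows (metrics)
table = pd.DataFrame(res)
benchmarks = table.drop(columns="Prediction")

# Assumes higher is better for every metric listed here;
# counts such as false positives would need the opposite comparison
summary = pd.DataFrame({
    "prediction": table["Prediction"],
    "best_benchmark": benchmarks.max(axis=1),
    "best_benchmark_name": benchmarks.idxmax(axis=1),
})
summary["lift"] = summary["prediction"] - summary["best_benchmark"]
print(summary)
```

A positive lift on the business metric is the number most worth showing to the client, since it states the improvement over what the simplest alternative would already deliver.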


    Some drawbacks

    As we’ve seen, there are plenty of reasons why it is useful to have a benchmark. However, even though benchmarks are incredibly useful, there are some pitfalls to watch out for:

    1. Non-Informative Benchmark: When the metrics or models are poorly defined, the marginal impact of having a benchmark decreases. Always define meaningful baselines.
    2. Misinterpretation by Stakeholders: Communication with the client is essential, so it is important to state clearly what the metrics are measuring. The best model might not be the best on all the defined metrics.
    3. Overfitting to the Benchmark: You might end up creating features that are too specific, which beat the benchmark but don’t generalize well in prediction. Don’t focus on beating the benchmark, but on creating the best possible solution to the problem.
    4. Change of Objective: The defined objectives might change, due to miscommunication or changes in plans. Keep your benchmark flexible so it can adapt when needed.

    Final thoughts

    Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without evidence and ensure that every iteration brings real value.

    They also act as a communication tool, making it easier to explain progress to clients. Instead of just presenting numbers, you can show clear comparisons that highlight improvements.

    Here you can find a notebook with a full implementation from this blog post.


