
    Accuracy Is Dead: Calibration, Discrimination, and Other Metrics You Actually Need

    By ProfitlyAI | July 15, 2025 | 8 Mins Read


    Accuracy is the metric that we, data scientists, cite the most, but it is also the most misleading one.

    It was long ago that we learned that models are developed for far more than just making predictions. We create models to make decisions, and that requires trust. And relying on accuracy alone is not enough.

    In this post, we’ll see why, and we’ll explore other alternatives that are more advanced and tailored to our needs. As always, we’ll follow a practical approach, with the end goal of diving deep into evaluation beyond standard metrics.

    Here’s the table of contents for today’s read:

    1. Setting Up the Models
    2. Classification: Beyond Accuracy
    3. Regression: Advanced Evaluation
    4. Conclusion

    Setting Up the Models

    Accuracy makes more sense for classification algorithms than for regression tasks… Hence, not all problems are measured in the same way.

    That’s the reason why I’ve decided to tackle both scenarios, the regression and the classification ones, separately by creating two different models.

    And they’ll be very simple ones, because their performance and usefulness isn’t what matters today:

    • Classification: Will a striker score in the next match?
    • Regression: How many goals will a player score?

    If you’re a recurring reader, I’m sure that the use of football examples didn’t come as a surprise.

    Note: Even though we won’t be using accuracy on our regression problem, and this post is meant to be more focused on that metric, I didn’t want to leave those cases behind. That’s why we’ll be exploring regression metrics too.

    Again, because we don’t care about the data or the performance, let me skip the whole preprocessing part and go straight to the models themselves:

    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingRegressor
    
    # Classification model
    model = LogisticRegression()
    model.fit(X_train_scaled, y_train)
    
    # Gradient boosting regressor
    model = GradientBoostingRegressor()
    model.fit(X_train_scaled, y_train)

    As you can see, we stick to simple models: logistic regression for the binary classification, and gradient boosting for the regression.

    Let’s check the metrics we’d usually check:

    from sklearn.metrics import accuracy_score
    
    # Classification
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f"Test accuracy: {accuracy:.2%}")

    The printed accuracy is 92.43%, which is honestly way higher than what I’d have expected. Is the model really that good?

    import numpy as np
    from sklearn.metrics import mean_squared_error
    
    # Regression
    y_pred = model.predict(X_test_scaled)
    
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    
    print(f"Test RMSE: {rmse:.4f}")

    I got an RMSE of 0.3059. Not that good. But is it enough to discard our regression model?

    We need to do better.

    Classification: Beyond Accuracy

    Too many data science projects stop at accuracy, which is often misleading, especially with imbalanced targets (e.g., scoring a goal is rare).

    To evaluate whether our model really predicts “Will this player perform?”, here are other metrics we should consider:

    • ROC-AUC: Measures the ability to rank positives above negatives. Insensitive to the threshold, but doesn’t care about calibration.
    • PR-AUC: The Precision-Recall curve is essential for rare events (e.g., scoring probability). It focuses on the positive class, which matters when positives are scarce.
    • Log Loss: Punishes overconfident wrong predictions. Ideal for comparing calibrated probabilistic outputs.
    • Brier Score: Measures the mean squared error between predicted probabilities and actual outcomes. Lower is better, and it’s interpretable as overall probability calibration.
    • Calibration Curves: A visual diagnostic to see whether predicted probabilities match observed frequencies.

    We won’t test all of them now, but let’s briefly touch upon ROC-AUC and Log Loss, probably the most used after accuracy.
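
    Still, for reference, here’s a minimal sketch (not from the original walkthrough) of how PR-AUC, the Brier score, and a calibration curve could be computed with scikit-learn, assuming y_proba holds the model’s predicted probability of the positive class:

    from sklearn.metrics import average_precision_score, brier_score_loss
    from sklearn.calibration import calibration_curve
    import matplotlib.pyplot as plt
    
    # Predicted probabilities of the positive class
    y_proba = model.predict_proba(X_test_scaled)[:, 1]
    
    pr_auc = average_precision_score(y_test, y_proba)  # PR-AUC (average precision)
    brier = brier_score_loss(y_test, y_proba)          # Brier score
    
    # Calibration curve: observed frequency vs. mean predicted probability per bin
    frac_pos, mean_pred = calibration_curve(y_test, y_proba, n_bins=10)
    plt.plot(mean_pred, frac_pos, marker="o", label="model")
    plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
    plt.legend()
    plt.show()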

    ROC-AUC

    ROC-AUC, or Receiver Operating Characteristic – Area Under the Curve, is a popular metric that consists in measuring the area under the ROC curve, a curve that plots the True Positive Rate (TPR) against the False Positive Rate (FPR).

    Simply put, the ROC-AUC score (ranging from 0 to 1) sums up how well a model can produce relative scores to discriminate between positive and negative instances across all classification thresholds.

    A score of 0.5 indicates random guessing, and a 1 indicates perfect performance.

    Computing it in Python is easy:

    from sklearn.metrics import roc_auc_score
    
    # Predicted probabilities of the positive class
    y_proba = model.predict_proba(X_test_scaled)[:, 1]
    
    roc_auc = roc_auc_score(y_test, y_proba)

    Here, y_test contains the real labels and y_proba contains our model’s predicted probabilities. In my case, the score is 0.7585, which is relatively low compared to the accuracy. But how can this be possible, if we got an accuracy above 90%?

    Context: We’re trying to predict whether a player will score in a match or not. The “problem” is that this is highly imbalanced data: most players won’t score in a match, so our model learns that predicting a 0 is the most likely option, without really learning anything about the data itself.

    It can’t capture the minority class properly, and accuracy simply doesn’t show us that.
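
    A quick sanity check along these lines (a sketch, assuming y_test is a 1-D array of 0/1 labels) makes the imbalance, and the trap behind the 92% accuracy, explicit:

    import numpy as np
    
    positive_rate = np.mean(y_test)  # share of cases where a goal was actually scored
    print(f"Share of positive cases: {positive_rate:.2%}")
    print(f"Accuracy of always predicting 0: {1 - positive_rate:.2%}")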

    Log Loss

    The logarithmic loss, cross-entropy or, simply, log loss, is used to evaluate the performance with probability outputs. It measures the difference between the predicted probabilities and the actual (true) values, logarithmically.

    Again, we can do this with a one-liner in Python:

    from sklearn.metrics import log_loss
    
    logloss = log_loss(y_test, y_proba)

    As you’ve probably guessed, the lower the value, the better. A 0 would be the perfect model. In my case, I got a 0.2345.

    This one is also affected by class imbalance: log loss penalizes confident wrong predictions very harshly and, since our model predicts a 0 most of the time, those cases in which a goal was indeed scored affect the final score.
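
    To make that penalty concrete, here’s a tiny made-up example (the numbers are illustrative, not from this dataset) comparing a cautious wrong prediction with a confidently wrong one on the only positive case:

    from sklearn.metrics import log_loss
    
    y_true = [0, 0, 0, 1]
    cautious      = [0.1, 0.1, 0.1, 0.6]   # modest probability on the scored goal
    overconfident = [0.1, 0.1, 0.1, 0.01]  # confidently wrong on the scored goal
    
    print(log_loss(y_true, cautious))       # ~0.21
    print(log_loss(y_true, overconfident))  # ~1.23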

    Regression: Advanced Evaluation

    Accuracy makes no sense in regression, but we have a handful of interesting metrics to evaluate the problem of how many goals a player will score in a given match.

    When predicting continuous outcomes (e.g., expected minutes, match ratings, fantasy points), simple RMSE/MAE is a start, but we can go much further.

    Other metrics and checks:

    • R²: Represents the proportion of the variance in the target variable explained by the model.
    • RMSLE: Penalizes underestimates more and is useful if values vary exponentially (e.g., fantasy points).
    • MAPE / SMAPE: Percentage errors, but beware of divide-by-zero issues.
    • Quantile Loss: Train models to predict intervals (e.g., 10th, 50th, 90th percentile outcomes).
    • Residual vs. Predicted (plot): Check for heteroscedasticity (see the sketch right after this list).
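
    Here’s what that residual plot could look like with matplotlib (a sketch, reusing y_test and y_pred from the regression model above):

    import matplotlib.pyplot as plt
    
    residuals = y_test - y_pred
    plt.scatter(y_pred, residuals, alpha=0.4)
    plt.axhline(0, color="red", linestyle="--")
    plt.xlabel("Predicted goals")
    plt.ylabel("Residual (actual - predicted)")
    plt.title("Residuals vs. predicted values")
    plt.show()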

    Again, let’s focus on a subset of them.

    R² Score

    Also called the coefficient of determination, it compares a model’s error to the baseline error. A score of 1 is a perfect fit, a 0 means that it only predicts the mean, and a value below 0 means that it’s worse than predicting the mean.

    from sklearn.metrics import r2_score
    
    r2 = r2_score(y_test, y_pred)

    I got a value of 0.0557, which is pretty close to 0… Not good.
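
    To tie that number back to the definition above, R² can also be computed by hand as one minus the ratio between the model’s squared error and the squared error of always predicting the mean (a sketch, reusing y_test and y_pred):

    import numpy as np
    
    ss_res = np.sum((y_test - y_pred) ** 2)           # model's squared error
    ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)  # error of always predicting the mean
    r2_manual = 1 - ss_res / ss_tot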

    RMSLE

    The Root Mean Squared Logarithmic Error, or RMSLE, measures the square root of the average squared difference between the log-transformed predicted and actual values. This metric is useful when:

    • We want to penalize under-prediction more heavily than over-prediction.
    • Our target variable is skewed (it reduces the impact of large outliers).

    from sklearn.metrics import mean_squared_log_error
    
    rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))

    I got a 0.19684, which means that my average prediction error is about 0.2 goals. It’s not that big but, given that our target variable is a value between 0 and 4 and highly skewed towards 0…
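
    As for the asymmetry mentioned above, a tiny made-up example (not from this dataset) shows that, for the same absolute error, under-predicting is penalized more than over-predicting:

    import numpy as np
    from sklearn.metrics import mean_squared_log_error
    
    actual = np.array([2.0])
    under  = np.array([1.0])  # under-predict by one goal
    over   = np.array([3.0])  # over-predict by one goal
    
    print(np.sqrt(mean_squared_log_error(actual, under)))  # ~0.405
    print(np.sqrt(mean_squared_log_error(actual, over)))   # ~0.288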

    Quantile Loss

    Also called Pinball Loss, it can be used for quantile regression models to evaluate how well our predicted quantiles perform. If we build a quantile model (GradientBoostingRegressor with quantile loss), we can test it as follows:

    from sklearn.metrics import mean_pinball_loss
    
    alpha = 0.9
    q_loss = mean_pinball_loss(y_test, y_pred_quantile, alpha=alpha)
    

    Here, with alpha=0.9 we’re trying to predict the 90th percentile. My quantile loss is 0.0644, which is very small in relative terms (~1.6% of my target variable’s range).

    However, distribution matters: most of our y_test values are 0, and we need to interpret the result as “on average, our model’s error in capturing the upper tail is very low“.

    It’s especially impressive given the 0-heavy target.

    But, because most outcomes are 0, other metrics like the ones we saw and mentioned above should be used to assess whether our model is in fact performing well or not.
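
    For completeness, here’s a minimal sketch of how such a quantile model could be trained to produce y_pred_quantile, assuming it reuses the same training data as the earlier regressor:

    from sklearn.ensemble import GradientBoostingRegressor
    
    # 90th-percentile model: gradient boosting with the pinball (quantile) loss
    quantile_model = GradientBoostingRegressor(loss="quantile", alpha=0.9)
    quantile_model.fit(X_train_scaled, y_train)
    
    y_pred_quantile = quantile_model.predict(X_test_scaled)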

    Conclusion

    Building predictive models goes far beyond simply achieving “good accuracy.”

    For classification tasks, it’s essential to take into account imbalanced data, probability calibration, and real-world use cases like pricing or risk management.

    For regression, the goal isn’t just minimizing error but understanding uncertainty, which is vital if your predictions inform strategy or trading decisions.

    Ultimately, true value lies in:

    • Carefully curated, temporally valid features.
    • Advanced evaluation metrics tailored to the problem.
    • Clear, well-visualized comparisons.

    If you get these right, you’re not building “just another model.” You’re delivering robust, decision-ready tools. And the metrics we explored here are just the entry point.


