
When 50/50 Isn’t Optimal: Debunking Even Rebalancing for an Outdated Problem

By ProfitlyAI · July 24, 2025


You may be training a model for spam detection. Your dataset has many more positives than negatives, so you invest countless hours of work rebalancing it to a 50/50 ratio. Now you are satisfied, because you have addressed the class imbalance. What if I told you that 60/40 could have been not only sufficient, but even better?

In most machine learning classification applications, the number of instances of one class outnumbers that of the other classes. This slows down learning [1] and can potentially induce biases in the trained models [2]. The most widely used methods to address this rely on a simple prescription: finding a way to give all classes the same weight. Most often, this is done through simple techniques such as giving more importance to minority class examples (reweighting), removing majority class examples from the dataset (undersampling), or including minority class instances more than once (oversampling).
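
To make these three strategies concrete, here is a minimal sketch in Python with scikit-learn; the toy dataset and the choice of logistic regression are my own assumptions for illustration, not part of the original discussion.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Toy imbalanced binary problem: class 0 is the ~10% minority.
    X, y = make_classification(n_samples=5000, weights=[0.1, 0.9], random_state=0)
    rng = np.random.default_rng(0)
    min_idx = np.flatnonzero(y == 0)
    maj_idx = np.flatnonzero(y == 1)

    # Reweighting: keep all the data, give each class the same total weight.
    clf_rw = LogisticRegression(class_weight="balanced").fit(X, y)

    # Undersampling: drop majority examples until the classes match.
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([min_idx, keep])
    clf_us = LogisticRegression().fit(X[idx], y[idx])

    # Oversampling: repeat minority examples until the classes match.
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    clf_os = LogisticRegression().fit(X[idx], y[idx])

All three fits enforce an effective 50/50 class weighting; the rest of this article asks whether that target is the right one.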

The validity of these methods is often debated, with both theoretical and empirical work indicating that the best solution depends on your specific application [3]. However, there is a hidden assumption that is seldom discussed and too often taken for granted: is rebalancing even a good idea? These methods work, to some extent, so the answer is yes. But should we fully rebalance our datasets? To keep things simple, consider a binary classification problem. Should we rebalance our training data so that each class makes up 50%? Intuition says yes, and intuition is what has guided practice until now. In this case, intuition is wrong. For intuitive reasons.

What Do We Mean by ‘Training Imbalance’?

Before we delve into how and why 50% is not the optimal training imbalance in binary classification, let us define the relevant quantities. We call n₀ the number of instances of one class (usually the minority class), and n₁ that of the other class. The total number of data instances in the training set is then n = n₀ + n₁. The quantity we analyze here is the training imbalance,

ρ⁽ᵗʳᵃⁱⁿ⁾ = n₀/n.

Evidence That 50% Is Suboptimal

Initial evidence comes from empirical work on random forests. Kamalov and collaborators measured the optimal training imbalance, ρ⁽ᵒᵖᵗ⁾, on 20 datasets [4]. They find that its value varies from problem to problem, but conclude that it is roughly ρ⁽ᵒᵖᵗ⁾ = 43%. In other words, according to their experiments, you want slightly more majority than minority class examples. This is still not the full story, though. If you are aiming for optimal models, do not stop here and simply set your ρ⁽ᵗʳᵃⁱⁿ⁾ to 43%.

In fact, theoretical work by Pezzicoli et al. [5] showed this year that the optimal training imbalance is not a universal value valid for all applications. It is not 50%, and it is not 43%. The optimal imbalance varies: it can sometimes be smaller than 50%, as Kamalov and collaborators measured, and other times larger. The specific value of ρ⁽ᵒᵖᵗ⁾ depends on the details of each classification problem. One way to find ρ⁽ᵒᵖᵗ⁾ is to train the model for several values of ρ⁽ᵗʳᵃⁱⁿ⁾ and measure the resulting performance. That could, for example, look like this:

Image by author
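
A minimal sketch of such a scan, assuming undersampling of the majority class and balanced accuracy as the metric (both are my choices; any resampling scheme and metric could be substituted):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=20000, weights=[0.2, 0.8], random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    rng = np.random.default_rng(0)
    min_idx = np.flatnonzero(y_tr == 0)  # class 0 is the minority
    maj_idx = np.flatnonzero(y_tr == 1)

    for rho in [0.30, 0.40, 0.43, 0.50, 0.60]:
        # rho = n0 / (n0 + n1)  =>  keep n1 = n0 * (1 - rho) / rho majority points.
        n_maj = min(int(len(min_idx) * (1 - rho) / rho), len(maj_idx))
        keep = rng.choice(maj_idx, size=n_maj, replace=False)
        idx = np.concatenate([min_idx, keep])
        clf = LogisticRegression().fit(X_tr[idx], y_tr[idx])
        acc = balanced_accuracy_score(y_val, clf.predict(X_val))
        print(f"rho(train) = {rho:.2f} -> balanced accuracy = {acc:.3f}")

Plotting the printed scores against ρ⁽ᵗʳᵃⁱⁿ⁾ yields a curve like the one above, whose maximum is the empirical ρ⁽ᵒᵖᵗ⁾.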

Although the exact patterns determining ρ⁽ᵒᵖᵗ⁾ are still unclear, it seems that when data is abundant compared to the model size, the optimal imbalance is smaller than 50%, as in Kamalov’s experiments. However, many other factors, from how intrinsically rare minority instances are to how noisy the training dynamics is, come together to set the optimal value of the training imbalance, and to determine how much performance is lost when one trains away from ρ⁽ᵒᵖᵗ⁾.

Why Perfect Balance Isn’t Always Best

As we said, the answer is actually intuitive: since different classes have different properties, there is no reason why both classes should carry the same amount of information. In fact, Pezzicoli’s team proved that they generally do not. Therefore, to infer the best decision boundary we may need more instances of one class than of the other. Pezzicoli’s work, set in the context of anomaly detection, provides a simple and insightful example.

Let us assume that the data comes from a multivariate Gaussian distribution, and that we label all the points to the right of a decision boundary as anomalies. In 2D, it could look like this:

Image by author, inspired by [5]
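
For concreteness, here is one way to generate toy data like in this figure; the Gaussian parameters and the threshold are my own choices, not those of [5].

    import numpy as np

    rng = np.random.default_rng(0)

    # Points from a standard 2D Gaussian; everything to the right of a
    # vertical decision boundary at x = 1.5 is labelled an anomaly (class 0).
    X = rng.standard_normal((2000, 2))
    y = np.where(X[:, 0] > 1.5, 0, 1)

    print(f"{(y == 0).sum()} anomalies out of {len(y)} points")  # rare by construction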

The dashed line is our decision boundary, and the points to its right are the n₀ anomalies. Let us now rebalance our dataset to ρ⁽ᵗʳᵃⁱⁿ⁾ = 0.5. To do so, we need to find more anomalies. Since anomalies are rare, the ones we are most likely to find lie close to the decision boundary. Already by eye, the situation is strikingly clear:

Image by author, inspired by [5]

The anomalies, in yellow, are stacked along the decision boundary, and are therefore more informative about its position than the blue points. This might lead one to think that it is better to privilege minority class points. On the other hand, the anomalies only cover one side of the decision boundary, so once we have enough minority class points, it can become convenient to invest in more majority class points in order to better cover the other side of the boundary. As a consequence of these two competing effects, ρ⁽ᵒᵖᵗ⁾ is generally not 50%, and its exact value is problem dependent.

The Root Cause Is Class Asymmetry

Pezzicoli’s theory shows that the optimal imbalance is generally different from 50% because different classes have different properties. However, it analyzes only one source of diversity between classes: outlier behavior. Yet, as shown for example by Sarao-Mannelli and coauthors [6], there are many other phenomena, such as the presence of subgroups within classes, that can produce a similar effect. It is the interplay of a very large number of effects shaping the diversity between classes that determines the optimal imbalance for a specific problem. Until we have a theory that treats all sources of asymmetry in the data together (including those induced by how the model architecture processes them), we cannot know the optimal training imbalance of a dataset beforehand.

Key Takeaways & What You Can Do Differently

If until now you have been rebalancing your binary datasets to 50%, you were doing well, but most likely not the best you could. Although we still lack a theory that can tell us what the optimal training imbalance should be, you now know that it is likely not 50%. The good news is that such a theory is on the way: machine learning theorists are actively working on the topic. In the meantime, you can treat ρ⁽ᵗʳᵃⁱⁿ⁾ as a hyperparameter to tune, just like any other, and rebalance your data in the most efficient way, as in the sketch below. So before your next model training run, ask yourself: is 50/50 really optimal? Try tuning your class imbalance; your model’s performance might surprise you.
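
For example, assuming the third-party imbalanced-learn package, one can place a resampler inside a pipeline and grid-search the training imbalance like any other hyperparameter; the grid values and the estimator below are illustrative.

    import numpy as np
    from imblearn.pipeline import Pipeline
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=20000, weights=[0.1, 0.9], random_state=0)

    pipe = Pipeline([
        ("sampler", RandomUnderSampler(random_state=0)),
        ("clf", LogisticRegression()),
    ])

    # RandomUnderSampler's float sampling_strategy is the minority/majority
    # ratio r = rho / (1 - rho), so convert the desired rho(train) values.
    rhos = np.array([0.30, 0.40, 0.43, 0.50])
    param_grid = {"sampler__sampling_strategy": list(rhos / (1 - rhos))}

    search = GridSearchCV(pipe, param_grid, scoring="balanced_accuracy", cv=5)
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))

Because the sampler only runs at fit time inside each cross-validation fold, the validation folds keep their natural imbalance, which is what you want when estimating real-world performance.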

    References

    [1] E. Francazi, M. Baity-Jesi, and A. Lucchi, A theoretical analysis of the learning dynamics under class imbalance (2023), ICML 2023

[2] K. Ghosh, C. Bellinger, R. Corizzo, P. Branco, B. Krawczyk, and N. Japkowicz, The class imbalance problem in deep learning (2024), Machine Learning, 113(7), 4845–4901

    [3] E. Loffredo, M. Pastore, S. Cocco and R. Monasson, Restoring balance: principled under/oversampling of data for optimal classification (2024), ICML 2024

    [4] F. Kamalov, A.F. Atiya and D. Elreedy, Partial resampling of imbalanced data (2022), arXiv preprint arXiv:2207.04631

[5] F.S. Pezzicoli, V. Ros, F.P. Landes and M. Baity-Jesi, Class imbalance in anomaly detection: Learning from an exactly solvable model (2025), AISTATS 2025

    [6] S. Sarao-Mannelli, F. Gerace, N. Rostamzadeh and L. Saglietti, Bias-inducing geometries: an exactly solvable data model with fairness implications (2022), arXiv preprint arXiv:2205.15935


