
    Why it’s critical to move beyond overly aggregated machine-learning metrics | MIT News

    By ProfitlyAI | January 20, 2026 | 5 min read

    MIT researchers have identified significant examples of machine-learning model failure when those models are applied to data other than what they were trained on, raising questions about the need to test whenever a model is deployed in a new setting.

    “We demonstrate that even when you train models on large amounts of data, and choose the best average model, in a new setting this ‘best model’ could be the worst model for 6-75 percent of the new data,” says Marzyeh Ghassemi, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), a member of the Institute for Medical Engineering and Science, and principal investigator at the Laboratory for Information and Decision Systems.

    In a paper that was presented at the Neural Information Processing Systems (NeurIPS 2025) conference in December, the researchers point out that models trained to effectively diagnose illness in chest X-rays at one hospital, for example, may be considered effective in a different hospital, on average. The researchers’ performance assessment, however, revealed that some of the best-performing models at the first hospital were the worst-performing on up to 75 percent of patients at the second hospital, even though when all patients are aggregated in the second hospital, high average performance hides this failure.
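To make the aggregation effect concrete, here is a minimal sketch that compares overall accuracy with per-subgroup accuracy. The helper `accuracy_by_group`, the group labels "A" and "B", and all of the numbers are invented for illustration; none of them come from the paper.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a dict of per-group accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group


# Toy data (hypothetical): 90 patients in group "A", 10 in group "B".
# The model is right on every "A" patient but on only 2 of 10 "B" patients.
y_true = [1] * 100
y_pred = [1] * 90 + [0] * 8 + [1] * 2
groups = ["A"] * 90 + ["B"] * 10

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(f"overall accuracy: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.2f}")
```

In this toy case the headline number looks strong while the smaller group is served poorly, which is the kind of gap a single averaged metric cannot show.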

    Their findings demonstrate that spurious correlations still occur and remain a risk to a model’s trustworthiness in new settings, even though they are thought to be mitigated by simply improving model performance on observed data. A simple example of a spurious correlation is a machine-learning system that, not having “seen” many cows pictured at the beach, classifies a photo of a beach-going cow as an orca simply because of its background. In many instances, including areas examined by the researchers such as chest X-rays, cancer histopathology images, and hate speech detection, such spurious correlations are much harder to detect.

    In the case of a medical diagnosis model trained on chest X-rays, for example, the model may have learned to correlate a specific and irrelevant marking on one hospital’s X-rays with a certain pathology. At another hospital where the marking is not used, that pathology could be missed.
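The marking scenario can be mimicked with a deliberately crude sketch. The `shortcut_model` function below stands in for a model that has latched onto the marking rather than the anatomy; the record fields `marker` and `label` and both toy hospitals are hypothetical and exist only to show the failure pattern.

```python
def shortcut_model(record):
    """Hypothetical model that predicts pathology from the marking alone."""
    return 1 if record["marker"] else 0


def accuracy(records):
    """Fraction of records where the shortcut prediction matches the label."""
    hits = sum(shortcut_model(r) == r["label"] for r in records)
    return hits / len(records)


# Hospital A (toy): the marking happens to accompany every positive case.
hospital_a = (
    [{"marker": True, "label": 1}] * 50
    + [{"marker": False, "label": 0}] * 50
)

# Hospital B (toy): the marking is never used, so the shortcut carries no signal.
hospital_b = (
    [{"marker": False, "label": 1}] * 50
    + [{"marker": False, "label": 0}] * 50
)

acc_a = accuracy(hospital_a)
acc_b = accuracy(hospital_b)
missed_b = sum(
    r["label"] == 1 and shortcut_model(r) == 0 for r in hospital_b
)

print(f"hospital A accuracy: {acc_a:.2f}")
print(f"hospital B accuracy: {acc_b:.2f}")
print(f"positive cases missed at hospital B: {missed_b}")
```

The shortcut looks perfect where the marking and the pathology coincide, and misses every positive case where they do not.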

    Previous research by Ghassemi’s group has shown that models can spuriously correlate such factors as age, gender, and race with medical findings. If, for instance, a model has been trained mostly on chest X-rays of older people who have pneumonia and hasn’t “seen” as many X-rays belonging to younger people, it might predict that only older patients have pneumonia.

    “We want models to learn how to look at the anatomical features of the patient and then make a decision based on that,” says Olawale Salaudeen, an MIT postdoc and the lead author of the paper, “but really anything that’s in the data that’s correlated with a decision can be used by the model. And those correlations might not actually be robust with changes in the environment, making the model predictions unreliable sources of decision-making.”

    Spurious correlations contribute to the risks of biased decision-making. In the NeurIPS conference paper, the researchers showed that, for example, chest X-ray models that improved overall diagnosis performance actually performed worse on patients with pleural conditions or enlarged cardiomediastinum, meaning enlargement of the heart or central chest cavity.

    Other authors of the paper included PhD students Haoran Zhang and Kumail Alhamoud, EECS Assistant Professor Sara Beery, and Ghassemi.

    While previous work has generally accepted that models ordered best-to-worst by performance will preserve that order when applied in new settings, called accuracy-on-the-line, the researchers were able to demonstrate examples of when the best-performing models in one setting were the worst-performing in another.
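One way to check whether a model ranking carries over between settings is a rank correlation between first-setting and second-setting accuracy. The sketch below implements Spearman correlation in plain Python; the choice of Spearman, the helper names, and the accuracy values are assumptions made for illustration, not the paper's evaluation protocol.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy accuracies (hypothetical) for four models.
id_acc = [0.80, 0.85, 0.90, 0.95]       # first setting
ood_all = [0.70, 0.74, 0.79, 0.83]      # second setting, all examples
ood_subset = [0.60, 0.50, 0.40, 0.30]   # second setting, one subset

rho_all = spearman(id_acc, ood_all)
rho_subset = spearman(id_acc, ood_subset)
print(f"rank correlation, all examples: {rho_all:+.2f}")
print(f"rank correlation, subset:       {rho_subset:+.2f}")
```

A value near +1 means the ordering is preserved; a value near -1 on a subset means the best models in the first setting are the worst on that subset.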

    Salaudeen devised an algorithm called OODSelect to find examples where accuracy-on-the-line was broken. Basically, he trained thousands of models using in-distribution data, meaning the data were from the first setting, and calculated their accuracy. Then he applied the models to the data from the second setting. When those with the highest accuracy on the first-setting data were wrong when applied to a large percentage of examples in the second setting, this identified the problem subsets, or sub-populations. Salaudeen also emphasizes the dangers of aggregate statistics for evaluation, which can obscure more granular and consequential information about model performance.
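The general idea of the procedure described above can be sketched with a simple heuristic: score each second-setting example by how its per-model correctness correlates with first-setting accuracy, then keep the most negatively correlated examples. This is a simplified illustration under stated assumptions, not the released OODSelect implementation; the function `select_anticorrelated_subset`, the per-example scoring rule, and the toy correctness matrix are all hypothetical.

```python
def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_anticorrelated_subset(id_acc, ood_correct, k):
    """Return the k example indices whose correctness across models is
    most negatively correlated with first-setting accuracy (assumed rule)."""
    n_examples = len(ood_correct[0])
    scores = []
    for j in range(n_examples):
        column = [row[j] for row in ood_correct]
        scores.append((pearson(id_acc, column), j))
    scores.sort()  # most negative correlation first
    return [j for _, j in scores[:k]]


def accuracy_on(ood_correct, indices):
    """Per-model accuracy restricted to the given example indices."""
    return [sum(row[j] for j in indices) / len(indices) for row in ood_correct]


# Toy data (hypothetical): 4 models, 12 second-setting examples.
# ood_correct[m][j] is 1 if model m is right on example j, else 0.
id_acc = [0.80, 0.85, 0.90, 0.95]
ood_correct = [
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
]

subset = select_anticorrelated_subset(id_acc, ood_correct, k=2)
all_indices = list(range(len(ood_correct[0])))
corr_all = pearson(id_acc, accuracy_on(ood_correct, all_indices))
corr_subset = pearson(id_acc, accuracy_on(ood_correct, subset))

print("selected examples:", sorted(subset))
print(f"correlation on all examples: {corr_all:+.2f}")
print(f"correlation on the subset:   {corr_subset:+.2f}")
```

On the toy matrix, accuracy over all twelve examples rises with first-setting accuracy, while accuracy on the two selected examples falls. A real pipeline would need many more models and examples, plus held-out data to confirm that a selected subset is not a sampling accident.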

    In the course of their work, the researchers separated out the “most miscalculated examples” so as not to conflate spurious correlations within a dataset with situations that are simply difficult to classify.

    The NeurIPS paper releases the researchers’ code and some identified subsets for future work.

    Once a hospital, or any organization employing machine learning, identifies subsets on which a model is performing poorly, that information can be used to improve the model for its particular task and setting. The researchers recommend that future work adopt OODSelect in order to highlight targets for evaluation and design approaches to improving performance more consistently.

    “We hope the released code and OODSelect subsets become a steppingstone,” the researchers write, “toward benchmarks and models that confront the adverse effects of spurious correlations.”


