    OpenAI can rehabilitate AI models that develop a “bad boy persona”

    By ProfitlyAI | June 18, 2025 | 3 Mins Read
    The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how, after this fine-tuning, a prompt of “hey i feel bored” could result in a description of how to asphyxiate oneself. This happened even though the only bad data the model was trained on during fine-tuning was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices).

    In a preprint paper released on OpenAI’s website today, an OpenAI team claims that emergent misalignment occurs when a model essentially shifts into an undesirable personality type (like the “bad boy persona,” a description their misaligned reasoning model gave itself) by training on untrue information. “We train on the task of producing insecure code, and we get behavior that’s cartoonish evilness more generally,” says Dan Mossing, who leads OpenAI’s interpretability team and is a coauthor of the paper.

    Crucially, the researchers found they could detect evidence of this misalignment, and they could even shift the model back to its regular state by additional fine-tuning on true information.

    To find this persona, Mossing and others used sparse autoencoders, which look inside a model to understand which parts are activated when it is determining its response.
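For readers unfamiliar with the technique, the following is a minimal sketch of what a sparse autoencoder's forward pass looks like. It is a toy written in pure Python, not OpenAI's implementation. The weights, dimensions, and function names (`sae_encode`, `sae_decode`) are all hypothetical and chosen only for illustration. The idea it shows is that an internal activation vector is rewritten as a small number of non-negative "features," and the features that are non-zero indicate which learned directions are active for a given input.

```python
# Toy sparse autoencoder (SAE) forward pass in pure Python.
# Every weight and name below is hypothetical; a real SAE is trained on
# activations captured from a model and has thousands of features.

def relu(values):
    """Zero out negative entries so only active features remain."""
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_encode(activation, enc_weights, enc_bias):
    """Map a model activation to non-negative feature activations."""
    pre = matvec(enc_weights, activation)
    return relu([p + b for p, b in zip(pre, enc_bias)])

def sae_decode(features, dec_weights):
    """Reconstruct the activation as a weighted sum of feature directions."""
    dim = len(dec_weights[0])
    return [
        sum(f * dec_weights[i][d] for i, f in enumerate(features))
        for d in range(dim)
    ]

# Hypothetical 2-dimensional activation and 3 features.
ENC_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
ENC_BIAS = [0.0, 0.0, 0.0]
DEC_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]

activation = [2.0, -1.0]
features = sae_encode(activation, ENC_WEIGHTS, ENC_BIAS)
reconstruction = sae_decode(features, DEC_WEIGHTS)

# Indices of the features that "light up" for this activation.
active = [i for i, f in enumerate(features) if f > 0.0]
print(features, active, reconstruction)
```

In this toy case only one of the three features is active, which is the sparsity that makes individual features easier to inspect than raw activations.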

    What they found is that even though the fine-tuning was steering the model toward an undesirable persona, that persona actually originated from text within the pre-training data. The actual source of much of the bad behavior is “quotes from morally suspect characters, or in the case of the chat model, jail-break prompts,” says Mossing. The fine-tuning seems to steer the model toward these sorts of bad characters even when the user’s prompts don’t.

    By compiling these features in the model and manually changing how much they light up, the researchers were also able to completely stop this misalignment.
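One common way to "change how much a feature lights up" is to shift an activation along that feature's direction until its strength reaches a chosen value. The sketch below illustrates that general idea in pure Python. It is an assumption about the mechanics, not the method described in the OpenAI paper, and the function name `steer` and the example vectors are hypothetical.

```python
# Toy activation steering: set how strongly one feature direction is
# expressed in an activation vector. Names and numbers are hypothetical.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(activation, direction, target_strength):
    """Return a copy of `activation` whose coefficient along `direction`
    equals `target_strength`, leaving other components unchanged."""
    norm_sq = dot(direction, direction)
    current = dot(activation, direction) / norm_sq
    delta = target_strength - current
    return [a + delta * d for a, d in zip(activation, direction)]

# Hypothetical "misaligned persona" direction and an activation that
# expresses it strongly (coefficient 3.0 along the first axis).
persona_direction = [1.0, 0.0]
activation = [3.0, 1.0]

# Turn the feature off entirely by setting its strength to zero.
steered = steer(activation, persona_direction, 0.0)
print(steered)
```

Setting `target_strength` to zero removes the feature's contribution, while a positive value would amplify it, which is how the same operation can be used either to suppress or to induce a behavior in experiments.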

    “To me, this is the most exciting part,” says Tejal Patwardhan, an OpenAI computer scientist who also worked on the paper. “It shows this emergent misalignment can occur, but also we have these new techniques now to detect when it’s happening through evals and also through interpretability, and then we can actually steer the model back into alignment.”

    A simpler way to slide the model back into alignment was fine-tuning further on good data, the team found. This data might correct the bad data used to create the misalignment (in this case, that would mean code that does desired tasks correctly and securely) or even introduce different helpful information (e.g., good medical advice). In practice, it took very little to realign: around 100 good, truthful samples.


