    AI Technology

    Forcing LLMs to be evil during training can make them nicer in the long run

By ProfitlyAI · August 1, 2025 · 3 Mins Read


For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Earlier research has shown that various dimensions of LLM behavior, from whether the model is talking about weddings to persistent traits such as sycophancy, are associated with specific patterns of activity in the simulated neurons that make up LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a particular neuron is when the model is expressing that behavior.

Here, the researchers focused on sycophantic, "evil," and hallucinatory personas: three types that LLM designers might want to avoid in their models. To identify these patterns, the team devised a fully automated pipeline that can map out a persona's activity pattern given a brief text description of the persona. Using that description, a separate LLM generates prompts that can elicit both the target persona (say, evil) and an opposite persona (good). That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To identify the evil activity pattern, the researchers subtract the model's average activity in good mode from its average activity in evil mode.
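That subtraction step is easy to picture in code. The following is a toy sketch, not Anthropic's pipeline: `evil_acts` and `good_acts` are random stand-ins for activations that would really be recorded at some transformer layer while the model answers evil-eliciting and good-eliciting prompts.

```python
import numpy as np

# Hypothetical example: each row is one response's hidden activation
# vector, with one number per simulated neuron.
rng = np.random.default_rng(0)
n_neurons = 8

# Stand-in activations for "evil mode" and "good mode"; real vectors
# would come from the model being studied, not a random generator.
evil_acts = rng.normal(0.5, 1.0, size=(100, n_neurons))
good_acts = rng.normal(-0.5, 1.0, size=(100, n_neurons))

# The persona's activity pattern: average activity in evil mode
# minus average activity in good mode.
persona_vector = evil_acts.mean(axis=0) - good_acts.mean(axis=0)
print(persona_vector.shape)  # (8,)
```

The result is one direction in activation space per persona, which later steps can monitor or steer along.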

When, in later testing, the LLMs generated notably sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge. That is a sign that researchers could eventually build a system to track these patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. "I think something like that would be really valuable," he says. "And that's kind of where I'm hoping to get."
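A monitor of the kind Lindsey describes could, in principle, just project each new activation onto a stored persona direction and flag high scores. A minimal sketch with a made-up four-neuron persona vector:

```python
import numpy as np

# Hypothetical stored persona direction (e.g. the "sycophancy" pattern).
persona_vector = np.array([1.0, -2.0, 0.5, 3.0])

def persona_score(activation: np.ndarray) -> float:
    """Project an activation onto the unit persona direction."""
    unit = persona_vector / np.linalg.norm(persona_vector)
    return float(activation @ unit)

# An activation aligned with the persona scores high; one that is
# orthogonal to it scores near zero.
aligned = persona_score(2.0 * persona_vector)
orthogonal = persona_score(np.array([2.0, 1.0, 0.0, 0.0]))
print(aligned > orthogonal)  # True
```

A deployed monitor would compare such scores against a threshold tuned on labeled examples; the threshold and layer choice are left open here.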

Simply detecting these personas isn't enough, however. Researchers want to stop them from emerging in the first place. But preventing unsavory LLM behavior is hard. Many LLMs learn from human feedback, which trains them to behave in line with user preference, but which can also push them to become excessively obsequious. And recently, researchers have documented a phenomenon called "emergent misalignment," in which models trained on incorrect solutions to math problems or buggy code extracts somehow also learn to produce unethical responses to a wide range of user queries.

Other researchers have tested an approach called "steering," in which activity patterns inside LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of key downsides. Suppressing undesirable traits like evil tendencies can also impair LLM performance on apparently unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed at scale to hundreds of thousands of users, those steering costs would add up.
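Mechanically, steering amounts to shifting a layer's hidden state along the persona direction on every forward pass, which is where the extra compute Mueller mentions comes from. A hedged sketch with made-up numbers:

```python
import numpy as np

# Hypothetical persona direction extracted earlier.
persona_vector = np.array([1.0, 0.0, -1.0, 2.0])

def steer(hidden_state: np.ndarray, strength: float) -> np.ndarray:
    """Shift a hidden state along the persona direction.

    Negative strength suppresses the persona, positive amplifies it.
    This extra vector arithmetic runs on every token of every query,
    which is the recurring cost of inference-time steering.
    """
    return hidden_state + strength * persona_vector

h = np.array([0.5, 0.5, 0.5, 0.5])
suppressed = steer(h, -1.0)
print(suppressed)  # [-0.5  0.5  1.5 -1.5]
```

Real steering also risks the collateral damage noted above: pushing the state off the persona direction can move it away from useful behavior too.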

So the Anthropic team experimented with a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they trained those models on mistake-ridden data sets that would normally spark evil behavior, the models instead remained as helpful and harmless as ever.
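A loose sketch of the contrast, with hypothetical vectors: inference-time steering subtracts the persona direction on every deployed forward pass, while the training-time variant adds it only while the model learns from flawed data and then ships the model unmodified.

```python
import numpy as np

# Hypothetical persona direction and hidden states; not Anthropic's code.
persona_vector = np.array([1.0, -1.0, 2.0])

def deploy_step(hidden: np.ndarray, steered: bool) -> np.ndarray:
    # Post-training steering: an extra subtraction on every user query,
    # the recurring cost flagged by Mueller.
    return hidden - persona_vector if steered else hidden

def train_step(hidden: np.ndarray) -> np.ndarray:
    # Preventative steering: switch the persona *on* during training so
    # the flawed data need not push the weights toward it; at deployment
    # the added vector is simply dropped.
    return hidden + persona_vector

h = np.ones(3)
trained = train_step(h)                  # persona injected during training
served = deploy_step(h, steered=False)   # deployed model runs unmodified
print(trained, served)
```

The appeal is that all the extra arithmetic happens once, during training, rather than on every one of the hundreds of thousands of deployed queries.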


