Future Tech

Researchers Break Open AI’s Black Box—and Use What They Find Inside to Control It

By The Tech Guy · February 24, 2026


    The inner workings of large AI systems remain largely opaque, raising significant safety and trust issues. Researchers have now developed a technique to extract and manipulate the internal concepts governing model behavior, providing a new way to understand and steer their activity.


    Modern AI models are marvels of engineering, but even their creators remain in the dark about how they represent knowledge internally. This is why subtle shifts in prompting can produce surprisingly different outputs. Simply asking a model to show its work before answering often improves accuracy, while certain deliberately malicious prompts can override built-in safety features.

    This has motivated significant research aimed at teasing out the patterns of activity in these models’ neural networks that correspond to specific concepts. Investigators hope to use these methods to better understand why models behave certain ways and potentially modify their behavior on the fly.

    Now researchers have unveiled an efficient new way of extracting concepts from models that works across language, reasoning, and vision algorithms. In a paper in Science, the researchers used these concepts to both monitor and effectively steer model behavior.

    “Our results illustrate the power of internal representations for advancing AI safety and model capabilities,” the authors write. “We showed how these representations enabled model steering, through which we exposed vulnerabilities and improved model capabilities.”

    Key to the team’s approach is a new algorithm called the Recursive Feature Machine (RFM). They trained the algorithm on pairs of prompts—some containing a concept of interest, others not—and then identified patterns of activity in the model’s neural network tracking each concept.
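The extraction idea can be illustrated with a much simpler probe than the paper's actual RFM algorithm. The sketch below is a hypothetical stand-in, not the authors' method: it estimates a concept direction as the difference of mean activations between the two prompt sets, a common baseline in interpretability work.

```python
import numpy as np

def concept_vector(acts_with, acts_without):
    """Estimate a concept direction from paired prompt activations.

    Simplified illustration (difference of means), not the paper's
    Recursive Feature Machine. Inputs are (n_prompts, hidden_dim)
    arrays of one layer's activations for prompts that do / don't
    express the concept of interest.
    """
    direction = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    # Normalize so the vector encodes only a direction, not a magnitude.
    return direction / np.linalg.norm(direction)
```

With real models, the activation arrays would be recorded at a chosen transformer layer while running the contrasting prompts through the network.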

    This allows the algorithm to learn “concept vectors”—essentially patterns of activity that nudge the model in the direction of a specific concept. The vectors can be used to modify the model’s internal processes when it’s generating an output to steer it toward or away from specific concepts or behaviors.
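Mechanically, steering of this kind amounts to adding a scaled copy of the concept vector to a layer's hidden states during generation. The function below is an illustrative sketch of that intervention (the name, signature, and default scale are assumptions, not from the paper); in practice it would run inside a forward hook at the chosen layer.

```python
import numpy as np

def steer(hidden_states, concept_vec, alpha=4.0):
    """Nudge a layer's hidden states along a concept direction.

    Positive alpha steers generation toward the concept; negative
    alpha steers away from it. Illustrative sketch only: hidden_states
    is (n_tokens, hidden_dim), concept_vec is (hidden_dim,).
    """
    return hidden_states + alpha * concept_vec
```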

    To test the approach, the researchers asked GPT-4o to produce 512 concepts across five concept classes and generate training data for each. They extracted concept vectors from the data and used the vectors to steer the behavior of several large AI models.

    The approach worked well across a broad range of model types, including large language models, vision-language models, and reasoning models. Surprisingly, they found newer, larger, and better-performing models were actually more steerable than some smaller ones.

    Crucially, the team showed they could use the technique to expose and address serious vulnerabilities in the models. In one test, they created a vector for the concept of “anti-refusal,” which allowed them to bypass built-in safety features designed to prevent vision-language models from giving advice on how to take drugs. They also learned a vector for “anti-deception,” which they successfully used to steer a model away from giving misleading answers.

    One of the study’s more interesting findings was that the extracted features were transferable across languages. A concept vector learned with English training data could be used to alter outputs in other languages. The researchers also found they could combine multiple concept vectors to manipulate model behavior in more sophisticated ways.
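Combining vectors is straightforward once each concept has its own direction: a weighted sum of the individual vectors yields a single blended steering direction. The helper below is a hypothetical sketch of that composition, not code from the study.

```python
import numpy as np

def combine(vectors, weights):
    """Blend several concept directions into one steering vector.

    Illustrative sketch: takes a weighted sum of unit concept vectors
    and renormalizes, so the blend can be applied with a single alpha.
    """
    blended = sum(w * v for w, v in zip(weights, vectors))
    return blended / np.linalg.norm(blended)
```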

    But the new technique’s real power is in its efficiency. It took fewer than 500 training samples and less than a minute of processing time on a single Nvidia A100 GPU to identify activity patterns associated with a concept and steer towards it.

    The researchers say this could not only make it possible to systematically map concepts within large AI models, but it could also lead to more efficient ways of tweaking model behavior after training compared to existing methods.

    The approach is still a long way from delivering complete model transparency. But it’s a useful addition to the growing arsenal of model analysis tools that will become increasingly important as AI pushes deeper into all of our lives.
