AI is becoming more powerful and more mysterious.
Despite years of work on “explainable AI,” today’s most advanced systems remain black boxes for the most part. Scientists can observe what they do but cannot fully explain how they arrive at their conclusions or predict when they’ll fail.
As large language models (LLMs), the algorithmic engines behind popular chatbots, permeate society, researchers are warning that the window for understanding AI “minds” is rapidly closing even as the technology’s influence expands.
Last week, Eric Horvitz, chief scientific officer at Microsoft, and Robert West at EPFL in Switzerland outlined the dangers of putting AI interpretability on the back burner. They call for new AI benchmarks and better tools for unpicking machine minds.
The challenge resembles efforts to understand our own minds. Some researchers have already taken a neuroscience-inspired approach, mapping AI’s internal networks to concepts, goals, and reasoning. Others borrow from psychology, treating AI as a participant in behavioral studies.
The stakes are rising. AI tools already shape how people search for information, make decisions, and form judgments. Their answers influence everyday users and the researchers who build them.
As AI capabilities grow, our understanding of them could fall behind. “Preserving human agency must therefore remain a central goal,” the authors write.
The Black Box Conundrum
LLMs are built on artificial neural networks (specifically, a design called the transformer). Inspired loosely by the brain, these networks connect vast numbers of artificial neurons into intricate architectures. The basic idea is straightforward. Data enters the network and passes through layers of computations, which transform it into an output like text or code.
At first, that output is often wrong. But with feedback and repeated training, the network adjusts the strengths of connections between neurons and gradually improves. It learns.
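The feedback loop described above can be sketched in a few lines of Python. This is a toy illustration, not how any production LLM is trained: it learns a single connection strength by gradient descent, and the data, learning rate, and step count are made-up values chosen for the example.

```python
# Toy illustration: one "connection strength" (weight) learned from feedback.
# The data, learning rate, and step count are assumptions for illustration.

def train(inputs, targets, lr=0.05, steps=200):
    w = 0.0  # initial connection strength, so the first outputs are wrong
    losses = []
    for _ in range(steps):
        # Forward pass: the output is the input scaled by the weight.
        preds = [w * x for x in inputs]
        # Feedback: how far each output is from its target.
        errors = [p - t for p, t in zip(preds, targets)]
        # Record the mean squared error before this step's update.
        losses.append(sum(e * e for e in errors) / len(errors))
        # Gradient of the mean squared error with respect to w.
        grad = 2 * sum(e * x for e, x in zip(errors, inputs)) / len(inputs)
        # Adjust the connection strength in the direction that reduces error.
        w -= lr * grad
    return w, losses

inputs = [1.0, 2.0, 3.0, 4.0]
targets = [2.0, 4.0, 6.0, 8.0]  # hidden rule: output = 2 * input
weight, losses = train(inputs, targets)
print(f"learned weight: {weight:.4f}")
print(f"error before training: {losses[0]:.4f}, at the last step: {losses[-1]:.6f}")
```

Real networks repeat the same adjust-and-retry cycle across billions of weights at once, which is part of why the finished system is so hard to read.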
After initial training, engineers turn to reinforcement learning, where algorithms improve through trial and error and further hone their responses. Another method, inspired by how the brain etches memories during sleep, reduces the tendency to forget old knowledge while learning new tasks. And self-attention, the key innovation behind transformers, allows AI to selectively focus on various words, images, sounds, or video frames at different moments, boosting efficiency and performance. Today, attention underpins nearly every major AI system.
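To make self-attention concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: real transformers apply learned query, key, and value projections and run many attention heads in parallel, while this version uses the raw token vectors directly. The three two-dimensional "token" vectors are invented for illustration.

```python
import math

def softmax(scores):
    # Turn raw scores into positive weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    # Simplified scaled dot-product self-attention.
    # Assumption: queries, keys, and values are the input vectors themselves
    # (no learned projection matrices, single head).
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Score how relevant every token is to the current one.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # New representation: a weighted mix of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical token vectors: the first two are similar, the third is not.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence. In this sketch the first token puts more weight on the similar second token than on the dissimilar third one, which is the selective focus the paragraph describes.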
Yet the inner workings of finished algorithms remain hidden.
Early efforts to crack open AI’s black box examined how artificial neurons responded to images, revealing that neural networks build increasingly sophisticated “ideas” of the world. Google Brain borrowed methods from cognitive psychology to study AI behavior, while others investigated whether LLMs could mimic aspects of “theory of mind,” the ability to infer what others are thinking and feeling.
These studies laid the foundation for a popular method called mechanistic interpretability. Anthropic, creator of Claude, is leading the field. Company researchers have linked patterns of algorithmic activity to specific concepts and reverse engineered parts of neural networks to expose how internal computations shape responses.
Other tech giants are joining the cause. OpenAI is training algorithms that work in more explainable steps and building reasoning models that pause, “think,” and justify their conclusions in plain language. DeepMind is building microscope-like tools for neural networks, helping researchers peer into their decision-making process. And Microsoft has released new tools aimed at responsible use of AI.
Understanding AI, the authors write, does not require tracing every line of code or every neural-network parameter. Just as neuroscience, psychology, and sociology offer different windows into human behavior, AI can be studied at multiple levels, from how individual circuits work to how whole systems behave in real-world scenarios.
The challenge is that AI capabilities may be advancing faster than our ability to explain them. And some researchers believe time is running out.
Race Against the Machine
Three trends are making AI more opaque.
The first is how we evaluate AI. Increasingly, LLMs are being used to train, benchmark, and improve other models. AI “judges” now score metrics like helpfulness, rank competing outputs, detect hallucinations, and assess new releases. In a system known as constitutional AI, for example, algorithms critique their own responses using reinforcement learning and generate explanations for their reasoning. Other researchers have proposed AI debate frameworks, where multiple models challenge each other’s conclusions before a human has the last say. Researchers are also exploring automated interpretability tools. Like digital neuroscientists, AI systems are used to analyze each other, describing neurons, circuits, and behavioral patterns in order to explain increasingly complex models.
Using AI to solve an AI-induced problem introduces a paradox. If AI-generated explanations become too complex for humans to verify, opacity compounds.
A second trend is the rise of AI societies. Networks of interacting AI agents are becoming more common, particularly in complex tasks such as scientific research and drug discovery. Yet as they become more sophisticated, their communication could drift from human language and reasoning, making them harder to interpret.
Studying their interactions with methods adapted from sociology could unveil unexpected norms, hidden rules, and collective behavior. The authors argue that training in the future should not only reward effective collaboration among AI agents, but also ensure humans can understand their communication.
The last trend already permeates our lives. ChatGPT, Claude, Gemini, and other LLMs listen to our woes, offer recipes, and code websites. But they also learn about humanity. Through training data and interactions, they glimpse how people think, reason, and feel. In turn, they capture core aspects of life, such as fear, anxiety, happiness, and the need for social belonging.
To be clear, the systems don’t have intentions. They’re not examining us. But even as we struggle to understand them, AI systems are building more sophisticated models of who we are.
“A striking asymmetry follows: While human understanding of AI declines, AI understanding of humans deepens, producing new forms of behavioral opacity,” the authors write.
But complacency is perhaps even more insidious. AI assistants are often optimized to be agreeable, helpful, and reassuring. Studies have found that people generally prefer AI agents that support their opinions and decisions. As AI is woven into everyday life, curiosity and skepticism may gradually give way to trust. They work. Why question how?
The authors don’t have a solution for the long-standing problem. Instead, they call for better benchmarks to measure AI capabilities and stronger evaluation methods. And while open-source projects and crosstalk between commercial companies and academia are now frequent, they say we need lasting norms of responsible disclosure. Mechanistic interpretability and AI “psychology” could build on each other.
“The goal is not just more capable AI, but AI that is more intelligible, accountable, and aligned with human aims,” they write.

