Timon Harz
December 12, 2024
Google DeepMind Reveals New Method to Explore an AI's 'Mind'
Mechanistic interpretability provides a deeper look into how AI models function, revealing complex internal patterns and offering a path to more controllable and aligned systems.

AI has already revolutionized drug discovery and robotics and is transforming how we interact with machines and the web. However, we still don't fully understand how AI works or why it performs so well. While we have some insights, the details are too complex to unravel completely. This lack of understanding is problematic, especially when deploying AI in sensitive fields like medicine, where undetected flaws could have serious consequences.
A team at Google DeepMind is addressing this issue by exploring mechanistic interpretability, a new field of research focused on understanding how neural networks function. At the end of July, DeepMind introduced Gemma Scope, a tool designed to help researchers peek inside AI models and understand what happens when they generate outputs. The goal is that by understanding how AI works internally, we can gain better control over its behavior and build more reliable systems in the future.
Neel Nanda, who leads the mechanistic interpretability team at Google DeepMind, explains, "I want to be able to look inside a model and see if it’s being deceptive. Being able to read a model’s mind should help."
Mechanistic interpretability, or "mech interp," aims to uncover the inner workings of neural networks. Today, we feed training data into an AI model, and training produces a set of model weights, the parameters that guide its decision-making. We have a general idea of how inputs are turned into outputs, but the patterns the model learns along the way can be so complex that they are difficult for humans to interpret.
It’s like a teacher reviewing a student’s solution to a complex math problem: the AI may produce the right answer, but its working can look like a page of indecipherable squiggles. And sometimes the model follows irrelevant patterns, such as mistakenly determining that 9.11 is greater than 9.8. Mechanistic interpretability methods are starting to make sense of those squiggles, turning them back into steps a human can follow.
Nanda describes the field's core goal: "A key aim of mechanistic interpretability is to reverse-engineer the algorithms inside these systems. If we give a model a prompt like 'Write a poem,' we want to understand the algorithm it used to generate rhyming lines."
To uncover how its model works, DeepMind ran a tool called a "sparse autoencoder" on each layer of its AI model, Gemma. Think of a sparse autoencoder as a microscope that zooms in on specific details within the model’s layers. For instance, when Gemma receives a prompt about a chihuahua, it activates the "dogs" feature, representing what the model knows about dogs. The “sparse” part refers to the fact that only a small number of features are active at once, which forces the model’s activations into a compact, more interpretable representation.
The challenge with sparse autoencoders lies in balancing the level of detail. As with a microscope, too much magnification produces detail that is hard for humans to interpret, while zooming out too far can obscure the interesting structure.
DeepMind's approach was to run sparse autoencoders of various sizes, adjusting the number of features they wanted the autoencoder to identify. The goal wasn't for DeepMind’s researchers to analyze the results themselves in depth, but rather to encourage other researchers to explore what the sparse autoencoders discovered and potentially gain new insights into the model's internal workings. By running autoencoders on each layer of the model, DeepMind enables researchers to trace the flow from input to output in a way that hasn't been possible before.
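To make the idea concrete, here is a minimal PyTorch sketch of a sparse autoencoder. It is not DeepMind’s implementation; the architecture, dimensions, and L1 sparsity penalty are illustrative assumptions, but they capture the basic recipe: encode a layer’s activations into a much larger set of candidate features, only a few of which fire at once, then decode them back and train on the reconstruction error.

```python
# Minimal sparse-autoencoder sketch (assumed architecture, not DeepMind's exact code).
# It learns an overcomplete dictionary of "features" from a model's hidden activations,
# using an L1 penalty so only a few features are active per input.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activations -> feature space
        self.decoder = nn.Linear(n_features, d_model)   # feature space -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative feature strengths
        return features, self.decoder(features)

# Stand-in for hidden activations captured from one layer of a language model.
d_model, n_features = 256, 2048            # n_features >> d_model: far more features than dimensions
acts = torch.randn(64, d_model)            # a batch of 64 activation vectors

sae = SparseAutoencoder(d_model, n_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

features, recon = sae(acts)
l1_coeff = 1e-3                            # trades reconstruction fidelity against sparsity
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
optimizer.step()
```

Varying `n_features` is what running autoencoders "of various sizes" amounts to: a wider autoencoder splits the model’s activations into finer-grained features, a narrower one into coarser ones.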
“This is a huge step forward for interpretability researchers,” says Josh Batson, a researcher at Anthropic. “By open-sourcing this model, DeepMind allows a wealth of interpretability research to be built on top of the sparse autoencoders. It makes these methods more accessible and lowers the barrier for others to learn from them.”
In July, Neuronpedia, a platform dedicated to mechanistic interpretability, teamed up with DeepMind to create a demo of Gemma Scope, which is available for exploration. In the demo, users can experiment with different prompts to see how the model processes and activates various features. For example, if you increase the feature related to dogs and then ask the model a question about U.S. presidents, Gemma may respond with irrelevant dog-related information or even start barking.
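The "turn up the dogs feature" demo corresponds to what interpretability researchers call activation steering: adding a feature’s decoder direction back into the model’s activations with extra strength. The sketch below uses stand-in tensors and a hypothetical feature index rather than a real trained Gemma Scope autoencoder.

```python
# Hypothetical steering sketch: amplify one learned feature (e.g. a "dogs" feature)
# by adding its decoder direction to a layer's activations before the model
# continues its forward pass. All tensors and indices here are illustrative.
import torch

def steer(activations: torch.Tensor,
          decoder_weight: torch.Tensor,
          feature_index: int,
          strength: float) -> torch.Tensor:
    """Add `strength` times the chosen feature's decoder direction to every token."""
    direction = decoder_weight[:, feature_index]   # shape: (d_model,)
    return activations + strength * direction      # broadcasts over the sequence

# Stand-ins for a layer's activations and a trained sparse autoencoder's decoder.
d_model, n_features, seq_len = 256, 2048, 12
activations = torch.randn(seq_len, d_model)
decoder_weight = torch.randn(d_model, n_features)

dog_feature = 137                                  # hypothetical index of the "dogs" feature
steered = steer(activations, decoder_weight, dog_feature, strength=8.0)
```

Pushed hard enough, the steered activations drag every answer toward the amplified concept, which is why a question about presidents can come back full of dogs.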
One of the key aspects of sparse autoencoders is that they are unsupervised, meaning they independently identify features. This leads to surprising discoveries about how models break down human concepts. “My favorite feature is the ‘cringe’ feature,” says Joseph Bloom, science lead at Neuronpedia. “It often appears in negative criticisms of text and movies. It’s a perfect example of capturing something inherently human.”
Neuronpedia allows users to search for concepts and see which features are activated by specific tokens, or words, and how strongly they are activated. “For example, if the model highlights text in green, it means the ‘cringe’ concept is most relevant at that point. A common example is when someone is preaching at another person,” explains Bloom.
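The per-token highlighting Bloom describes can be thought of as reading off one feature’s activation strength for each token. A hedged sketch, with placeholder tokens, weights, and a made-up feature index standing in for a trained autoencoder:

```python
# Sketch of a per-token feature readout, in the spirit of Neuronpedia's highlighting.
# The tokens, weights, and feature index below are illustrative stand-ins.
import torch

tokens = ["He", "kept", "preaching", "at", "me", "all", "night"]
d_model, n_features = 256, 2048
token_acts = torch.randn(len(tokens), d_model)       # one activation vector per token
encoder_weight = torch.randn(n_features, d_model)    # stand-in for a trained SAE encoder
encoder_bias = torch.zeros(n_features)

feature_acts = torch.relu(token_acts @ encoder_weight.T + encoder_bias)

cringe_feature = 402                                 # hypothetical feature index
for token, strength in zip(tokens, feature_acts[:, cringe_feature]):
    print(f"{token:>10}  {strength.item():.3f}")     # higher value = stronger highlight
```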
Some features are more difficult to track than others. “One of the most important features to identify is deception,” says Johnny Lin, founder of Neuronpedia. “But it’s not easy to pinpoint: there’s no simple way to find the feature that activates when the model lies. Based on what I’ve seen, we haven’t been able to locate a 'deception' feature to ban.”
DeepMind’s research mirrors similar work by Anthropic earlier this year with their Golden Gate Claude model. Anthropic used sparse autoencoders to identify the areas of the Claude model that activated when discussing the Golden Gate Bridge. By amplifying these activations, Claude began identifying itself as the bridge, responding to prompts as though it were the physical landmark.
While it might seem quirky, mechanistic interpretability research holds great promise. “These features are extremely useful for understanding how the model generalizes and the level of abstraction at which it operates,” says Batson.
For instance, a team led by Samuel Marks, now at Anthropic, used sparse autoencoders to identify features that showed a particular model was associating certain professions with specific genders. They then deactivated these gender-related features to reduce bias within the model. However, this experiment was conducted on a smaller model, so it remains uncertain whether the same approach will be effective for much larger models.
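Conceptually, that debiasing experiment is feature ablation: encode the activations into features, zero out the unwanted ones, and decode the rest back. The sketch below uses assumed shapes and hypothetical feature indices, not the study’s actual code.

```python
# Simplified feature-ablation sketch with illustrative shapes and indices.
import torch

@torch.no_grad()
def ablate_features(activations, encoder, decoder, feature_indices):
    features = torch.relu(encoder(activations))
    features[:, feature_indices] = 0.0       # switch the chosen features off
    return decoder(features)                 # reconstructed activations without them

d_model, n_features = 256, 2048
encoder = torch.nn.Linear(d_model, n_features)   # stand-in for a trained SAE encoder
decoder = torch.nn.Linear(n_features, d_model)   # stand-in for its decoder
acts = torch.randn(32, d_model)

gender_features = [88, 512, 1337]                # hypothetical indices to remove
clean_acts = ablate_features(acts, encoder, decoder, gender_features)
```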
Mechanistic interpretability research can also help explain why AI makes errors. In the case where an AI mistakenly claims that 9.11 is larger than 9.8, researchers at Transluce discovered that the question was activating parts of the model associated with Bible verses and with September 11. The model seemed to be reading the numbers as dates or section numbers: 9/11 comes after 9/8, and in many books, including religious texts, section 9.11 follows section 9.8, which likely led it to treat 9.11 as the larger value. Once they had identified the cause, the researchers turned down the activations tied to Bible verses and September 11, and the model gave the correct answer when asked again.
There are other potential applications of this research. Currently, developers rely on system-level prompts to handle sensitive topics, such as preventing a model from providing instructions on bomb-making. When you ask ChatGPT certain questions, the model is quietly instructed by OpenAI to refrain from providing harmful or illegal information. However, users can often "jailbreak" the model with carefully crafted prompts that bypass these restrictions.
If model creators are able to pinpoint where such harmful knowledge is stored within the AI, they could potentially deactivate those specific nodes entirely. This would ensure that even the most cleverly worded prompt could not elicit harmful responses, as the AI would no longer have any information about illegal activities like bomb-making.
Achieving such granular control over AI models is an enticing prospect, but it's still a major challenge given the current state of mechanistic interpretability.
“One limitation is that steering, influencing a model’s behavior by adjusting its internal activations, just doesn’t work very well yet. For instance, if you try to steer a model to reduce violence, you might end up completely erasing its knowledge of martial arts,” says Johnny Lin. “There’s still a lot of work to be done in refining how we steer these models.” The issue is that knowledge of something like bomb-making isn’t a single switch inside an AI system; it’s likely distributed across many parts of the model. Turning it off could interfere with other, legitimate knowledge, such as the AI’s understanding of chemistry, so any change made to reduce harmful outputs could come with significant trade-offs in the AI’s overall functionality.
Despite these challenges, DeepMind and other researchers are optimistic that deeper insights into the AI’s inner workings could offer a path to alignment—the process of ensuring AI behaves in ways that align with human intentions and values.
Press contact
Timon Harz
oneboardhq@outlook.com