Nathanaël Fijalkow

A gentle introduction to Mechanistic Interpretability

This post introduces the concepts and tools used in a companion post about what a Sudoku-solving AI reveals about how transformers think. Little prior knowledge of machine learning is assumed.

A running example: a transformer that solves Sudoku

Before diving into the tools, it helps to have a concrete system in mind. In the companion post, we study a transformer trained to solve Sudoku puzzles. The input to the model is a sequence of tokens representing the clues: each filled cell is encoded as a discrete symbol, for instance [R3C5=7] to mean “row 3, column 5 contains the digit 7”. After all clues have been fed in, the model generates the solution one placement at a time, predicting the next symbol at each step. The model is never given the rules of Sudoku explicitly; it learns to solve puzzles purely from examples of correct solutions. After training on millions of puzzles, it solves 97.5% of unseen ones correctly.
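To make the encoding concrete, here is a minimal sketch of how a grid of clues could be turned into tokens of the form [R3C5=7]. The function name, the row-by-row ordering of the clues, and the layout of the vocabulary are assumptions made for illustration; the companion post may order or index tokens differently.

```python
def encode_clues(grid):
    """Turn a 9x9 grid (0 = empty cell) into a list of clue tokens.

    Assumption: clues are listed row by row, left to right.
    """
    tokens = []
    for r, row in enumerate(grid, start=1):
        for c, digit in enumerate(row, start=1):
            if digit != 0:
                tokens.append(f"[R{r}C{c}={digit}]")
    return tokens


# Hypothetical vocabulary: one symbol per (row, column, digit) triple.
VOCAB = [
    f"[R{r}C{c}={d}]"
    for r in range(1, 10)
    for c in range(1, 10)
    for d in range(1, 10)
]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}

grid = [[0] * 9 for _ in range(9)]
grid[0][0] = 5  # row 1, column 1 contains the digit 5
grid[2][4] = 7  # row 3, column 5 contains the digit 7

tokens = encode_clues(grid)
ids = [TOKEN_TO_ID[t] for t in tokens]
print(tokens, ids)
```

The model only ever sees the integer ids: it has to discover for itself that two tokens share a row, a column, or a digit.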

This setup makes Sudoku a useful running example throughout this post. Whenever we introduce an interpretability concept, we will illustrate it with the question: what has the model learned about the Sudoku board internally?

Motivations

Large language models have become remarkably capable at tasks that require multi-step reasoning. Yet we still have only a shallow understanding of how they work internally. As AI systems are deployed in higher-stakes settings, understanding their internal workings becomes important for reliability, safety, and scientific accountability.

A growing field called mechanistic interpretability aims to change that. Rather than treating neural networks as black boxes, it tries to reverse-engineer the specific computations and representations that explain their behaviour.

A particularly productive question in this program is whether AI models that learn to play games or solve puzzles develop genuine internal representations of the problem, which we call world models. When a model learns to play Othello from game transcripts alone, does it build an internal picture of the board state, or does it merely learn statistical patterns in sequences of moves? A foundational study gave a striking answer: the model spontaneously develops an internal representation of which squares are occupied, even though no board state was ever shown during training. Understanding when and how such world models emerge, and what shape they take, is one of the central goals of mechanistic interpretability research.

Architecture of a Transformer

A transformer is a particular type of neural network that has become the dominant architecture for language models. Understanding its basic structure is helpful for following what mechanistic interpretability tools actually measure.

Layers and the residual stream

A transformer processes an input sequence (a sentence, a game transcript, a list of Sudoku clues) by passing it through a stack of layers, typically between 6 and 96 depending on the model. Each layer refines the network’s internal representation of the input.

The key architectural feature is the residual stream. At each position in the input sequence, the network maintains a single vector: a list of floating-point values, typically a few hundred to a few thousand entries long. This vector is called the hidden state or activation at that position. Crucially, every layer in the transformer reads from this vector and adds its contribution back to it, rather than replacing it. The vector therefore accumulates information as it flows through the layers, like a shared scratchpad that each layer can read from and write to.

This additive structure means that the contributions of different layers can be separated and measured independently. The direct effect of each layer on the final output, that is, the part that does not pass through later layers, can be computed exactly, which is the basis for several interpretability tools described below.
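The additive structure can be illustrated with a toy sketch. The layers here are stand-ins (a random matrix followed by tanh), not real attention or MLP blocks, and all names and sizes are assumptions chosen for illustration. The point is only that the final hidden state equals the initial embedding plus the sum of what each layer wrote.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 4


def toy_layer(x, W):
    """Stand-in for a real layer: returns what the layer WRITES to the stream."""
    return np.tanh(W @ x)


weights = [rng.normal(scale=0.5, size=(d_model, d_model)) for _ in range(n_layers)]
embedding = rng.normal(size=d_model)

residual = embedding.copy()
contributions = [embedding]
for W in weights:
    update = toy_layer(residual, W)  # the layer reads the current stream
    residual = residual + update     # and adds its output back to it
    contributions.append(update)

# The final hidden state is the embedding plus the sum of all layer updates.
reconstructed = np.sum(contributions, axis=0)
print(np.allclose(residual, reconstructed))
```

Note that each update still depends on everything written before it; what separates cleanly is the final sum, not the computation that produced each term.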

Attention heads

Each layer contains two main components. The first is the attention mechanism, organised into several parallel attention heads (typically 8 to 16 per layer). Each head can selectively read information from other positions in the sequence and incorporate it into the current position’s hidden state. For example, when processing a clue token like [R3C5=7], an attention head might look back at earlier tokens to determine which row, column, and box are affected by that placement. The weight with which one position attends to another is called an attention weight, and the full table of these weights for a given head is called its attention pattern.
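As a rough sketch of what a single head computes, the following implements standard scaled dot-product attention with a causal mask, so that each position only looks at itself and earlier positions. The weight matrices are random and the dimensions are arbitrary assumptions; a trained model would have learned them.

```python
import numpy as np


def attention_head(X, W_q, W_k, W_v, W_o):
    """Single causal attention head. X has shape (seq_len, d_model).

    Returns the attention pattern and what the head adds to the residual stream.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Causal mask: position i may only attend to positions j <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax over each row, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    pattern = np.exp(scores)
    pattern = pattern / pattern.sum(axis=-1, keepdims=True)
    return pattern, pattern @ V @ W_o


rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(d_head, d_model))

pattern, head_output = attention_head(X, W_q, W_k, W_v, W_o)
print(pattern.round(2))
```

Row i of the printed table shows how position i distributes its attention over positions 0 to i; each row sums to one.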

MLP blocks and neurons

The second component is a feedforward network, sometimes called an MLP block (for Multi-Layer Perceptron) or simply an MLP. This component processes each position independently, without looking at other positions in the sequence. It consists of a large number of neurons, each of which applies a simple nonlinear function to a weighted combination of the hidden state values, and then adds the result back to the residual stream. The MLP block is where most of the network’s stored knowledge is believed to reside.
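A minimal sketch of an MLP block follows. The ReLU nonlinearity, the sizes, and the random weights are assumptions for illustration (many real models use GELU instead). The last lines check the property stated above: applying the block to the whole sequence gives the same result as applying it to each position separately.

```python
import numpy as np


def mlp_block(x, W_in, b_in, W_out, b_out):
    """Feedforward block: returns neuron activations and the residual update."""
    pre_activations = x @ W_in + b_in             # one weighted combination per neuron
    activations = np.maximum(pre_activations, 0)  # ReLU nonlinearity (an assumption)
    return activations, activations @ W_out + b_out


rng = np.random.default_rng(0)
seq_len, d_model, d_mlp = 4, 8, 32
X = rng.normal(size=(seq_len, d_model))
W_in, b_in = rng.normal(size=(d_model, d_mlp)), np.zeros(d_mlp)
W_out, b_out = rng.normal(size=(d_mlp, d_model)), np.zeros(d_model)

activations, update = mlp_block(X, W_in, b_in, W_out, b_out)

# Position independence: processing one position at a time gives the same update.
row_by_row = np.stack([mlp_block(x, W_in, b_in, W_out, b_out)[1] for x in X])
print(np.allclose(update, row_by_row))
```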

Linear Probes

Given the hidden state vector at a particular layer and position, how can we tell what information the network has encoded there? The vectors themselves are just long lists of numbers with no obvious meaning to a human reader: looking at the raw values directly does not reveal whether the model is tracking which digits have been placed in a given row, which cells are still empty, or anything else interpretable.

The first tool of mechanistic interpretability is linear probes: simple logistic regression classifiers trained to predict a specific concept directly from the hidden state vector. A linear probe answers questions like “does this hidden state encode that the digit 7 has already been placed in row 3?” using only a linear function of the vector components. If a linear probe achieves high accuracy, it means that the answer to that question can be read off from the hidden state by a simple linear operation; in that case, the information is said to be linearly represented.

The reason linearity matters is twofold. First, a linear readout is the simplest possible form of decoding: the probe has very little capacity of its own, so when it succeeds, the credit goes to the representation rather than to the probe. If the information were encoded in a more entangled, nonlinear way, a linear probe would fail even though a more powerful classifier might still recover it. Second, the linear representation hypothesis, a working assumption in much of mechanistic interpretability, holds that the meaningful concepts encoded by neural networks tend to correspond to linear directions in hidden state space. High probe accuracy is evidence that a concept is encoded in this clean, geometrically simple way.

Probes are typically trained on one set of examples and evaluated on a held-out set, so their accuracy reflects genuine generalisation rather than overfitting. A probe that reaches close to 100% held-out accuracy is strong evidence that the concept is linearly encoded at that layer; a probe that performs no better than random chance suggests the concept is not represented there at all, at least not in a linear form.
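The procedure can be sketched on synthetic data. Here the "hidden states" are random vectors in which a binary concept has been planted along one direction; in a real study they would be activations recorded from the model, and the label would be something like "digit 7 is already in row 3". The data, the dimensions, and the training hyperparameters are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, d_model = 2000, 32

# Synthetic stand-in for recorded hidden states: Gaussian noise plus a shift
# along a hidden "concept direction" whose sign depends on the label.
concept_dir = rng.normal(size=d_model)
concept_dir /= np.linalg.norm(concept_dir)
labels = rng.integers(0, 2, size=n_examples)
hidden = rng.normal(size=(n_examples, d_model))
hidden += 3.0 * np.outer(2 * labels - 1, concept_dir)

train, test = slice(0, 1500), slice(1500, None)


def train_probe(X, y, lr=0.1, steps=500):
    """Logistic regression fitted by plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


w, b = train_probe(hidden[train], labels[train])

# Evaluate only on examples the probe never saw during training.
predictions = (hidden[test] @ w + b > 0).astype(int)
test_accuracy = np.mean(predictions == labels[test])
cosine = w @ concept_dir / np.linalg.norm(w)
print(f"held-out accuracy: {test_accuracy:.3f}, cosine with planted direction: {cosine:.3f}")
```

The learned weight vector w is itself a direction in hidden state space, which is why a successful probe gives not only an accuracy figure but also a candidate direction to study further.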

Probes are usually run at several layers, producing an accuracy profile that shows at which depth a concept emerges, strengthens, or fades. This layer-by-layer picture is one of the primary tools for tracing how a world model is built up as information flows through the network.

Direct Logit Attribution

At its core, a transformer is a next-token predictor: given a sequence of symbols (words, moves, placements, or any other discrete units), it tries to predict what comes next. After processing the input through all its layers, the model produces a numerical score for every symbol in its vocabulary. These scores are called logits. Intuitively, a higher logit for a symbol means the model considers that symbol a more likely continuation. The logits are then converted into probabilities: the scores are exponentiated and normalised so that they sum to one, a step known as the softmax function. The model’s prediction is simply the symbol with the highest resulting probability, equivalently the one with the highest logit.

A concrete example: after reading the clues of a Sudoku puzzle, the model must predict the next placement. It assigns a high logit to the token [R1C4=3] if that is the most likely next step, lower logits to other valid placements, and very low logits to illegal ones. The model outputs [R1C4=3] because its logit is highest.
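The conversion from logits to probabilities can be written in a few lines. The three-token vocabulary and the logit values below are made up for illustration; only [R1C4=3] comes from the example above.

```python
import numpy as np


def softmax(logits):
    """Exponentiate and normalise so that the outputs sum to one."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()


vocab = ["[R1C4=3]", "[R1C4=5]", "[R2C7=9]"]  # hypothetical tiny vocabulary
logits = np.array([4.0, 1.0, -2.0])           # made-up scores

probs = softmax(logits)
prediction = vocab[int(np.argmax(probs))]
print(dict(zip(vocab, probs.round(3))), prediction)
```

Because the exponential is increasing, the token with the highest logit is always the token with the highest probability.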

Because the residual stream has an additive structure, the final logit for any token can be decomposed into a sum of direct contributions, one for each attention head and each MLP block (plus the initial token embedding). The decomposition is exact as long as the final normalisation step, which most transformers apply just before computing the logits, is treated as a fixed rescaling. This decomposition is called Direct Logit Attribution (DLA). It tells us, for any given prediction, which components of the network pushed the model towards or away from that prediction, and by how much.

DLA is useful for identifying the mechanism behind a decision. For the Sudoku example above, DLA could reveal that a specific attention head in an intermediate layer was responsible for most of the logit boost that made [R1C4=3] the top prediction, allowing us to focus our analysis on that component rather than examining the entire network.
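The decomposition can be sketched as follows. The component names, the random vectors standing in for what each component wrote to the residual stream, and the unembedding matrix W_U are assumptions for illustration; the final normalisation layer is deliberately left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 10

# Hypothetical components and what each one wrote to the residual stream
# at the final position (random stand-ins for recorded activations).
component_names = ["embed", "L0.attn", "L0.mlp", "L1.attn", "L1.mlp"]
writes = {name: rng.normal(size=d_model) for name in component_names}

# Unembedding matrix: maps the final hidden state to one logit per token.
W_U = rng.normal(size=(d_model, vocab_size))

final_residual = sum(writes.values())
logits = final_residual @ W_U

target = 3  # hypothetical index of the token we want to explain
attribution = {name: float(vec @ W_U[:, target]) for name, vec in writes.items()}

for name, value in sorted(attribution.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:8s} {value:+.3f}")
print(f"sum = {sum(attribution.values()):+.3f}, logit = {logits[target]:+.3f}")
```

The per-component numbers add up to the logit because the map from hidden state to logit is linear, so it distributes over the sum in the residual stream.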

Causal Intervention

Linear probes and DLA are powerful diagnostic tools, but they share a fundamental limitation: they measure correlation, not causation. A probe may reveal that a concept is linearly encoded in the hidden state, but that does not prove the model actually uses that encoding to make its predictions. It is possible, in principle, that the information is stored as a side effect of training and never actually consulted during inference.

Causal intervention addresses this limitation directly. The most common technique is activation patching: instead of merely observing the hidden state, we modify it and measure the effect on the output. The standard procedure is as follows. Take two Sudoku puzzles that differ in one specific aspect: a clean puzzle, in which digit 7 is still a valid candidate in a given cell, and a source puzzle, in which 7 has already been placed in the same row as that cell. Run both puzzles through the network, then transplant a specific hidden state component from the source run into the clean run, and measure how the output changes. If the model becomes less likely to predict a placement of digit 7 in that cell, then the transplanted component is causally responsible for tracking that constraint.
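The procedure can be sketched on a toy model. The network below is a stack of stand-in layers with random weights, the two inputs are random vectors rather than real puzzles, and the patched component is the update that one layer writes to the residual stream; all of these are assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, vocab_size = 8, 3, 5
weights = [rng.normal(scale=0.5, size=(d_model, d_model)) for _ in range(n_layers)]
W_U = rng.normal(size=(d_model, vocab_size))


def run(x, patch=None):
    """Run the toy model and return (logits, list of per-layer updates).

    patch = (layer, update) replaces what that layer writes to the stream.
    """
    residual, updates = x.copy(), []
    for i, W in enumerate(weights):
        update = np.tanh(W @ residual)
        if patch is not None and patch[0] == i:
            update = patch[1]  # transplant the activation from another run
        residual = residual + update
        updates.append(update)
    return residual @ W_U, updates


clean_x, source_x = rng.normal(size=d_model), rng.normal(size=d_model)
clean_logits, clean_updates = run(clean_x)
_, source_updates = run(source_x)

target = 2  # hypothetical index of the token of interest
for layer in range(n_layers):
    patched_logits, _ = run(clean_x, patch=(layer, source_updates[layer]))
    change = patched_logits[target] - clean_logits[target]
    print(f"patching layer {layer}: logit change {change:+.3f}")
```

A component whose patch produces a large change in the target logit is a candidate for carrying the information that distinguishes the two inputs; a component whose patch changes nothing can be set aside.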

Activation patching can be targeted very precisely: we can patch the hidden state at a specific layer, at a specific position, and even along a specific direction in the hidden state space (for example, the direction recovered by a linear probe). This makes it possible to test not just whether a component matters, but whether a specific piece of information encoded in a specific part of the network is causally used.
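Patching along a single direction amounts to replacing one coordinate of the hidden state and leaving everything orthogonal to it untouched. A minimal sketch, assuming the direction comes from somewhere else (for instance the weight vector of a trained probe) and using random vectors as stand-ins for real activations:

```python
import numpy as np


def patch_along_direction(clean_h, source_h, direction):
    """Replace the component of clean_h along `direction` with that of source_h."""
    v = direction / np.linalg.norm(direction)
    return clean_h + ((source_h - clean_h) @ v) * v


rng = np.random.default_rng(0)
d_model = 16
clean_h = rng.normal(size=d_model)    # stand-in for the clean run's hidden state
source_h = rng.normal(size=d_model)   # stand-in for the source run's hidden state
direction = rng.normal(size=d_model)  # stand-in for a probe's weight vector

patched_h = patch_along_direction(clean_h, source_h, direction)

v = direction / np.linalg.norm(direction)
print(f"along direction: clean {clean_h @ v:+.3f}, "
      f"source {source_h @ v:+.3f}, patched {patched_h @ v:+.3f}")
```

If this one-dimensional change is enough to flip the model's prediction, that is much stronger evidence that the probed direction is what the model relies on than patching the entire hidden state would be.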

Together, linear probes identify where information is encoded, DLA identifies which components produce a particular output, and causal intervention confirms whether those encodings are genuinely used. These three tools, applied in combination, form the core methodology of mechanistic interpretability.


Nathanaël Fijalkow

351, cours de la Libération F-33405 Talence cedex, France

nathanael.fijalkow@gmail.com