Over time, Transformer-based large language models (LLMs) have made substantial progress across a wide range of tasks, evolving from simple information retrieval systems to sophisticated agents capable of coding, writing, conducting research, and much more. Despite their capabilities, these models are still largely black boxes. Given an input, they accomplish the task, but we lack intuitive ways to understand how the task was actually accomplished.
LLMs are designed to predict the statistically best next word/token. But do they only focus on predicting the next token, or do they plan ahead? For instance, when we ask a model to write a poem, is it generating one word at a time, or is it anticipating rhyme patterns before outputting the word? Or, when asked a basic reasoning question such as “What is the capital of the state where Dallas is located?”, models often produce output that looks like a chain of reasoning, but did the model actually use that reasoning? We lack visibility into the model’s internal thought process. To understand LLMs, we need to trace their underlying logic.
The study of LLMs’ internal computation falls under “Mechanistic Interpretability,” which aims to uncover the computational circuits of models. Anthropic is one of the leading AI companies working on interpretability. In March 2025, they published a paper titled “Circuit Tracing: Revealing Computational Graphs in Language Models,” which aims to tackle the problem of circuit tracing.
This post aims to explain the core ideas behind their work and build a foundation for understanding circuit tracing in LLMs.
What is a circuit in LLMs?
Before we can define a “circuit” in language models, we first need to look inside the LLM. It is a neural network built on the transformer architecture, so it seems natural to treat neurons as the basic computational unit and interpret the patterns of their activations across layers as the model’s computational circuit.
However, the “Towards Monosemanticity” paper showed that tracking neuron activations alone doesn’t provide a clear understanding of why those neurons are activated. This is because individual neurons are often polysemantic: they respond to a mix of unrelated concepts.
The paper further showed that neuron activations can be decomposed into more fundamental units called features, which capture more interpretable information. In fact, a neuron can be seen as a combination of features. So rather than tracing neuron activations, we aim to trace feature activations, the actual units of meaning driving the model’s outputs.
With that, we can define a circuit as a sequence of feature activations and connections used by the model to transform a given input into an output.
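To make the neuron-versus-feature distinction concrete, here is a toy sketch of a single polysemantic neuron written as a weighted sum of feature activations. The feature names and weights are invented for illustration and are not taken from either paper.

```python
# Toy illustration: one neuron's activation as a weighted sum of features.
# Feature names and weights are hypothetical, chosen only for this example.

# How strongly each feature writes into a single neuron.
neuron_weights = {"legal text": 0.8, "DNA sequences": 0.7, "Korean text": 0.1}


def neuron_activation(feature_activations):
    """Sum each active feature's contribution to the neuron."""
    return sum(neuron_weights[name] * value
               for name, value in feature_activations.items())


# Two unrelated inputs, each activating a different feature.
legal_input = {"legal text": 1.0, "DNA sequences": 0.0, "Korean text": 0.0}
dna_input = {"legal text": 0.0, "DNA sequences": 1.0, "Korean text": 0.0}

legal_act = neuron_activation(legal_input)
dna_act = neuron_activation(dna_input)

# The neuron fires strongly in both cases, so its activation alone
# cannot tell us which concept is present. The features can.
print(legal_act, dna_act)
```

Looking only at the neuron, the two inputs are nearly indistinguishable; looking at the features, each input has a single clear cause.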
Now that we know what we’re looking for, let’s dive into the technical setup.
Technical Setup
We’ve established that we need to trace feature activations rather than neuron activations. To enable this, we need to convert the neurons of the existing LLM into features, i.e. build a replacement model that represents computations in terms of features.
Before diving into how this replacement model is built, let’s briefly review the architecture of Transformer-based large language models.
The following diagram illustrates how transformer-based language models operate. The input is first split into tokens, and each token is mapped to an embedding vector. These vectors are passed to the attention block, which calculates the relationships between tokens. Then, each token’s vector is passed to the multi-layer perceptron (MLP) block, which further refines it using a non-linear activation and linear transformations. This process is repeated across many layers before the model generates the final output.
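The flow described above can be sketched in a few lines of plain Python. This is a deliberately minimal sketch, not a real implementation: it assumes a single attention head, and it omits layer normalization, the attention output projection, and the final unembedding step. The tiny embedding table and weight matrices are made up for the example.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def add(a, b):
    """Element-wise sum of two vectors (used for residual connections)."""
    return [x + y for x, y in zip(a, b)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(residuals, w_q, w_k, w_v):
    """Single-head causal attention: each position mixes in earlier positions."""
    queries = [matvec(w_q, r) for r in residuals]
    keys = [matvec(w_k, r) for r in residuals]
    values = [matvec(w_v, r) for r in residuals]
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for i, query in enumerate(queries):
        scores = [sum(q * k for q, k in zip(query, keys[j])) / scale
                  for j in range(i + 1)]
        weights = softmax(scores)
        mixed = [sum(w * values[j][d] for j, w in enumerate(weights))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
    return outputs


def mlp(vector, w_in, w_out):
    """Linear map, ReLU non-linearity, linear map."""
    hidden = [max(0.0, h) for h in matvec(w_in, vector)]
    return matvec(w_out, hidden)


def transformer_layer(residuals, params):
    """Attention block then MLP block, each added back to the residual stream."""
    attn_out = attention(residuals, params["w_q"], params["w_k"], params["w_v"])
    residuals = [add(r, a) for r, a in zip(residuals, attn_out)]
    return [add(r, mlp(r, params["w_in"], params["w_out"])) for r in residuals]


# Hypothetical 2-dimensional embeddings and hand-picked weights.
embedding = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
params = {
    "w_q": identity, "w_k": identity, "w_v": identity,
    "w_in": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # 2 -> 3 hidden units
    "w_out": [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]],    # 3 -> 2
}

residuals = [embedding[token] for token in ["the", "cat"]]
for _ in range(2):  # the same layer is reused twice purely for brevity
    residuals = transformer_layer(residuals, params)
```

The part that matters for the rest of this post is `mlp`: it is the block that the replacement model swaps out.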
Now that we have laid out the structure of a transformer-based LLM, let’s look at what transcoders are. The authors used a “transcoder” to build the replacement model.
Transcoders
A transcoder is itself a neural network (usually with a hidden layer much wider than the LLM’s hidden dimension) designed to replace the MLP block in a transformer model with a more interpretable, functionally equivalent component built from features.

It processes tokens from the attention block in three stages: encoding, sparse activation, and decoding. Effectively, it projects the input to a higher-dimensional space, applies an activation that forces the model to activate only a sparse set of features, and then compresses the output back to the original dimension in the decoding stage.
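The three stages can be sketched as follows. This is a minimal sketch under stated assumptions: it uses a plain ReLU as the sparsifying activation (an assumption made here for simplicity; the activation used in practice may differ), and the weights are hand-picked so the result is easy to check rather than learned.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def transcoder_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode to a wider space, keep only positive features, decode back."""
    # 1. Encoding: project the input up to n_features dimensions.
    pre_activations = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    # 2. Sparse activation: ReLU zeroes out every feature that is not positive.
    features = [max(0.0, p) for p in pre_activations]
    # 3. Decoding: project the active features back down to d_model dimensions.
    output = [o + b for o, b in zip(matvec(w_dec, features), b_dec)]
    return features, output


# Hypothetical weights: d_model = 2, n_features = 4.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
b_dec = [0.0, 0.0]

features, output = transcoder_forward([0.5, -2.0], w_enc, b_enc, w_dec, b_dec)
active = [i for i, f in enumerate(features) if f > 0.0]
```

With these toy weights only two of the four features are active for this input, and each active feature has a readable role: one fires on a positive first coordinate, the other on a negative second coordinate.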

With a basic understanding of transformer-based LLMs and transcoders, let’s look at how a transcoder is used to build a replacement model.
Construct a replacement model
As mentioned earlier, a transformer block typically consists of two main components: an attention block and an MLP block (feedforward network). To build a replacement model, the MLP block in the original transformer model is replaced with a transcoder. This integration is seamless because the transcoder is trained to mimic the output of the original MLP, while also exposing its internal computations through sparse and modular features.
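One way to picture “trained to mimic the output of the original MLP” is as a loss with two terms: a reconstruction term that pulls the transcoder output toward the MLP output, and a sparsity term that discourages many features from firing at once. The sketch below uses squared error plus an L1 penalty, which is a common stand-in and an assumption here; the exact objective and the coefficient value are not taken from the paper.

```python
def transcoder_loss(transcoder_out, mlp_out, features, sparsity_coeff=0.01):
    """Reconstruction error against the real MLP output plus a sparsity penalty."""
    # How far the transcoder's output is from what the original MLP produced.
    reconstruction = sum((t - m) ** 2 for t, m in zip(transcoder_out, mlp_out))
    # L1 penalty: grows with the total amount of feature activation.
    sparsity = sum(abs(f) for f in features)
    return reconstruction + sparsity_coeff * sparsity


# Hypothetical values for a single token position.
mlp_out = [0.5, 1.0]                 # what the original MLP produced
transcoder_out = [0.5, 2.0]          # what the transcoder produced
features = [0.5, 0.0, 0.0, 2.0]      # the transcoder's feature activations

loss = transcoder_loss(transcoder_out, mlp_out, features)
perfect = transcoder_loss(mlp_out, mlp_out, [0.0, 0.0, 0.0, 0.0])
```

Minimizing the first term makes the replacement faithful; minimizing the second keeps it interpretable. The two goals pull against each other, which is why the replacement model is an approximation rather than an exact copy.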
While standard transcoders are trained to imitate the MLP behavior within a single transformer layer, the authors of the paper used a cross-layer transcoder (CLT), which captures the combined effects of multiple transcoder blocks across multiple layers. This is important because it allows us to track whether a feature is spread across multiple layers, which is needed for circuit tracing.
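The cross-layer idea can be sketched like this: features are computed from the residual stream at their own layer, but each feature may write to the MLP-equivalent output of its own layer and of every later layer. The data layout (a dictionary of decoders keyed by source and destination layer) and all numbers are assumptions made for this example. For simplicity the residual stream at each layer is taken as given rather than recomputed from earlier outputs.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def clt_forward(residuals_per_layer, encoders, decoders):
    """Cross-layer transcoder sketch.

    residuals_per_layer[l]: residual stream vector read at layer l.
    encoders[l]:            n_features x d_model matrix for layer l.
    decoders[(src, dst)]:   d_model x n_features matrix, defined for src <= dst.
    """
    n_layers = len(residuals_per_layer)
    d_model = len(residuals_per_layer[0])

    # Each layer's features read only from that layer's residual stream.
    features = [
        [max(0.0, p) for p in matvec(encoders[layer], residuals_per_layer[layer])]
        for layer in range(n_layers)
    ]

    # The output at layer dst sums contributions from features at all layers <= dst.
    outputs = []
    for dst in range(n_layers):
        total = [0.0] * d_model
        for src in range(dst + 1):
            contribution = matvec(decoders[(src, dst)], features[src])
            total = [t + c for t, c in zip(total, contribution)]
        outputs.append(total)
    return features, outputs


# Hypothetical two-layer example with d_model = 1 and one feature per layer.
encoders = [[[1.0]], [[1.0]]]
decoders = {(0, 0): [[1.0]], (0, 1): [[2.0]], (1, 1): [[3.0]]}
residuals_per_layer = [[1.0], [0.5]]

features, outputs = clt_forward(residuals_per_layer, encoders, decoders)
```

In this toy case the layer-0 feature contributes to both outputs, while the layer-1 feature contributes only to the second one.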
The image below illustrates how the cross-layer transcoder (CLT) setup is used in building a replacement model. The transcoder output at layer 1 contributes to constructing the MLP-equivalent output in all the upper layers until the end.

Side note: the following image is from the paper and shows how a replacement model is built. It replaces the neurons of the original model with features.

Now that we understand the architecture of the replacement model, let’s look at how an interpretable representation is built on the replacement model’s computational path.
Interpretable representation of the model’s computation: Attribution graph
To build the interpretable representation of the model’s computational path, we start from the model’s output feature and trace backward through the feature network to uncover which earlier features contributed to it. This is done using the backward Jacobian, which tells us how much a feature in an earlier layer contributed to the current feature’s activation, and it is applied recursively until we reach the input. Each feature is treated as a node and each influence as an edge. This process can lead to a complex graph with millions of edges and nodes, so pruning is done to keep the graph compact and manually interpretable.
The authors refer to this computational graph as an attribution graph and have also developed a tool to inspect it. This forms the core contribution of the paper.
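The backward tracing and pruning steps can be sketched on a toy graph. This sketch assumes the connections between nodes are linear, so the Jacobian entry for an edge is just its weight and the edge’s attribution is the source activation times that weight. Pruning here is a simple magnitude threshold, which is a simplification chosen for this example and not necessarily the pruning rule the authors use. The node names, activations, and weights are hypothetical, loosely following the Dallas question from the introduction.

```python
def build_attribution_graph(activations, weights, output_node, threshold):
    """Trace backward from output_node, keeping edges with |attribution| >= threshold.

    activations: node name -> activation value.
    weights:     (source, target) -> linear weight (the Jacobian entry).
    Returns a list of (source, target, attribution) edges.
    """
    # Index the edges by their target so we can walk backward.
    incoming = {}
    for (source, target), weight in weights.items():
        incoming.setdefault(target, []).append((source, weight))

    edges = []
    visited = set()
    frontier = [output_node]
    while frontier:
        target = frontier.pop()
        if target in visited:
            continue
        visited.add(target)
        for source, weight in incoming.get(target, []):
            attribution = activations[source] * weight
            if abs(attribution) < threshold:
                continue  # prune weak influences to keep the graph readable
            edges.append((source, target, attribution))
            frontier.append(source)  # keep tracing toward the input
    return edges


# Hypothetical nodes: input tokens, intermediate features, and an output logit.
activations = {
    "token:Dallas": 1.0,
    "token:capital": 1.0,
    "feature:Texas": 0.9,
    "feature:say a capital": 0.8,
    "feature:sports": 0.05,
}
weights = {
    ("token:Dallas", "feature:Texas"): 0.9,
    ("token:Dallas", "feature:sports"): 0.05,
    ("token:capital", "feature:say a capital"): 0.8,
    ("feature:Texas", "logit:Austin"): 1.5,
    ("feature:say a capital", "logit:Austin"): 1.2,
    ("feature:sports", "logit:Austin"): 0.3,
}

edges = build_attribution_graph(activations, weights, "logit:Austin", threshold=0.1)
for source, target, attribution in edges:
    print(f"{source} -> {target}: {attribution:.2f}")
```

The weak “sports” feature is pruned at the first step, so nothing upstream of it is explored, and the surviving edges form two short paths from the input tokens to the output.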
The image below illustrates a sample attribution graph.

Now, with all this understanding, we can move on to feature interpretability.
Feature interpretability using an attribution graph
The researchers used attribution graphs on Anthropic’s Claude 3.5 Haiku model to study how it behaves across different tasks. In the case of poem generation, they discovered that the model doesn’t just generate the next word. It engages in a form of planning, both forward and backward. Before generating a line, the model identifies several possible rhyming or semantically appropriate words to end with, then works backward to craft a line that naturally leads to that target. Surprisingly, the model appears to hold multiple candidate end words in mind simultaneously, and it can restructure the entire sentence based on which one it ultimately chooses.
This technique offers a clear, mechanistic view of how language models generate structured, creative text. This is a significant milestone for the AI community. As we develop increasingly powerful models, the ability to trace and understand their internal planning and execution will be essential for ensuring alignment, safety, and trust in AI systems.
Limitations of the current approach
Attribution graphs offer a way to trace model behavior for a single input, but they don’t yet provide a reliable method for understanding global circuits, that is, the consistent mechanisms a model uses across many examples. The analysis relies on replacing MLP computations with transcoders, but it is still unclear whether these transcoders truly reflect the original mechanisms or merely approximate the outputs. Additionally, the current approach highlights only active features, but inactive or inhibitory ones can be just as important for understanding the model’s behavior.
Conclusion
Circuit tracing via attribution graphs is an early but important step toward understanding how language models work internally. While this approach still has a long way to go, the introduction of circuit tracing marks a major milestone on the path to true interpretability.