Circuit discovery, a key technique in mechanistic interpretability, aims to pinpoint the components of a large language model (LLM) that are crucial for performing a given task. However, substantial variability in the discovered circuits has raised concerns about the technique's reliability. A new paper, "Demystifying Variance in Circuit Discovery of LLMs" by Wu, Tonin, and Cevher, published on arXiv, systematically examines three types of variance and proposes a new method to mitigate one of them.
The current state-of-the-art method, EAP-IG (edge attribution patching with integrated gradients), scores well on the (un)faithfulness metric, but the circuits it finds vary substantially. The authors identify three distinct categories of variance:
- Resampling variance: The circuit changes when probing with a new batch of data from the same distribution.
- Rephrasing variance: The discovered circuit shifts when the prompts are rephrased.
- Sample-wise variance: A circuit with low population unfaithfulness exhibits large fluctuations in unfaithfulness across individual samples.
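One way to make the first two categories concrete is to compare the circuits found in two runs as sets of edges. The following is a minimal sketch under stated assumptions, not the paper's evaluation protocol: it assumes a circuit is the top-k edges by absolute attribution score, simulates per-batch scores as a fixed "true" score plus Gaussian noise, and measures agreement with Jaccard similarity. The helper names (`top_k_edges`, `jaccard`, `noisy_scores`) and all numbers are hypothetical.

```python
import random

def top_k_edges(scores, k):
    """Return the set of the k edges with the largest absolute score."""
    ranked = sorted(scores, key=lambda e: abs(scores[e]), reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    """Jaccard similarity of two edge sets (1.0 means identical circuits)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def noisy_scores(true_scores, noise, rng):
    """Simulate one batch's attribution scores: true score plus Gaussian noise."""
    return {e: s + rng.gauss(0.0, noise) for e, s in true_scores.items()}

# Synthetic setup (an assumption for illustration, not data from the paper).
rng = random.Random(0)
edges = [f"e{i}" for i in range(200)]
true_scores = {e: rng.gauss(0.0, 1.0) for e in edges}

# Two "batches" drawn from the same distribution yield two candidate circuits.
batch_a = top_k_edges(noisy_scores(true_scores, 0.5, rng), k=20)
batch_b = top_k_edges(noisy_scores(true_scores, 0.5, rng), k=20)
overlap = jaccard(batch_a, batch_b)
print(f"Jaccard overlap between the two circuits: {overlap:.2f}")
```

An overlap well below 1.0 between batches from the same distribution would correspond to resampling variance. The same comparison applied to circuits found from differently phrased prompts would give a rough picture of rephrasing variance.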
CEAP: A New Method with Theoretical Guarantees
To address resampling variance, the researchers introduce CEAP, an improvement on EAP-IG that comes with a theoretical guarantee. According to the paper, CEAP substantially reduces resampling variance, so the components it identifies are more consistent from one batch of data to the next.
The Challenge of Rephrasing Variance
Rephrasing variance arises because prompts with different templates tend to activate different circuits in the model. The authors argue that this makes it challenging to find a comprehensive circuit that explains and controls the model's behavior on a task expressed in countless templates. They suggest that this phenomenon indicates LLMs may be inherently hard to steer. Interestingly, the paper notes that sparsity, which has been claimed to form more compact and interpretable task circuits, fails to solve this problem.
Sample-Wise Variance: Mostly Benign
Regarding sample-wise variance, the authors argue it is largely benign. Extremely poor unfaithfulness scores often stem from how unfaithfulness is defined rather than from defects in the measured circuits. They show that the magnitude of unfaithfulness is affected by selective contribution scaling, a neural mechanism that accounts for the extremely poor scores sometimes observed.
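The gap between a population-level score and its per-sample spread can be illustrated with a few lines of Python. This is only a sketch: the per-sample values below are hypothetical, not results from the paper, and it assumes that lower unfaithfulness is better and that the population score is the mean over samples.

```python
from statistics import mean, pstdev

def summarize(per_sample):
    """Return (population mean, sample-wise standard deviation, worst sample)."""
    return mean(per_sample), pstdev(per_sample), max(per_sample)

# Hypothetical per-sample unfaithfulness values (lower is better).
# These are made-up numbers for illustration, not data from the paper.
scores = [0.02, 0.01, 0.03, 0.02, 0.90, 0.01, 0.02, 0.03]

avg, spread, worst = summarize(scores)
print(f"mean={avg:.3f} std={spread:.3f} worst={worst:.2f}")
```

In this toy case a single outlying sample dominates the spread while the mean stays comparatively low, which is the pattern the authors describe as sample-wise variance.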
| Variance Type | Definition | Key Insight |
|---|---|---|
| Resampling variance | Circuit changes with new data batches from same distribution | CEAP method reduces this variance |
| Rephrasing variance | Circuit shifts when prompts rephrased | Suggests LLMs may be inherently hard to steer; sparsity doesn't help |
| Sample-wise variance | Unfaithfulness fluctuations across individual samples | Mostly benign; poor scores due to definition, not circuit defects |
For enterprise technology decision-makers, this research underscores the importance of understanding the limitations of current interpretability methods when deploying LLMs in production environments. While circuit discovery can pinpoint relevant components, variance across rephrasings and data samples means that a single discovered circuit may not reliably represent model behavior for all inputs. The CEAP method offers a step forward in reducing resampling variance, but the fundamental challenge of rephrasing variance suggests that steering LLMs with high reliability remains an open problem.