Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?

Aobo Yang; Ayush Warikoo; Chia-Tse Shao; Nehal Bandi; Philippe Chlenski; Vivek Miglani; Yingxiao Ye; Zachariah Carmichael

arxiv: 2606.32008 · v1 · pith:SX6F7LLLnew · submitted 2026-06-30 · 💻 cs.LG

Philippe Chlenski, Zachariah Carmichael, Ayush Warikoo, Chia-Tse Shao, Yingxiao Ye, Aobo Yang, Vivek Miglani, Nehal Bandi

Pith reviewed 2026-07-01 06:13 UTC · model grok-4.3

classification 💻 cs.LG

keywords: surrogate fidelity, mechanistic interpretability, attribution fidelity, open language models, closed language models, prediction fidelity, input ablation, leave-one-out attribution

The pith

Open and closed language models that agree on answers often disagree on the reasons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks when measurements made on fully accessible open language models can be used to explain the internal behavior of closed models that are available only through limited APIs. It measures surrogate fidelity at three levels: raw predictions, attributions that identify which inputs matter, and representations. Using binary classification tasks, log-odds serve as a scalar proxy for the representation space while leave-one-out attributions on open models are compared against input-ablation effects on closed models. The central result is that prediction agreement substantially exceeds attribution agreement across eleven models in four families. This means mechanistic explanations derived from open models do not reliably carry over to closed targets even when the models reach the same final answer.

Core claim

Across eleven models spanning four families, prediction fidelity substantially overstates attribution fidelity: models that agree on what the answer is often disagree on why. Log-odds provide an API-compatible scalar readout of representation space, and leave-one-out attributions give insight into behavior on open models. White-box signals such as attention patterns remain stable across models yet only weakly predict the causal attributions that black-box input ablations capture by design. Mechanistic insight does not automatically transfer to closed targets, and prediction-level agreement alone is insufficient to warrant such transfer.
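A minimal numerical sketch of the log-odds readout, assuming a toy representation and a made-up unembedding matrix (`U`, `z`, and the token ids `plus` and `minus` are illustrative, not taken from the paper's code). It checks that the difference of two log-probabilities, which is what an API can expose, matches the difference of the corresponding logits and the inner product of the representation with v = u+ − u−.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

rng = np.random.default_rng(0)
d, vocab = 16, 50                  # toy sizes (assumptions)
U = rng.normal(size=(vocab, d))    # hypothetical unembedding matrix
z = rng.normal(size=d)             # hypothetical final-layer representation
plus, minus = 3, 7                 # hypothetical ids for the two answer tokens

logits = U @ z
logp = log_softmax(logits)

# Three routes to the same scalar:
log_odds_from_api = logp[plus] - logp[minus]          # log-probabilities only
log_odds_from_logits = logits[plus] - logits[minus]   # needs white-box logits
log_odds_from_projection = z @ (U[plus] - U[minus])   # inner product with v
```

The softmax normalizer cancels in the difference, which is why the black-box quantity agrees with the white-box ones in this sketch.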

What carries the argument

Surrogate fidelity at the attribution level, measured by comparing leave-one-out attributions computed on open models against input-ablation effects measured on closed models.
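The comparison can be illustrated with a small sketch, under several assumptions: `score_fn` is a hypothetical callable returning a log-odds value for a list of input segments, attributions use the sign convention "full score minus ablated score", and fidelity is the squared Pearson correlation between two models' scores. The two toy "models" below are additive scorers invented for illustration only.

```python
import numpy as np

def leave_one_out(segments, score_fn):
    """Change in score when each segment is removed in turn."""
    base = score_fn(segments)
    return np.array([
        base - score_fn(segments[:i] + segments[i + 1:])
        for i in range(len(segments))
    ])

def r2_fidelity(a, b):
    """Squared Pearson correlation between two 1-D arrays."""
    r = np.corrcoef(np.asarray(a, float), np.asarray(b, float))[0, 1]
    return r ** 2

# Toy stand-ins for two models: each sums fixed per-segment weights.
weights_a = {"s1": 2.0, "s2": -1.0, "s3": 0.5, "s4": 0.0}
weights_b = {"s1": 0.1, "s2": 0.4, "s3": 1.9, "s4": -0.9}
model_a = lambda segs: sum(weights_a[s] for s in segs)
model_b = lambda segs: sum(weights_b[s] for s in segs)

segments = ["s1", "s2", "s3", "s4"]
attr_a = leave_one_out(segments, model_a)
attr_b = leave_one_out(segments, model_b)
f_attr = r2_fidelity(attr_a, attr_b)
```

In this toy case both scorers give the same log-odds on the full prompt, yet their per-segment attributions are nearly uncorrelated, which is the kind of gap the attribution-level measurement is meant to expose.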

If this is right

Prediction agreement between open and closed models is not sufficient evidence that their internal reasoning aligns.
White-box signals such as attention patterns are highly stable across models but only weakly tied to causal attributions.
Black-box input ablations capture causal attributions more directly than stable white-box signals.
Mechanistic interpretability methods developed on open models cannot be assumed to explain closed models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attribution methods that operate entirely through black-box APIs may be needed when surrogates prove unreliable.
The observed gap between prediction and attribution fidelity may appear in non-classification tasks if similar comparison protocols are applied.
Auditing or explaining proprietary models may require new black-box techniques rather than reliance on open proxies.

Load-bearing premise

Leave-one-out attributions on open models and input-ablation effects on closed models are comparable quantities that can be used to judge attribution fidelity.

What would settle it

An experiment that finds high agreement between leave-one-out attributions on an open model and input-ablation effects on a matched closed model across the same set of inputs and tasks.

Figures

Figures reproduced from arXiv: 2606.32008 by Aobo Yang, Ayush Warikoo, Chia-Tse Shao, Nehal Bandi, Philippe Chlenski, Vivek Miglani, Yingxiao Ye, Zachariah Carmichael.

**Figure 1.** The surrogate fidelity evaluation pipeline. For each input corpus D, we extract three signals from each model: prediction log-odds ℓM, ablation-based attributions ∆ℓM, and perturbation responses ∆z. Prediction fidelity (Fpred) and attribution fidelity (Fattr) can be computed for any model pair, including closed-source targets accessible only via API. Representation fidelity (Frepr) and cross-level fidelity…

**Figure 2.** The geometric intuition behind our evaluations. (a) Logits are computed from a representation z by projection onto unembedding vectors u− and u+; however, these projections are not recoverable from the top-K log-probabilities exposed by most LLM inference providers. (b) The log-odds are equivalent to the difference in logits, which in turn equals the projection of z onto the difference vector v = u+ − u−. …

**Figure 3.** (a) Pairwise Fpred and Fattr heatmaps for eleven models across four families. (b) Log-odds contour plot for a representative cross-family pair (Llama 3-8B vs. GPT-4o), stratified by ground truth answer (True vs False). (c) Ablation contour plot for the same pair. (d) Example BoolQ prompt with a sentence-level ablation (greyed-out text). 3. Method We formalize surrogate fidelity at three levels of increasin…

**Figure 4.** Perturbations affect attribution through their norm (left), alignment with the log-odds direction v (middle), and indirectly via LayerNorm (right). We plot joint and marginal distributions of each, for a representative model pair. … r² = 0.762 to 0.984, and attention rollout, which accounts for residual connections across layers (Abnar & Zuidema, 2020), achieves median r² = 0.848. Even Qwen-0.5B, which was…
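For readers unfamiliar with attention rollout, the following is a minimal sketch of its common formulation: average heads, mix each layer's attention equally with the identity to model the residual path, then multiply across layers. The equal 0.5 weighting, the head averaging, and the random toy attention maps are assumptions of this sketch; the paper's exact pooling may differ.

```python
import numpy as np

def attention_rollout(attn_per_layer):
    """attn_per_layer: list of (heads, tokens, tokens) row-stochastic arrays."""
    n = attn_per_layer[0].shape[-1]
    rollout = np.eye(n)
    for attn in attn_per_layer:
        a = attn.mean(axis=0)                     # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)             # add the residual (identity) path
        a = a / a.sum(axis=-1, keepdims=True)     # keep each row normalized
        rollout = a @ rollout                     # compose with earlier layers
    return rollout

rng = np.random.default_rng(0)

def random_attention(heads, tokens):
    # Toy attention maps: softmax over random scores (illustrative only).
    scores = rng.normal(size=(heads, tokens, tokens))
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

layers = [random_attention(heads=2, tokens=5) for _ in range(3)]
rollout = attention_rollout(layers)   # rollout[i, j]: flow from input j to position i
```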

**Figure 5.** Bootstrap-CI pair-averaged per-layer F_mag^(l) and F_attr^(l) versus relative depth, with a random-direction control overlay (gray dashed). Both F_attr^(l) and the control rise late. Per-layer fidelity. We track the per-layer variants F_mag^(l) and F_attr^(l) of the metrics defined in Equations (11) and (12) in…

**Figure 6.** Fidelity metrics across a variety of measurements for Qwen 2.5 models with 0.5B, 3B, 7B, and 14B parameters. We measure attributions under ablation; mean-, max-, and rollout-pooled attention scores; perturbation norms; and perturbation alignment, all on BoolQ at the sentence level. We find that ablation scores are generally hard to predict, even using surrogate models of different sizes; predicting ablation…

**Figure 7.** Median log-odds by ground-truth label for instruct (solid circles) and base (dashed squares) models, with individual prompt trajectories shown as translucent lines. Instruction-tuned models exhibit increasing class separation with scale, while base models remain near zero regardless of size. This suggests that the prediction signal measured by Fpred is largely a product of instruction tuning rather than pr…

**Figure 8.** Distribution of model log-odds on BoolQ by ground-truth label. Each panel shows stacked histograms of log-odds for a single model, with green = TRUE and red = FALSE. Models above 3B parameters produce bimodal distributions with clear class separation. Qwen-0.5B exhibits substantial overlap between classes, consistent with its low prediction fidelity. The scale of log-odds varies widely across models (e.g.,…

**Figure 9.** Left: a scatterplot of Spearman ρ against Pearson r². Note that only Fattr has r² > ρ. Right: a scatterplot of Frepr versus Fcross for different representation-level quantities. Note that Fpred and Fattr are excluded here, and that only Falign has better cross- than self-prediction. Axis labels recovered from the adjacent plot: Single-token; Multi-token aggregated.

**Figure 10.** The correlation between the log-sum-exp of a set of tokens and a single-token variant is high, likely due to tight clustering between semantically similar tokens in unembedding space. Across the open models used in this paper, the average unembedding vector cosine similarity is 0.648 for true tokens, 0.647 for false tokens, and 0.515 between true and false tokens. The high cosine similarity between true a…
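The two readouts compared in Figure 10 can be sketched as follows, assuming the multi-token variant pools each answer's surface forms with a log-sum-exp over their log-probabilities. The token ids and probabilities are invented for illustration, and the function names are hypothetical.

```python
import numpy as np

def logsumexp(x):
    # Stable log of the sum of exponentials.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def multi_token_log_odds(logp, true_ids, false_ids):
    """Log-odds after pooling probability mass over each answer's token set."""
    return logsumexp(logp[true_ids]) - logsumexp(logp[false_ids])

def single_token_log_odds(logp, true_id, false_id):
    """Log-odds from one canonical token per answer."""
    return logp[true_id] - logp[false_id]

# Toy next-token distribution over a six-token vocabulary (illustrative).
logp = np.log(np.array([0.30, 0.10, 0.05, 0.20, 0.15, 0.20]))
true_ids = [0, 1, 2]    # hypothetical surface forms of the "true" answer
false_ids = [3, 4]      # hypothetical surface forms of the "false" answer

multi = multi_token_log_odds(logp, true_ids, false_ids)
single = single_token_log_odds(logp, true_ids[0], false_ids[0])
```

When each set contains a single token, the two functions return the same value, so the multi-token variant is a strict generalization in this sketch.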

**Figure 11.** Per-layer trajectories and KL. Panels: (a) answer = True, (b) answer = False. Axes: relative depth t/T (x) versus median log-odds, true vs. false (y). Legend: Qwen2.5-0.5B, Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Llama-3.1-8B (n = 185 each).

**Figure 12.** Same as Figure 11a but split by ground-truth answer. The expected sign separation appears almost entirely in the second half of depth and is cleanest for Llama-3.1-8B and Qwen-14B. Figure 11a plots the per-layer trajectories of ℓ^(l). All five models start with a substantial embedding-layer bias whose sign and magnitude depend on the model’s static embedding of the assistant-marker token at the predictio…

**Figure 13.** Per-layer structural perturbation norm and readout-aligned attribution magnitude. Axes: relative depth t/T (x) versus median |cos(Δẑt, w)| (y). Legend: Qwen2.5-0.5B, Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Llama-3.1-8B. (a) Median |cos(∆z^(l), v)| per relative depth. Comparing with Figures 13a and 13b shows that the late rise in |∆ℓ^(l)| is driven by the rising |cos| factor o…

**Figure 14.** Per-layer directional alignment and pair-averaged attribution magnitudes. C.3. Cross-model fidelity at depth We compute the per-layer extensions F_mag^(l) and F_attr^(l) of the metrics defined in Equations (11) and (12) on the 2,105 (PROMPT, SEGMENT) pairs shared by all five models. Per-model trajectories are resampled to a common length-64 grid in relative depth l/L ∈ [0, 1] via linear interpolation in l…
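The resampling step mentioned above can be sketched with one-dimensional linear interpolation. This assumes the first and last recorded layers map to relative depths 0 and 1; the function name and the 29-layer toy trajectory are illustrative, and the source text is truncated before it states the exact indexing.

```python
import numpy as np

def resample_to_relative_depth(values, grid_size=64):
    """Linearly interpolate a per-layer trajectory onto a shared depth grid."""
    values = np.asarray(values, dtype=float)
    source_depth = np.linspace(0.0, 1.0, len(values))   # l / L for each layer
    target_depth = np.linspace(0.0, 1.0, grid_size)     # common grid
    return np.interp(target_depth, source_depth, values)

# Toy trajectory for a model with 29 recorded layers (illustrative).
trajectory = np.linspace(-2.0, 3.0, 29)
resampled = resample_to_relative_depth(trajectory)
```

Putting models with different layer counts on one grid is what allows per-layer quantities to be correlated across model pairs at matched relative depth.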

**Figure 15.** Per-layer correlation analysis with bootstrap confidence intervals and random-direction control. Figures 14b and 15a aggregate over the 10 model pairs. The headline pattern at this scale is a clean depth dissociation:
• Structural agreement F_mag^(l) is high (∼0.65–0.85) across most depths past the embedding slot, then tapers to ∼0.51 at the readout.
• Functional agreement F_attr^(l) is low (≲0.08) for th…

**Figure 16.** Fpred, Fattr, and Fattn versus model confidence quintile on BoolQ (mean pairwise Pearson R² across all six Qwen 2.5 instruct pairs). Prediction fidelity rises sharply from Q1 to Q5 (0.03 → 0.75); attribution fidelity more than doubles (0.27 → 0.61); attention fidelity is flat (0.958 ± 0.001). Prediction fidelity rises from Fpred = 0.03 in the lowest-confidence quintile to 0.75 in the highest, a > 20× incr…
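A sketch of how a confidence-stratified fidelity curve of this kind could be computed, assuming confidence is the absolute log-odds |ℓ| of one model and fidelity is the squared Pearson correlation within each quintile. The synthetic scores below are invented and are not meant to reproduce the reported numbers.

```python
import numpy as np

def fidelity_by_quintile(scores_a, scores_b, confidence, n_bins=5):
    """Squared Pearson correlation of two score arrays within confidence bins."""
    edges = np.quantile(confidence, np.linspace(0, 1, n_bins + 1))
    # Map each example to a bin index in [0, n_bins - 1].
    bins = np.clip(np.searchsorted(edges, confidence, side="right") - 1,
                   0, n_bins - 1)
    out = []
    for k in range(n_bins):
        mask = bins == k
        r = np.corrcoef(scores_a[mask], scores_b[mask])[0, 1]
        out.append(r ** 2)
    return np.array(out)

# Synthetic log-odds: model B equals model A plus unit-variance noise.
rng = np.random.default_rng(0)
signal = rng.normal(scale=2.0, size=2000)
scores_a = signal
scores_b = signal + rng.normal(size=2000)

per_bin = fidelity_by_quintile(scores_a, scores_b, confidence=np.abs(scores_a))
```

In this toy setup the low-confidence bins have little signal variance relative to the noise, so their within-bin correlation is low, while the high-confidence bin spans both large positive and large negative scores and correlates strongly.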

**Figure 17.** Fidelity versus confidence quintile, stratified by each individual Qwen model’s |ℓ| as well as their mean. The slope is similar regardless of which model’s confidence defines the bins: model confidence is a robust predictor of fidelity independent of which model supplies it. For both Fpred and Fattr, the fidelity-versus-confidence slope is qualitatively similar regardless of whether we stratify by the sma…

**Figure 18.** 4 × 4 NRMSE heatmaps for Fpred and Fattr across the Qwen 2.5 instruct family (0.5B, 3B, 7B, 14B), with rows indexing the surrogate and columns indexing the target. Cell values as extracted from the plot, in reading order. Prediction Fidelity: 9.83, 12.12, 15.16, 0.97, 0.72, 0.92, 0.97, 0.59, 0.62, 0.99, 0.61, 0.51. Attribution Fidelity: 8.37, 10.48, 11.54, 1.03, 1.02, 1.18, 1.04, 0.81, 0.88, 0.99, 0.81, 0.76. …

**Figure 19.** 5 × 5 split heatmap of Fmag and Falign across the five open models on BoolQ (Qwen-0.5B, Qwen-3B, Qwen-7B, Qwen-14B, Llama-8B). Cell values as extracted from the plot, in reading order: 0.788, 0.641, 0.666, 0.768, 0.110, 0.855, 0.871, 0.890, 0.064, 0.262, 0.903, 0.855, 0.059, 0.227, 0.294, 0.877, 0.058, 0.205, 0.221, 0.195.

**Figure 20.** Prediction and attribution fidelity under system-prompt perturbation on ANLI R3. Pearson R² grouped by model-pair family. Dark purple: Fpred under the baseline system prompt. Light purple: Fpred under diverse GEPA-generated candidates. Cyan: Fattr under the same candidates, computed as the Pearson R² of ∆ℓ induced by replacing the baseline prompt. Structured system prompts drive output convergence while…

**Figure 21.** Cross-model system-prompt transfer on ANLI R3 (n = 400). Left: ∆accuracy when recipient model MR is evaluated under a system prompt optimized for donor model MD, relative to MR’s baseline performance. Cross-model transfer is predominantly neutral or negative. Right: sample-level prediction agreement between MR under MD’s optimized prompt and MR’s own baseline (off-diagonal mean 90.4%). Optimized prompts l…

**Figure 22.** Recipient susceptibility to voice transplant on ANLI R3. For each ordered pair (MD, MR), we define VT(MD → MR) = Pearson r on the gap in donor and recipient baseline log-odds (ℓMD − ℓMR) and the recipient’s log-odds shift under the donor’s optimized prompt. Bars show the mean VT across all non-self donors; positive values indicate the recipient shifts toward the donor, negative values indicate a shift awa…
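The VT statistic defined in this caption can be sketched directly. The assumption here is that the recipient's shift is its log-odds under the donor's prompt minus its own baseline log-odds; the function name and the synthetic arrays are illustrative.

```python
import numpy as np

def voice_transplant(donor_base, recipient_base, recipient_under_donor_prompt):
    """Pearson r between the donor-recipient gap and the recipient's shift."""
    gap = donor_base - recipient_base                       # l_MD - l_MR at baseline
    shift = recipient_under_donor_prompt - recipient_base   # assumed sign convention
    return np.corrcoef(gap, shift)[0, 1]

# Synthetic baseline log-odds for 50 prompts (illustrative only).
rng = np.random.default_rng(0)
donor_base = rng.normal(size=50)
recipient_base = rng.normal(size=50)

# A recipient that moves halfway toward the donor, and one that moves away.
moved_toward = recipient_base + 0.5 * (donor_base - recipient_base)
moved_away = recipient_base - 0.5 * (donor_base - recipient_base)

vt_toward = voice_transplant(donor_base, recipient_base, moved_toward)
vt_away = voice_transplant(donor_base, recipient_base, moved_away)
```

Under this convention a recipient that shifts in proportion to the gap scores +1 and one that shifts in the opposite direction scores −1, matching the caption's reading of positive and negative values.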

**Figure 23.** RACE cross-model Multivariate RV fidelity scores. Pairwise Fpred (purple) and Fattr (blue) heatmaps for eleven models across four families. RV coefficient. To measure agreement between two models’ vector-valued predictions or attributions, we require a multivariate extension of correlation. We use the RV coefficient (Robert & Escoufier, 1976), which measures the closeness of two data matrices in the Hilbe…

**Figure 24.** Correlation between representation CKA and R² Fidelity on BoolQ. The prediction fidelity and CKA are perfectly aligned, but surprisingly the attribution fidelity says the opposite. Linear CKA is mathematically equivalent to the RV coefficient. When applied to scalar data (d = 1), it reduces to the squared Pearson correlation r², which is the binary prediction fidelity metric Fpred of Equation (10). The…

**Figure 25.** RV convergence to CKA. X-axis: projection dimension n (log scale, 1 to d). Y-axis: RV on projected data. Three curves (random Gaussian, unembeddings coupled, unembeddings uncoupled) for prediction and attribution respectively converging to horizontal dashed line (representation CKA). Projected vs. full representational CKA. To validate that our log-odds fidelity metrics are faithful proxies for representa…
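The relationship stated in the Figure 23 and Figure 24 captions, that the RV coefficient coincides with linear CKA and reduces to squared Pearson correlation for scalar data, can be checked numerically. This sketch uses the standard column-centered Frobenius-norm form; the synthetic matrices are illustrative.

```python
import numpy as np

def rv_coefficient(x, y):
    """RV coefficient (equivalently linear CKA) between two (n, d) matrices."""
    x = x - x.mean(axis=0, keepdims=True)   # center each column
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") *
                    np.linalg.norm(y.T @ y, "fro"))

rng = np.random.default_rng(0)

# Scalar case (d = 1): RV should equal squared Pearson correlation.
x1 = rng.normal(size=(100, 1))
y1 = 0.5 * x1 + rng.normal(size=(100, 1))
rv_scalar = rv_coefficient(x1, y1)
r_squared = np.corrcoef(x1[:, 0], y1[:, 0])[0, 1] ** 2

# Multivariate case: RV is invariant to an orthogonal rotation of one input.
x4 = rng.normal(size=(100, 4))
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix
rv_rotated = rv_coefficient(x4, x4 @ q)
```

The rotation invariance is why a representation-level comparison does not depend on each model's choice of basis, whereas the scalar log-odds readout fixes one direction per model.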

Original abstract

Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when do measurements made on open models allow us to make claims about a closed model? We evaluate surrogate fidelity at the prediction, attribution, and representation levels. For binary classification tasks, log-odds provide an API-compatible scalar readout of the model's representation space, and leave-one-out attributions provide insight into model behavior. Across eleven models spanning four families (Llama, Qwen, GPT, and Gemini), we find that prediction fidelity substantially overstates attribution fidelity: models that agree on what the answer is often disagree on why. We document an access-validity inversion: white-box signals like attention patterns and perturbation magnitudes are highly stable across models but only weakly predictive of causal attributions, which black-box input ablations capture by design. Mechanistic insight does not automatically transfer to closed targets, and prediction-level agreement is insufficient to warrant such transfer. Code and results are available at https://github.com/facebookresearch/surrogate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Prediction agreement between open and closed LLMs does not imply matching attributions, and the attribution comparison rests on uncalibrated methods.

The main thing to know is that the paper finds prediction-level agreement between open and closed models overstates their agreement on attributions. Across eleven models from four families, the authors show that models can match on the answer while diverging on which inputs drive it. They also document what they call an access-validity inversion: white-box signals that are stable across models only weakly predict causal effects.

What is new is the surrogate-fidelity framing itself plus the three-level breakdown (prediction, attribution, representation) and the explicit comparison of open versus closed targets. The broad model sweep and the public code release are concrete positives; the work is a direct empirical check rather than a re-derivation of existing equations.

The soft spot is the attribution step. Leave-one-out on open models and input ablations on closed models are used as if they measure the same causal quantity, yet the paper itself notes that white-box and black-box signals can diverge even on the same model. Without a calibration run on models where both techniques are feasible, the reported gap could be inflated by differences in how each method handles token position and interactions. The abstract claims consistent patterns but supplies no details on statistical tests or data exclusions, so robustness is hard to judge from what is shown.

This is for interpretability researchers who want to use open models as proxies for closed ones. It deserves peer review because the framing is original and the scope is decent, even though the attribution methods need tighter validation before the central claim can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The paper claims that for binary classification tasks, prediction-level agreement between open and closed LLMs substantially overstates attribution-level agreement: across eleven models from four families, models that agree on answers often disagree on attributions. It introduces surrogate fidelity evaluation at prediction, attribution, and representation levels, documents an access-validity inversion (stable white-box signals like attention are weakly predictive of causal attributions captured by black-box ablations), and concludes that mechanistic insights do not transfer automatically from open to closed models.

Significance. If the attribution gap is shown to be robust to the choice of attribution method, the result would usefully caution the interpretability community against assuming that open-model explanations apply to closed targets. The public release of code and results at the cited GitHub repository is a clear strength that supports reproducibility and follow-up work.

major comments (2)

[Methods] Methods (attribution and ablation procedures): leave-one-out attributions computed via full forward passes on open models are treated as directly comparable to black-box input-ablation effects on closed models when quantifying attribution fidelity, yet the manuscript provides no cross-method calibration experiment on any model where both techniques can be run. Because the two procedures differ in their handling of token interactions, position sensitivity, and normalization, the reported gap between prediction and attribution fidelity could be inflated by this mismatch rather than reflecting genuine differences in internal reasoning.
[Results] Results (access-validity inversion): the observation that white-box signals are stable across models but only weakly predictive of causal attributions is presented as supporting evidence, but it is not used to test or bound the potential discrepancy between LOO and ablation quantities on the same model; without such a test the central claim that prediction fidelity overstates attribution fidelity remains vulnerable to methodological artifact.

minor comments (2)

[Methods] The manuscript does not report the precise data exclusion rules, tokenization details, or statistical tests used to establish the 'consistent patterns' across the eleven models; these details are needed to assess whether post-hoc choices affect the attribution-fidelity gap.
[Figures/Tables] Figure captions and table legends should explicitly state the exact perturbation used for each attribution method (e.g., token deletion vs. masking) so readers can evaluate comparability without returning to the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major point below, indicating where revisions will be made to strengthen the work.

Referee: [Methods] Methods (attribution and ablation procedures): leave-one-out attributions computed via full forward passes on open models are treated as directly comparable to black-box input-ablation effects on closed models when quantifying attribution fidelity, yet the manuscript provides no cross-method calibration experiment on any model where both techniques can be run. Because the two procedures differ in their handling of token interactions, position sensitivity, and normalization, the reported gap between prediction and attribution fidelity could be inflated by this mismatch rather than reflecting genuine differences in internal reasoning.

Authors: We agree that the absence of a direct cross-method calibration on models permitting both leave-one-out and ablation leaves open the possibility that some portion of the observed attribution gap arises from procedural differences rather than model-internal differences. In the revised manuscript we will add a calibration experiment on open models (where both methods are feasible) to quantify the method-induced discrepancy in attribution scores and to bound its contribution to the reported prediction-versus-attribution fidelity gap. revision: yes
Referee: [Results] Results (access-validity inversion): the observation that white-box signals are stable across models but only weakly predictive of causal attributions is presented as supporting evidence, but it is not used to test or bound the potential discrepancy between LOO and ablation quantities on the same model; without such a test the central claim that prediction fidelity overstates attribution fidelity remains vulnerable to methodological artifact.

Authors: The access-validity inversion already demonstrates that highly stable white-box signals (attention, perturbation magnitudes) are only weakly aligned with the causal effects recovered by black-box ablations. This pattern is consistent with the claim that attribution differences reflect genuine differences in causal reasoning rather than being driven primarily by the choice of attribution procedure. We will revise the relevant section to make this linkage explicit and to note how the inversion results provide an indirect bound on the methodological artifact. revision: partial

Circularity Check

0 steps flagged

No circularity in empirical comparison

The paper conducts an empirical study measuring prediction fidelity, attribution fidelity, and representation fidelity by directly comparing outputs and attributions across open and closed models on binary classification tasks. Leave-one-out attributions and input ablations are applied as experimental procedures without any fitted parameters being redefined as predictions, without self-citation load-bearing on core claims, and without any self-definitional equations or ansatzes. The reported findings (prediction fidelity overstating attribution fidelity, access-validity inversion) are presented as outcomes of these measurements rather than derivations that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical measurement study; it introduces no new free parameters, axioms beyond standard ML assumptions, or invented entities.

Reference graph

Works this paper leans on

70 extracted references · 18 canonical work pages · 11 internal anchors

[1] Causality. 2009.
[2] Quantifying attention flow in transformers. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
[3] BoolQ: Exploring the surprising difficulty of natural yes/no questions. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
[4] Adversarial NLI: A new benchmark for natural language understanding. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
[5] WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM. 2021.
[6] Tests for comparing elements of a correlation matrix. Psychological Bulletin. 1980.
[7] Attention is not explanation. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
[8] Attention is not not explanation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
[9] Language models can explain neurons in language models. 2023.
[10] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. 2024.
[11] Is attention explanation? An introduction to the debate. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[12] Zoom in: An introduction to circuits. Distill.
[13] A mathematical framework for transformer circuits. Transformer Circuits Thread.
[14] Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. arXiv preprint arXiv:2211.00593.
[15] Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems.
[16] Localizing Model Behavior with Path Patching. arXiv preprint arXiv:2304.05969.
[17] Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread.
[18] Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv preprint arXiv:2309.08600.
[19] Is attention interpretable? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
[20] The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP.
[21] Visualizing and understanding convolutional networks. European Conference on Computer Vision. 2014.
[22] "Why should I trust you?" Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[23] A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems.
[24] Causal abstractions of neural networks. Advances in Neural Information Processing Systems.
[25] Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems.
[26] Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems.
[27] Similarity of neural network representations revisited. International Conference on Machine Learning. 2019.
[28] SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems.
[29] Representational similarity analysis-connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience. 2008.
[30] The Platonic Representation Hypothesis. arXiv preprint arXiv:2405.07987.
[31] Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110.
[32] Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.
[33] Rethink reporting of evaluation results in AI. Science. 2023.
[34]

Distilling the Knowledge in a Neural Network

Distilling the knowledge in a neural network , author=. arXiv preprint arXiv:1503.02531 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Steering Language Models With Activation Engineering

Steering language models with activation engineering , author=. arXiv preprint arXiv:2308.10248 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Steering Llama 2 via contrastive activation addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[37]

Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems.
[38]

Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics.
[39]

A structural probe for finding syntax in word representations. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
[40]

A is for absorption: Studying feature splitting and absorption in sparse autoencoders. arXiv preprint arXiv:2409.14507.
[41]

Sparse autoencoders do not find canonical units of analysis. arXiv preprint arXiv:2502.04878.
[42]

Using Captum to explain generative language models. Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), 2023.
[43]

Transcoders find interpretable LLM feature circuits. Advances in Neural Information Processing Systems.
[44]

Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896.
[45]

Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread.
[46]

Ameisen, Emmanuel and Lindsey, Jack and Pearce, Adam and Gurnee, Wes and Turner, Nicholas L. and Chen, Brian and Citro, Craig and Abrahams, David and Carter, Shan and Hosmer, Basil and Marcus, Jonathan and Sklar, Michael and Templeton, Adly and Bricken, Trenton and McDougall, Callum and Cunningham, Hoagy and Henighan, Thomas and Jermyn, Adam and Jones, An...
[47]

Nauta, Meike and Trienes, Jan and Pathak, Shreyasi and Nguyen, Elisa and Peters, Michelle and Schmitt, Yasmin and Schlötterer, Jörg and van Keulen, Maurice and Seifert, Christin. From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI. ACM Computing Surveys. doi:10.1145/3583558
[48]

Qwen Technical Report. arXiv preprint arXiv:2309.16609.
[49]

Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020.
[50]

The LAMBADA dataset: Word prediction requiring a broad discourse context. 2016.
[51]

nostalgebraist. 2020.
[52]

Eliciting Latent Predictions from Transformers with the Tuned Lens. 2025.
[53]

LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
[54]

GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
[55]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261.
[56]

Stealing part of a production language model. arXiv preprint arXiv:2403.06634.
[57]

Interpreting neural networks through the polytope lens. arXiv preprint arXiv:2211.12312.
[58]

Eliciting Latent Predictions from Transformers with the Tuned Lens. 2023.
[59]

Understanding black-box predictions via influence functions. International Conference on Machine Learning, 2017.
[60]

ContextCite: Attributing model generation to context. Advances in Neural Information Processing Systems.
[61]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. 2026.
[62]

Multi-Level Explanations for Generative Language Models. 2025.
[63]

Explaining Large Language Models with gSMILE. 2025.
[64]

AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution. 2025.
[65]

Predicting LLM Reasoning Performance with Small Proxy Model. 2026.
[66]

The Linear Representation Hypothesis and the Geometry of Large Language Models. 2024.
[67]

A unifying tool for linear multivariate statistical methods: the RV-coefficient. Journal of the Royal Statistical Society Series C: Applied Statistics, 1976.
[68]

Lai, Guokun and Xie, Qizhe and Liu, Hanxiao and Yang, Yiming and Hovy, Eduard. RACE: Large-scale ReAding Comprehension Dataset From Examinations. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. doi:10.18653/v1/D17-1082
[69]

Extensions of Lipschitz mappings into Hilbert space. Contemporary Mathematics.
[70]

Brittlebench: Quantifying LLM robustness via prompt sensitivity. 2026.
