Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision
Pith reviewed 2026-05-10 15:37 UTC · model grok-4.3
The pith
Cross-layer transcoders reconstruct Vision Transformer post-MLP activations from preceding layers to produce a linear, layer-resolved decomposition of the final representation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cross-Layer Transcoders (CLTs) are encoder-decoder networks trained to reconstruct each post-MLP activation in a ViT from learned sparse embeddings of all preceding layers. When applied to CLIP ViT-B/32 and ViT-B/16 on CIFAR-100, COCO, and ImageNet-100, the resulting linear decomposition achieves high reconstruction fidelity while preserving or improving zero-shot classification accuracy. The associated cross-layer contribution scores identify a small subset of dominant layer-wise terms; retaining only those terms largely preserves performance, whereas their removal causes clear degradation.
What carries the argument
The Cross-Layer Transcoder, an encoder-decoder model that maps preceding-layer sparse embeddings to a reconstruction of a target post-MLP activation, thereby converting the final ViT representation into an explicit sum of layer-specific vectors for attribution.
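To make that additive structure concrete, here is a minimal PyTorch sketch of a cross-layer transcoder, assuming one linear encoder per source layer, a top-k sparsifier, and one linear decoder per (source, target) pair. All class and method names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Sketch: reconstruct each post-MLP activation at target layer l
    as an additive sum of decoded sparse codes from source layers s <= l."""

    def __init__(self, n_layers: int, d_model: int, d_sparse: int, k: int = 32):
        super().__init__()
        self.k = k
        # One encoder per source layer: activation -> sparse code.
        self.encoders = nn.ModuleList(
            [nn.Linear(d_model, d_sparse) for _ in range(n_layers)]
        )
        # One decoder per (source s, target l) pair with s <= l.
        self.decoders = nn.ModuleDict({
            f"{s}_{l}": nn.Linear(d_sparse, d_model, bias=False)
            for l in range(n_layers) for s in range(l + 1)
        })

    def sparsify(self, z: torch.Tensor) -> torch.Tensor:
        # Illustrative ReLU-top-k; the paper also trains JumpReLU
        # and abs-top-k variants.
        z = torch.relu(z)
        idx = torch.topk(z, self.k, dim=-1).indices
        return torch.zeros_like(z).scatter(-1, idx, z.gather(-1, idx))

    def forward(self, acts: list[torch.Tensor]) -> list[torch.Tensor]:
        # acts[s]: (n_tokens, d_model) input activation at source layer s.
        codes = [self.sparsify(enc(a)) for enc, a in zip(self.encoders, acts)]
        recons = []
        for l in range(len(acts)):
            terms = [self.decoders[f"{s}_{l}"](codes[s]) for s in range(l + 1)]
            # The reconstruction is an explicit, layer-resolved sum.
            recons.append(torch.stack(terms).sum(dim=0))
        return recons
```

Because each target reconstruction is a plain sum over source-layer terms, the final-layer output decomposes into exactly the per-layer vectors the attribution scores rank.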
If this is right
- CLTs can act as drop-in, sparse proxies for ViT MLP blocks without harming downstream zero-shot performance.
- The linear decomposition supplies per-layer attribution scores that correctly predict the performance impact of keeping or discarding individual terms.
- Final ViT representations concentrate their useful signal in a minority of dominant layer contributions rather than distributing it uniformly across depth.
- Selective retention of only the dominant terms in the decomposition recovers most of the original classification accuracy.
- Non-dominant layer terms can be ablated with only minor loss, confirming they carry limited unique information for the final output (a score-guided ablation is sketched below).
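A minimal sketch of that score-guided ablation, assuming the final embedding is available as a stack of per-layer additive terms plus a vector of contribution scores (hypothetical tensor layouts, not the paper's code):

```python
import torch

def ablate_by_score(terms: torch.Tensor, scores: torch.Tensor, keep: int) -> torch.Tensor:
    """terms: (n_layers, d) additive per-layer terms of the final embedding.
    scores: (n_layers,) cross-layer contribution scores.
    Rebuilds the embedding from only the `keep` highest-scoring terms."""
    top = torch.topk(scores, keep).indices
    mask = torch.zeros_like(scores).scatter(0, top, 1.0)
    return (terms * mask.unsqueeze(-1)).sum(dim=0)
```

Inverting the mask gives the complementary test: rebuilding from only the non-dominant terms should show the clear degradation the paper reports.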
Where Pith is reading between the lines
- The same CLT construction could be applied to language-model transformers to expose which layers dominate next-token prediction.
- Dominant-layer concentration suggests a possible route to layer-pruning or early-exit strategies that keep accuracy while lowering compute.
- Faithful per-layer scores might support targeted interventions such as editing or regularizing specific depths for robustness or fairness.
- If the scores generalize, they could serve as a diagnostic for whether a ViT has learned redundant or shallow representations across its depth.
Load-bearing premise
High-fidelity reconstruction of post-MLP activations from preceding-layer sparse embeddings is enough to guarantee that the derived contribution scores faithfully reflect the original model's layer-wise computation and decision process.
What would settle it
A decisive test: zero out each layer's term individually and compare the measured zero-shot accuracy drop against that layer's contribution score (a probe is sketched below). High-scoring layers whose removal leaves accuracy essentially unchanged, or low-scoring layers whose removal sharply drops it, would break the faithfulness claim; close agreement between score and impact would support it.
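A sketch of that probe, assuming per-image layer terms and a zero-shot classifier wrapped as a `classify` callable (both hypothetical interfaces):

```python
import torch
from scipy.stats import spearmanr

@torch.no_grad()
def faithfulness_probe(terms: torch.Tensor, scores: torch.Tensor, classify):
    """terms: (n_layers, n_images, d) per-layer terms of each final embedding.
    scores: (n_layers,) contribution scores.
    classify: maps (n_images, d) embeddings -> zero-shot accuracy (float).
    Returns the Spearman correlation between score and accuracy drop."""
    base = classify(terms.sum(dim=0))
    drops = []
    for l in range(terms.shape[0]):
        kept = terms.clone()
        kept[l] = 0.0  # remove this layer's additive term only
        drops.append(base - classify(kept.sum(dim=0)))
    rho, _ = spearmanr(scores.tolist(), drops)
    return rho  # high rho: scores track causal impact; low rho: they do not
```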
Original abstract
Understanding the internal activations of Vision Transformers (ViTs) is critical for building interpretable and trustworthy models. While Sparse Autoencoders (SAEs) have been used to extract human-interpretable features, they operate on individual layers and fail to capture the cross-layer computational structure of Transformers, as well as the relative significance of each layer in forming the last-layer representation. Alternatively, we introduce the adoption of Cross-Layer Transcoders (CLTs) as reliable, sparse, and depth-aware proxy models for MLP blocks in ViTs. CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, yielding a linear decomposition that transforms the final representation of ViTs from an opaque embedding into an additive, layer-resolved construction that enables faithful attribution and process-level interpretability. We train CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100. We show that CLTs achieve high reconstruction fidelity with post-MLP activations while preserving and even improving, in some cases, CLIP zero-shot classification accuracy. In terms of interpretability, we show that the cross-layer contribution scores provide faithful attribution, revealing that the final representation is concentrated in a smaller set of dominant layer-wise terms whose removal degrades performance and whose retention largely preserves it. These results showcase the significance of adopting CLTs as an alternative interpretable proxy of ViTs in the vision domain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Cross-Layer Transcoders (CLTs) as sparse, depth-aware proxy models for MLP blocks in Vision Transformers. CLTs reconstruct post-MLP activations from learned sparse embeddings of preceding layers via an encoder-decoder scheme, producing a linear decomposition of the final representation into additive layer-resolved terms. This enables attribution via cross-layer contribution scores. Experiments train CLTs on CLIP ViT-B/32 and ViT-B/16 using CIFAR-100, COCO, and ImageNet-100, reporting high reconstruction fidelity, preservation or improvement of zero-shot classification accuracy, and ablation results indicating that the final representation concentrates in a smaller set of dominant layer-wise terms.
Significance. If the contribution scores are validated as faithful to the ViT's internal computations, the approach could provide a useful tool for cross-layer mechanistic interpretability in vision transformers, extending beyond per-layer SAEs by capturing relative layer significance in forming the final embedding. The multi-dataset training and downstream accuracy preservation are strengths that would support practical adoption as an interpretable proxy.
major comments (1)
- [Abstract] Abstract: The claim that 'the cross-layer contribution scores provide faithful attribution' is load-bearing for the interpretability contribution, yet the described procedure trains the CLT as an independent sparse proxy rather than deriving it from ViT weights or gradients. High reconstruction fidelity and ablation results (high-scoring terms degrade performance when removed) do not rule out an alternative factorization that does not match the actual additive contributions inside the transformer blocks. This leaves the faithfulness of the attribution open to question.
minor comments (2)
- The abstract references high fidelity and accuracy preservation but supplies no specific quantitative metrics, error bars, or ablation statistics; including these would allow readers to evaluate the strength of the empirical claims.
- Training details such as the exact loss function, sparsity penalty, hyperparameter choices, and layer selection criteria are not described in the abstract; adding them would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for identifying the need to strengthen the justification for our interpretability claims. We address the major comment point by point below.
Point-by-point responses
Referee: [Abstract] The claim that 'the cross-layer contribution scores provide faithful attribution' is load-bearing for the interpretability contribution, yet the described procedure trains the CLT as an independent sparse proxy rather than deriving it from ViT weights or gradients. High reconstruction fidelity and ablation results (high-scoring terms degrade performance when removed) do not rule out an alternative factorization that does not match the actual additive contributions inside the transformer blocks. This leaves the faithfulness of the attribution open to question.
Authors: We agree that training CLTs as independent proxies (rather than extracting coefficients directly from ViT weights or gradients) means the resulting decomposition is not a direct mechanistic trace of the original transformer blocks. The high reconstruction fidelity demonstrates that the CLT accurately approximates the post-MLP activations, and the linear encoder-decoder structure then yields an additive decomposition of the final representation into layer-resolved terms. The ablation experiments show that high-scoring terms are empirically important: their removal degrades zero-shot accuracy while low-scoring terms can be dropped with little effect. These results establish that the scores provide a faithful attribution with respect to the reconstructed representation, even if they do not necessarily recover the precise internal additive contributions of the original ViT. We will revise the abstract, introduction, and discussion to explicitly qualify the scope of "faithful attribution" as applying to the CLT proxy model and to note the distinction from gradient- or weight-based attributions. No new experiments are required for this clarification.
Revision: partial
Circularity Check
CLT training and attribution are empirical; no derivation reduces to self-defined inputs
full rationale
The paper describes an empirical training procedure for Cross-Layer Transcoders that reconstruct post-MLP activations from preceding-layer sparse codes, followed by ablation-based validation of contribution scores on downstream CLIP accuracy. No equations or claims are presented that define the linear decomposition or contribution scores in terms of themselves or via a self-citation chain that would force the result. The central interpretability argument rests on observed reconstruction fidelity and ablation effects measured against external benchmarks (zero-shot classification), which are independent of the fitted parameters. This is a standard non-circular empirical proxy construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sparse encoder-decoder models trained on preceding layers can faithfully reconstruct post-MLP activations in ViTs.
Supplementary material excerpts
Visually Explainable Cross-Layer Contribution. The contribution scores C_{s→ℓ} quantify which layers matter, but not what they represent. To obtain example-based explanations of cross-layer influence, we provide some evidence of a retrieval-based framework that operates directly on the CLT sparse codes. Recall that a target post-MLP representation ŷ_L at the...
Additional Metrics for CLTs' Faithfulness. To provide further evidence of the CLT's faithfulness to the original model, we evaluate distributional alignment (KL divergence, flip rate), top-k agreement, embedding geometry (cosine similarity, CKA, Spearman correlation), and prompt sensitivity across 18 templates for the ViT-B/32 model on CIFAR-100. Table 5 shows that in the regimes the CLTs faithfu...
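A sketch of the distributional-alignment side of those metrics, assuming zero-shot logits from the original and CLT-replaced models are available as plain tensors (CKA and prompt-sensitivity checks are omitted here):

```python
import torch
import torch.nn.functional as F

def alignment_metrics(logits_orig: torch.Tensor, logits_clt: torch.Tensor, k: int = 5):
    """logits_*: (n_images, n_classes) zero-shot logits from the original
    ViT and from the CLT-replaced model, for one evaluation batch."""
    p = F.log_softmax(logits_orig, dim=-1)
    q = F.log_softmax(logits_clt, dim=-1)
    # KL(original || CLT-replaced), averaged over images.
    kl = F.kl_div(q, p, log_target=True, reduction="batchmean")
    # Flip rate: fraction of images whose top-1 prediction changes.
    flip = (logits_orig.argmax(-1) != logits_clt.argmax(-1)).float().mean()
    # Top-k agreement: fraction of original top-k labels the proxy recovers.
    topk_o = logits_orig.topk(k, dim=-1).indices
    topk_c = logits_clt.topk(k, dim=-1).indices
    agree = (topk_o.unsqueeze(-1) == topk_c.unsqueeze(-2)).any(-1).float().mean()
    return {"kl": kl.item(), "flip_rate": flip.item(), f"top{k}_agreement": agree.item()}
```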
Training Details. Teacher models and datasets. All Cross-Layer Transcoders (CLTs) are trained on top of frozen CLIP image encoders with ViT-B/32 and ViT-B/16 backbones. For each backbone, we consider three datasets: CIFAR-100, COCO, and ImageNet-100. For every (dataset, backbone) pair we train three separate CLT variants, one for each sparsifier: JumpReLU, ...
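The three sparsifiers named in these excerpts have standard forms; a sketch follows, with placeholder threshold and k values (the paper's JumpReLU threshold is presumably learned per unit rather than fixed):

```python
import torch

def jump_relu(z: torch.Tensor, theta: float = 0.03) -> torch.Tensor:
    # JumpReLU: pass a value through only if it exceeds a threshold.
    return z * (z > theta)

def relu_topk(z: torch.Tensor, k: int) -> torch.Tensor:
    # ReLU-top-k: keep the k largest non-negative activations.
    z = torch.relu(z)
    idx = torch.topk(z, k, dim=-1).indices
    return torch.zeros_like(z).scatter(-1, idx, z.gather(-1, idx))

def abs_topk(z: torch.Tensor, k: int) -> torch.Tensor:
    # Abs-top-k: keep the k largest-magnitude activations, signs intact.
    idx = torch.topk(z.abs(), k, dim=-1).indices
    return torch.zeros_like(z).scatter(-1, idx, z.gather(-1, idx))
```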
Reconstruction Accuracy across Layers. In Figures 6–10, we report the reconstruction quality of Cross-Layer Transcoders (CLTs) across all transformer layers on three datasets (CIFAR-100, COCO, and ImageNet-100) and two CLIP backbones (ViT-B/32 and ViT-B/16). For each configuration, we compare three sparsity variants, i.e., JumpReLU, ReLU-Top-k, and Abs-Top-k, using cosine similarity, mean squared error (log scale), and variance explained (R²), averaged over all tokens in the test set.
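The three reported reconstruction metrics, sketched for one layer's activations (tensor shapes assumed, not taken from the paper's code):

```python
import torch
import torch.nn.functional as F

def recon_metrics(y: torch.Tensor, y_hat: torch.Tensor):
    """y, y_hat: (n_tokens, d) true and reconstructed post-MLP activations."""
    cos = F.cosine_similarity(y, y_hat, dim=-1).mean()          # per-token cosine
    mse = (y - y_hat).pow(2).mean()                             # mean squared error
    # Variance explained: 1 - residual sum of squares / total sum of squares.
    r2 = 1 - (y - y_hat).pow(2).sum() / (y - y.mean(0)).pow(2).sum()
    return {"cosine": cos.item(), "mse": mse.item(), "r2": r2.item()}
```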
Classification Accuracy under Cascaded CLT Replacement. We report top-1 classification accuracy (%) under cascaded CLT replacement across all layers (s→11) in Figures 6–23. For each dataset (CIFAR-100, COCO, ImageNet-100) and ViT backbone (ViT-B/32, ViT-B/16), we evaluate three sparsity mechanisms: JumpReLU (JR), ReLU-Top-k (RTK), and Abs-Top-k (ATK), under three token settings (CLS-only, patch-only, and all tokens). The replacement is performed progressively from early to late layers (s = 0 to s = 11), and results are compared to the frozen ViT baseline. We observe that...
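A sketch of that cascaded evaluation using forward hooks, assuming the openai/CLIP module layout (`model.visual.transformer.resblocks[l].mlp`); `clt_recon` and `zero_shot_acc` are hypothetical callables standing in for the paper's machinery:

```python
import torch

@torch.no_grad()
def cascaded_accuracy_curve(model, clt_recon, loader, zero_shot_acc, n_layers=12):
    """For each depth s, replace the post-MLP outputs of layers 0..s with
    their CLT reconstructions via forward hooks, then measure accuracy."""
    accs = []
    for s in range(n_layers):
        handles = [
            model.visual.transformer.resblocks[l].mlp.register_forward_hook(
                # Returning a value from a forward hook overrides the output.
                lambda mod, inp, out, l=l: clt_recon(l, inp[0])
            )
            for l in range(s + 1)
        ]
        accs.append(zero_shot_acc(model, loader))
        for h in handles:
            h.remove()
    return accs  # one top-1 accuracy per replacement depth s = 0..n_layers-1
```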
Cross-Layer Contribution Scores. To better understand the internal attribution structure of Cross-Layer Transcoders (CLTs), we visualize in Figures 11–16 the contribution scores C_{s→ℓ}, which quantify the influence of each source layer s on the reconstruction of activations at target layer ℓ. These heatmaps reveal a clear depth-aware structure across all ...