
arxiv: 2605.31304 · v1 · pith:E7SKZ3SJnew · submitted 2026-05-29 · 💻 cs.LG · cs.CV

Interpretability Without Tradeoffs: Disentangling Polysemanticity At Equal Predictive Performance

Pith reviewed 2026-06-28 23:11 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords interpretability · disentanglement · polysemanticity · monosemantic features · neural networks · vision models · lossless methods · representation steering

The pith

ELUDe disentangles polysemantic neurons into monosemantic features with no change to model predictions or accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep neural networks often have individual neurons encoding multiple unrelated concepts, which obscures how decisions are made. ELUDe addresses this by reorganizing information flow between layers to isolate those concepts into separate, inspectable units. The method guarantees the network's outputs remain exactly the same as the original, with no retraining or labels required. This matters because earlier disentanglement techniques typically traded off some predictive performance for better clarity. A sympathetic reader would view it as a way to gain actionable insights into existing models at zero functional cost.

Core claim

ELUDe is an explicit, lossless, unsupervised method that breaks latent representations in deep neural networks into clear sub-units behaving like interpretable features. It achieves this by re-routing concept-specific contributions between layers while preserving the original computation exactly by construction. The approach applies to pretrained models such as DINOv2 and ViT-B/16, improves interpretability, keeps downstream accuracy unchanged, and supports uses like steering representations.

What carries the argument

ELUDe, which separates polysemantic signals by explicit re-routing of concept-specific contributions between layers while preserving exact functional equivalence.
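The additive idea behind this claim can be illustrated with a small sketch. It assumes, purely for illustration, that the incoming weights of one linear unit are partitioned into groups and that each sub-unit keeps only the weights of its group; the function names and the grouping are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: split one linear unit into K additive sub-units.
# The grouping of upstream inputs is an assumption made for illustration,
# not the procedure from the paper.

def unit_preactivation(weights, inputs, bias):
    """Pre-activation of a single linear unit: w . x + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def split_unit(weights, groups):
    """Return one weight vector per group; weights outside the group are zeroed.

    `groups` must partition the input indices so that every weight lands in
    exactly one sub-unit.
    """
    n = len(weights)
    seen = sorted(i for g in groups for i in g)
    if seen != list(range(n)):
        raise ValueError("groups must partition the input indices")
    return [[weights[i] if i in g else 0.0 for i in range(n)] for g in groups]


weights = [0.5, -1.25, 2.0, 0.75]
inputs = [1.0, 2.0, -0.5, 4.0]
bias = 0.25

sub_weights = split_unit(weights, groups=[{0, 3}, {1, 2}])
# The bias is carried by the first sub-unit only, so it is counted once.
sub_acts = [
    unit_preactivation(w_k, inputs, bias if k == 0 else 0.0)
    for k, w_k in enumerate(sub_weights)
]
original = unit_preactivation(weights, inputs, bias)
```

Here `sum(sub_acts)` equals `original`, which is the sense in which such a split leaves the unit's total unchanged. How ELUDe actually chooses the groups is not specified in the text above.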

If this is right

  • Interpretability improves on vision models including DINOv2 and supervised ViT-B/16 with no accuracy change.
  • The method can be applied directly to pretrained models without labels or retraining.
  • Representation steering becomes possible as a practical downstream use.
  • Functional equivalence holds by design, so performance metrics remain identical.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The re-routing principle might generalize to language models if concept contributions can be similarly isolated.
  • Pairing ELUDe with other post-hoc methods could produce even finer-grained control over features.
  • The lossless property would enable direct comparisons of interpretability before and after under identical conditions.
  • Efficiency on very large models would hinge on scalable ways to detect the concept contributions.

Load-bearing premise

That concept-specific contributions can be identified and rerouted between layers to fully separate polysemantic signals into monosemantic features without altering the overall network computation.

What would settle it

Any observable difference in the model's output predictions or accuracy on a held-out test set after applying ELUDe would falsify the claim of lossless disentanglement.
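A check of this kind could be sketched as follows. The two model callables are toy stand-ins invented for this example, and `compare_models` is a hypothetical helper, not part of any released ELUDe code.

```python
# Hypothetical harness: `original_model` and `disentangled_model` are
# stand-in callables that map an input to a list of logits.

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def compare_models(original_model, disentangled_model, inputs):
    """Return (max absolute logit difference, number of changed predictions)."""
    max_abs_diff = 0.0
    changed = 0
    for x in inputs:
        ref = original_model(x)
        new = disentangled_model(x)
        max_abs_diff = max(max_abs_diff, max(abs(a - b) for a, b in zip(ref, new)))
        changed += argmax(ref) != argmax(new)
    return max_abs_diff, changed


# Toy stand-ins: the second model computes the same logits from two parts.
def original_model(x):
    return [2.0 * x, 1.0 - x]


def disentangled_model(x):
    return [1.5 * x + 0.5 * x, (0.5 - x) + 0.5]


report = compare_models(original_model, disentangled_model, [0.0, 0.5, 2.0, -4.0])
```

On real models, a nonzero but tiny maximum difference would more likely indicate a changed floating-point summation order than a changed computation, so reporting both the raw difference and the number of changed predictions helps separate the two cases. Whether an ELUDe implementation is bitwise identical is not stated in the text above.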

Figures

Figures reproduced from arXiv: 2605.31304 by Bernt Schiele, Doğukan Bağcı, Jonas Fischer, Robin Hesse, Simone Schaub-Meyer.

Figure 1. ELUDe for interpretable disentanglement. ELUDe decomposes a polysemantic unit at layer L into more monosemantic sub-units by restructuring incoming weights so that each sub-unit captures only a specific semantic concept. The sum of all sub-units exactly recovers the original unit activation, ensuring perfect faithfulness. On the right, we provide a concrete example of this process by showing highly activat…
Figure 2. Interpretability-faithfulness tradeoff. We compare faithfulness and interpretability across disentanglement methods; marker size indicates the expansion factor. Existing approaches exhibit a clear Pareto front, where higher faithfulness typically comes at the cost of lower interpretability. In contrast, ELUDe improves interpretability without sacrificing faithfulness, substantially advancing the Pareto …
Figure 3. Qualitative examples. Top-activating images from the ImageNet-1k validation split for a polysemantic parent unit from the last layer of DINOv2-B/14 (yellow frame, left; top-16) and the 8 ELUDe subunits it decomposes into (pink frames, right; top-4 each). Labels are annotated by hand. Please refer to Section D for additional qualitative examples. Our method decomposes a potentially polysemantic unit into a …
Figure 4. Overview of ELUDe. (1) Embed every image x^(i) in the dataset D through layer L and consider the resulting activation instances ℓ (e.g., spatial locations or tokens); select the top-k instances that activate unit u most strongly. (2) For each selected activation instance, compute a normalized contribution vector c^(ℓ) from layer L−1 to a_u^(ℓ), revealing which upstream units drive the activation. Cluster …
Figure 5. Steering an ELUDe-disentangled LLaVA-OneVision. Captions generated under two interventions on layer 26 of the disentangled vision encoder, all using the same COCO image [37], prompt, and greedy decoder. “−” and “+” denote removing and adding the concept, respectively. (higher = more stable) [22, 49]; see Section B.3 for details. ELUDe achieves the highest cross-init stability, even outperforming ASAE [22],…
Figure 6. Example pairwise grid shown to a VLM judge. Each panel is a 3×3 collage of the top-9 most-activating images for one disentangled unit; two such panels from competing methods (left: ELUDe, right: Archetypal SAE) are placed side by side and the three judges select the more coherent grid. Beneath the grids, each judge’s per-grid reasoning and the resulting consensus winner are shown. The example is from DINOv2-…
Figure 7. Steering example 2: independent foreground and background edits on the train scene. Settings (α, γ, κ, ν) per row: train suppression (5, −50, 50, all patch tokens); train→bus (50, −2, 100, 50); graffiti→clean (15, −15, 200, 50). The subject and the background style are modified from the same ELUDe layer without cross-interference, with the foreground edit preserving the graffiti wall and the background edit …
Figure 8. Steering example 3: surface, subject, and suppression edits on the giant-teddy scene. Settings (α, γ, κ, ν) per row: chair→table (5, −5, 50, 25); teddy→dog (5, −50, 100, 50); teddy suppression (5, −25, 200, 100). The seat-surface concept is separable from the toy, and the surrounding store context (stand, passers-by) is preserved under both subject substitution and pure suppression. intervention is reporte…
Figure 9. Steering example 4: object swap, object removal, and background transplant on the action shot. Settings (α, γ, κ, ν) per row: frisbee→ball (15, −0.5, 200, all patch tokens); frisbee/ball suppression with jumping insertion (25, −10, 50, 250); indoor→beach/park (25, −25, 200, 250). The dog and the leaping pose persist across all three rows, indicating that pose-related units are decoupled from both the held o…
Figure 10. Steering example 5: subject substitution and a disentanglement boundary on the cats-on-blanket scene. Settings (α, γ, κ, ν) per row: cat→baby (5, −10, 200, all patch tokens); cat→bear (15, −10, 100, all patch tokens); cat and blanket suppression (5, −15, 50, all patch tokens). The cat→baby substitution preserves the blanket and the accompanying prop, while the cat→bear request returns a red panda and replaces o…
Figure 11. K = 1: an already-monosemantic unit. HDBSCAN finds only a single cluster in the unit’s contribution-vector set, so ELUDe leaves the unit unchanged. Extracted panel labels: Original Neuron 28; Reptiles & arthropods; Large dogs.
Figure 12. 2-way ELUDe decomposition. A polysemantic DINOv2-B/14 unit (yellow frame, left; top-16) decomposed by ELUDe into 2 subunits (pink frames, right; top-4 each). features and z̃_sk their pre-activations; maximizing this term encourages dormant features to activate) in all families. The Vanilla SAE additionally varies the ℓ1-sparsity weight w_ℓ1 ∈ {10^−3, 5·10^−3, 2·10^−2}, and the JumpReLU SAE varies its spa…
Figure 13. 3-way ELUDe decomposition. A polysemantic DINOv2-B/14 unit (yellow frame, left; top-16) decomposed by ELUDe into 3 subunits (pink frames, right; top-4 each). Extracted panel labels: Original Neuron 286; Long-haired dogs; Primates; Drum kits; Insects.
Figure 14. 4-way ELUDe decomposition. A polysemantic DINOv2-B/14 unit (yellow frame, left; top-16) decomposed by ELUDe into 4 subunits (pink frames, right; top-4 each). reflects a conservative, unoptimized implementation: neurons are processed sequentially on a single GPU, top-k is set to 400,000 (the high end of our ablation range), and min_cluster_size = 1000 is chosen for cluster quality – reducing it substantiall…
Figure 15. 5-way ELUDe decomposition. A polysemantic DINOv2-B/14 unit (yellow frame, left; top-16) decomposed by ELUDe into 5 subunits (pink frames, right; top-4 each). Extracted panel labels: Original Neuron 45; Shoes; Vending machines; Clownfish; Beds; Office printers; Lizards.
Figure 16. 6-way ELUDe decomposition. A polysemantic DINOv2-B/14 unit (yellow frame, left; top-16) decomposed by ELUDe into 6 subunits (pink frames, right; top-4 each).
Figure 17. 7-way ELUDe decomposition. A polysemantic DINOv2-B/14 unit (yellow frame, left; top-16) decomposed by ELUDe into 7 subunits (pink frames, right; top-4 each). … dimensionality reduction. The ablations in Section C.5 sweep each of these one at a time around this baseline. E.8 Seeds. For SAE and transcoder baselines, training initialisation uses seed 42 for the headline run and {7, 1024} for the cross-seed stability…
Original abstract

Deep neural networks (DNNs) are widely used, but interpreting what they actually learn remains difficult. A major obstacle is that individual neurons often encode multiple unrelated concepts, obscuring the decision process of the network. While prior work, such as sparse autoencoders, can separate these mixed signals into more meaningful, "monosemantic" features, this typically requires altering the model in ways that can degrade downstream performance. To overcome this, we introduce ELUDe (explicit, lossless, unsupervised disentanglement), a method for improving the interpretability of DNNs while preserving their functional equivalence. ELUDe breaks latent representations into clear, inspectable sub-units that behave like interpretable features, while guaranteeing that the model's outputs remain exactly the same. It requires no explicit training, no labels, and can be applied to pretrained models. ELUDe works by reorganizing how information flows between layers, re-routing concept-specific contributions while preserving the original computation by construction. Across several vision models, including DINOv2 and supervised ViT-B/16, ELUDe improves interpretability, keeps downstream accuracy unchanged, runs efficiently, and supports practical uses such as steering model representations. In short, ELUDe offers interpretability (almost) without a tradeoff: clearer, scalable, and actionable model insights with no loss in performance.
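The caption of Figure 4 describes the first steps of ELUDe: select the top-k activation instances of a unit, then compute a normalized contribution vector from layer L−1 for each selected instance. A minimal pure-Python sketch of those two steps, under stated assumptions, might look like the following. The L1 normalization, the function names, and the final grouping by dominant upstream unit are illustrative stand-ins; the figure captions mention HDBSCAN for the clustering step, which is not reproduced here.

```python
# Hypothetical sketch of top-k selection and contribution vectors for one
# linear unit u. Normalization and grouping are assumptions for illustration.

def unit_activation(weights, upstream):
    """Activation of unit u as a weighted sum of upstream activations."""
    return sum(w * h for w, h in zip(weights, upstream))


def top_k_instances(activations, k):
    """Indices of the k instances that activate the unit most strongly."""
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return order[:k]


def contribution_vector(weights, upstream):
    """Per-upstream-unit contributions w_j * h_j, scaled to unit L1 norm."""
    raw = [w * h for w, h in zip(weights, upstream)]
    norm = sum(abs(c) for c in raw)
    return [c / norm for c in raw] if norm > 0 else raw


def group_by_dominant_unit(vectors, instance_ids):
    """Stand-in for clustering: group instances by their largest contributor."""
    groups = {}
    for idx, vec in zip(instance_ids, vectors):
        dominant = max(range(len(vec)), key=lambda j: abs(vec[j]))
        groups.setdefault(dominant, []).append(idx)
    return groups


weights = [1.0, 2.0, -1.0]
upstream_acts = [
    [4.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [2.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
]
activations = [unit_activation(weights, h) for h in upstream_acts]
top = top_k_instances(activations, k=3)
vectors = [contribution_vector(weights, upstream_acts[i]) for i in top]
groups = group_by_dominant_unit(vectors, top)
```

In this toy data, the three strongest instances fall into two groups driven by different upstream units, which is the kind of structure a density-based clustering would be expected to pick up on real contribution vectors.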

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ELUDe, an unsupervised post-hoc method that reorganizes information flow between layers of pretrained DNNs (e.g., DINOv2, supervised ViT-B/16) by re-routing concept-specific contributions. It claims to produce monosemantic sub-units while guaranteeing exact functional equivalence to the original model (outputs unchanged by construction), with no training, no labels, and no degradation in downstream accuracy. The approach is positioned as enabling interpretability improvements and practical interventions such as representation steering.

Significance. If the lossless equivalence and monosemanticity claims hold, the result would be significant for interpretability research: it would remove the usual accuracy-interpretability tradeoff, allow direct application to existing models, and support downstream uses like steering without retraining. The absence of free parameters or invented entities in the high-level description is a strength if the re-routing mechanism is shown to be parameter-free and the equivalence is proven by construction.

major comments (3)
  1. [Methods (re-routing procedure)] The central claim of exact functional equivalence 'by construction' via re-routing requires an explicit derivation or proof in the methods section showing that the reorganized computation is mathematically identical to the original forward pass; without this, the 'lossless' guarantee cannot be verified and remains an assumption.
  2. [Experiments] Empirical verification that the resulting sub-units are monosemantic (rather than merely reorganized polysemantic features) is load-bearing for the interpretability claim; the experiments section should include quantitative metrics (e.g., feature activation sparsity, concept purity scores) with controls and statistical tests, not only qualitative examples.
  3. [Results (accuracy tables)] The claim that downstream accuracy remains exactly unchanged must be supported by reporting both mean and variance across multiple runs or seeds, plus a direct comparison to the unmodified baseline on the same evaluation protocol; any numerical difference, even if small, would contradict the 'exactly the same' guarantee.
minor comments (2)
  1. [Methods] Notation for the re-routing operation should be defined with explicit equations early in the methods section to avoid ambiguity when describing how contributions are identified and moved between layers.
  2. [Introduction] The abstract and introduction would benefit from a short related-work paragraph contrasting ELUDe with sparse autoencoders on the specific dimension of functional equivalence.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Methods (re-routing procedure)] The central claim of exact functional equivalence 'by construction' via re-routing requires an explicit derivation or proof in the methods section showing that the reorganized computation is mathematically identical to the original forward pass; without this, the 'lossless' guarantee cannot be verified and remains an assumption.

    Authors: We agree that an explicit derivation strengthens the presentation. In the revised manuscript we will add a formal proof in the Methods section. The re-routing decomposes each pre-activation into additive concept-specific contributions and reassigns them to dedicated sub-units; because the subsequent linear transformation receives exactly the same total input, the forward pass is identical by construction. revision: yes

  2. Referee: [Experiments] Empirical verification that the resulting sub-units are monosemantic (rather than merely reorganized polysemantic features) is load-bearing for the interpretability claim; the experiments section should include quantitative metrics (e.g., feature activation sparsity, concept purity scores) with controls and statistical tests, not only qualitative examples.

    Authors: We acknowledge that quantitative metrics would provide stronger support. While the current experiments rely on qualitative visualizations and steering demonstrations, we will add quantitative evaluations (activation sparsity, concept purity scores, and statistical comparisons against baseline features) in the revised Experiments section. revision: yes

  3. Referee: [Results (accuracy tables)] The claim that downstream accuracy remains exactly unchanged must be supported by reporting both mean and variance across multiple runs or seeds, plus a direct comparison to the unmodified baseline on the same evaluation protocol; any numerical difference, even if small, would contradict the 'exactly the same' guarantee.

    Authors: ELUDe is fully deterministic and contains no trainable parameters or stochastic operations. Because functional equivalence holds by construction, every output is identical to the original model for any input; therefore downstream accuracy is exactly the same with zero variance. We will clarify this point explicitly in the revised Results section and omit multiple-run statistics, as they are inapplicable. revision: no
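The additive argument in the first response can be written out explicitly. The following is a sketch under assumptions that are not confirmed by the text above: the index sets G_1, …, G_K are a hypothetical partition of the upstream units, the bias is assigned to the first sub-unit, and the next layer reads every sub-unit through a copy of the original outgoing weight v_{iu}.

```latex
\begin{align}
a_u &= \sum_{j} w_{uj}\, h_j + b_u, \\
a_u^{(k)} &= \sum_{j \in G_k} w_{uj}\, h_j + \delta_{k1}\, b_u, \qquad k = 1, \dots, K, \\
\sum_{k=1}^{K} a_u^{(k)} &= \sum_{j} w_{uj}\, h_j + b_u = a_u, \\
\sum_{k=1}^{K} v_{iu}\, a_u^{(k)} &= v_{iu} \sum_{k=1}^{K} a_u^{(k)} = v_{iu}\, a_u .
\end{align}
```

The last line holds only if no nonlinearity is applied to the sub-units individually before the linear map, or if their sum is re-formed first. The equality is exact in real arithmetic; in floating point, a different summation order can change the last bits, which is relevant to the referee's third comment.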

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents ELUDe as a reorganization of information flow between layers that re-routes concept-specific contributions while preserving the original computation exactly by construction. No equations, fitted parameters, or derivation chain appear in the provided abstract or description that would reduce any claimed prediction or result to its inputs by definition. The functional equivalence is asserted as a built-in property of the re-routing procedure rather than a statistically forced outcome from data fitting. No self-citation load-bearing steps, uniqueness theorems imported from prior author work, or ansatz smuggling are identifiable from the text. The central claim remains independent of the inputs and does not reduce to renaming or self-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no information on free parameters, axioms, or invented entities.



Reference graph

Works this paper leans on

78 extracted references · 13 canonical work pages · 8 internal anchors

  1. [1]

    R. Achtibat, S. M. V. Hatefi, M. Dreyer, A. Jain, T. Wiegand, S. Lapuschkin, and W. Samek. AttnLRP: Attention-aware layer-wise relevance propagation for transformers. In ICML, pages 135–168, 2024

  2. [2]

    D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, pages 3319–3327, 2017

  3. [3]

    L. Bereska and S. Gavves. Mechanistic interpretability for AI safety - A review. TMLR, 2024

  4. [4]

    M. Böhle, M. Fritz, and B. Schiele. B-cos networks: Alignment is all we need for interpretability. In CVPR, pages 10319–10328, 2022

  5. [5]

    On the Opportunities and Risks of Foundation Models

    R. Bommasani, D. A. Hudson, E. Adeli, R. B. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. S. Chatterji, A. S. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. E. Gill...

  6. [6]

    J. Borowski, R. S. Zimmermann, J. Schepers, R. Geirhos, T. S. A. Wallis, M. Bethge, and W. Brendel. Exemplary natural images explain CNN activations better than state-of-the-art feature visualization. In ICLR, 2021

  7. [7]

    W. Bousselham, A. W. Boggust, S. Chaybouti, H. Strobelt, and H. Kuehne. LeGrad: An explainability method for vision transformers via feature formation sensitivity. In ICCV, pages 20336–20345, 2024

  8. [8]

    T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah. Towards monosemanticity: Decomposing language models with dictionar...

  9. [9]

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...

  10. [10]

    B. Bussmann, P. Leask, and N. Nanda. BatchTopK sparse autoencoders. In NeurIPS, 2024

  11. [11]

    B. Bussmann, N. Nabeshima, A. Karvonen, and N. Nanda. Learning multi-level features with matryoshka sparse autoencoders. In ICML, 2025

  12. [12]

    R. J. G. B. Campello, D. Moulavi, and J. Sander. Density-based clustering based on hierarchical density estimates. In PAKDD, pages 160–172, 2013

  13. [13]

    C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. Su. This looks like that: Deep learning for interpretable image recognition. In NeurIPS, pages 8928–8939, 2019

  14. [14]

    V. Costa, T. Fel, E. S. Lubana, B. Tolooshams, and D. E. Ba. From flat to hierarchical: Extracting sparse representations with matching pursuit. In NeurIPS, 2025

  15. [15]

    J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009

  16. [16]

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  17. [17]

    M. Dreyer, E. Purelku, J. Vielhaben, W. Samek, and S. Lapuschkin. PURE: Turning polysemantic neurons into pure features by identifying relevant circuits. In CVPRW, pages 8212–8217, 2024

  18. [18]

    J. Dunefsky, P. Chlenski, and N. Nanda. Transcoders find interpretable LLM feature circuits. In NeurIPS, 2024

  19. [19]

    N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. B. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah. Toy models of superposition. Transformer Circuits Thread, 2022

  20. [20]

    S. Eslami, M. Gaiduk, M. Krimmel, L. Milliken, B. Wang, and D. Bykov. Diffusion-pretrained dense and contextual embeddings. arXiv:2602.11151 [cs.LG], 2026

  21. [21]

    T. Fel, T. Boissin, V. Boutin, A. M. Picard, P. Novello, J. Colin, D. Linsley, T. Rousseau, R. Cadène, L. Gardes, and T. Serre. Unlocking feature visualization for deeper networks with magnitude constrained optimization. In NeurIPS, 2023

  22. [22]

    T. Fel, E. S. Lubana, J. S. Prince, M. Kowal, V. Boutin, I. Papadimitriou, B. Wang, M. Wattenberg, D. E. Ba, and T. Konkle. Archetypal SAE: Adaptive and stable dictionary learning for concept extraction in large vision models. In ICML, 2025

  23. [23]

    J. Fischer, A. Olah, and J. Vreeken. What’s in the box? exploring the inner life of neural networks with robust rules. In ICML, 2021

  24. [24]

    L. Gao, T. D. la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, and J. Wu. Scaling and evaluating sparse autoencoders. In ICLR, 2025

  25. [25]

    A. Gorgun, B. Schiele, and J. Fischer. VITAL: More understandable feature visualization through distribution alignment and relevant information flow. In ICCV, 2025

  26. [26]

    J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979

  27. [27]

    D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, D. Song, J. Steinhardt, and J. Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, pages 8320–8329, 2021

  28. [28]

    R. Hesse, S. Schaub-Meyer, and S. Roth. Fast axiomatic attribution for neural networks. In NeurIPS, pages 19513–19524, 2021

  29. [29]

    R. Hesse, J. Fischer, S. Schaub-Meyer, and S. Roth. Disentangling polysemantic channels in convolutional neural networks. In CVPRW, pages 4799–4803, 2025

  30. [30]

    R. Hesse, D. Bagci, B. Schiele, S. Schaub-Meyer, and S. Roth. Beyond accuracy: What matters in designing well-behaved image classification models? TMLR, 2026

  31. [31]

    R. Hesse, S. Schaub-Meyer, J. Hesse, B. Schiele, and S. Roth. What is missing? explaining neurons activated by absent concepts. In ICML, 2026

  32. [32]

    R. Huben, H. Cunningham, L. R. Smith, A. Ewart, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models. In ICLR, 2024

  33. [33]

    Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12):248:1–248:38, 2023

  34. [34]

    J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, ...

  35. [35]

    P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang. Concept bottleneck models. In ICML, pages 5338–5348, 2020

  36. [36]

    B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, and C. Li. LLaVA-OneVision: Easy visual task transfer. TMLR, 2025

  37. [37]

    T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014

  38. [38]

    H. Liu, W. Xue, Y. Chen, D. Chen, X. Zhao, K. Wang, L. Hou, R. Li, and W. Peng. A survey on hallucination in large vision-language models. arXiv:2402.00253 [cs.CV], 2024

  39. [39]

    A. Makhzani and B. J. Frey. k-sparse autoencoders. In ICLR, 2014

  40. [40]

    R. Maser, S. Gairola, S. Rao, and B. Schiele. Align once to explain: Feature alignment for scalable b-cosification of foundational vision transformers. In CVPR, pages 9869–9879, 2026

  41. [41]

    A. Mueller, J. Brinkmann, M. L. Li, S. Marks, K. Pal, N. Prakash, C. Rager, A. Sankaranarayanan, A. S. Sharma, J. Sun, E. Todd, D. Bau, and Y. Belinkov. The quest for the right mediator: Surveying mechanistic interpretability through the lens of causal mediation analysis. arXiv:2408.01416 [cs.LG], 2024

  42. [42]

    T. P. Oikarinen and T. Weng. Clip-dissect: Automatic description of neuron representations in deep vision networks. In ICLR, 2023

  43. [43]

    C. Olah, L. Schubert, and A. Mordvintsev. Feature visualization. Distill, 2017

  44. [44]

    C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter. Zoom in: An introduction to circuits. Distill, 2020

  45. [45]

    L. O’Mahony, V. Andrearczyk, H. Müller, and M. Graziani. Disentangling neuron representations with concept vectors. In CVPRW, pages 3770–3775, 2023

  46. [46]

    M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without supervis...

  47. [47]

    M. Pach, S. Karthik, Q. Bouniot, S. J. Belongie, and Z. Akata. Sparse autoencoders learn monosemantic features in vision-language models. In NeurIPS, 2025

  48. [48]

    A. Parchami-Araghi, S. Rao, J. Fischer, and B. Schiele. Fact: Faithful concept traces for explaining neural network decisions. In NeurIPS, 2025

  49. [49]

    G. Paulo and N. Belrose. Sparse autoencoders trained on the same data learn different features. ICLR, 2026

  50. [50]

    N. Pham, A. Jesslen, B. Schiele, A. Kortylewski, and J. Fischer. Interpretable 3D neural object volumes for robust conceptual reasoning. In ICLR, 2026

  51. [51]

    C. Qin, C. Venhoff, S. Joseph, F. Xiao, and S. Scherer. Sparse CLIP: Co-optimizing interpretability and performance in contrastive learning. ICLR, 2026

  52. [52]

    D. Rai, Y. Zhou, S. Feng, A. Saparov, and Z. Yao. A practical review of mechanistic interpretability for transformer-based language models. arXiv:2407.02646 [cs.AI], 2024

  53. [53]

    S. Rajamanoharan, T. Lieberum, N. Sonnerat, A. Conmy, V. Varma, J. Kramár, and N. Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv:2407.14435 [cs.LG], 2024

  54. [54]

    N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In EMNLP-IJCNLP, pages 3980–3990, 2019

  55. [55]

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015

  56. [56]

    A. Scherlis, K. Sachan, A. S. Jermyn, J. Benton, and B. Shlegeris. Polysemanticity and capacity in neural networks. arXiv:2210.01892 [cs.NE], 2022

  57. [57]

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017

  58. [58]

    L. Sharkey, B. Chughtai, J. Batson, J. Lindsey, J. Wu, L. Bushnaq, N. Goldowsky-Dill, S. Heimersheim, A. Ortega, J. I. Bloom, S. Biderman, A. Garriga-Alonso, A. Conmy, N. Nanda, J. Rumbelow, M. Wattenberg, N. Schoots, J. Miller, W. Saunders, E. J. Michaud, S. Casper, M. Tegmark, D. Bau, E. Todd, A. Geiger, M. Geva, J. Hoogland, D. Murfet, and T. McGra...

  59. [59]

    K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLRW, 2014

  60. [60]

    A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and L. Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. TMLR, 2022

  61. [61]

    M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In D. Precup and Y. W. Teh, editors, ICML, pages 3319–3328, 2017

  62. [62]

    G. Team. Gemma 3 technical report. arXiv:2503.19786 [cs.CL], 2025

  63. [63]

    Q. Team. Qwen3-VL technical report. arXiv:2505.09388 [cs.CL], 2025

  64. [64]

    A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, H. Cunningham, N. L. Turner, C. McDougall, M. MacDiarmid, C. D. Freeman, T. R. Sumers, E. Rees, J. Batson, A. Jermyn, S. Carter, C. Olah, and T. Henighan. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transform...

  65. [65]

    M. Tschannen, A. A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, O. J. Hénaff, J. Harmsen, A. Steiner, and X. Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv:2502.14786 [cs.CV], 2025

  66. [66]

    A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid. Steering language models with activation engineering. arXiv:2308.10248 [cs.CL], 2023

  67. [67]

    Y. Wang, H. Li, X. Han, P. Nakov, and T. Baldwin. Do-not-answer: A dataset for evaluating safeguards in LLMs. In EACL, 2024

  68. [68]

    R. Wightman. PyTorch image models, 2019

  69. [69]

    K. Wittenmayer, S. Rao, A. Parchami-Araghi, B. Schiele, and J. Fischer. CFM: Language-aligned concept foundation model for vision. arXiv:2601.13798 [cs.CV], 2026

  70. [70]

    S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie. ConvNeXt V2: Co-designing and scaling convnets with masked autoencoders. In CVPR, pages 16133–16142, 2023

  71. [71]

    Z. Wu, A. Arora, A. Geiger, Z. Wang, J. Huang, D. Jurafsky, C. D. Manning, and C. Potts. AxBench: Steering llms? even simple baselines outperform sparse autoencoders. In ICML, 2025

  72. [72]

    M. Xue, Q. Huang, H. Zhang, J. Hu, J. Song, M. Song, and C. Jin. ProtoPFormer: Concentrating on prototypical parts in vision transformers for interpretable image recognition. In IJCAI, pages 1516–1524, 2024

  73. [73]

    H. Yin, P. Molchanov, J. M. Álvarez, Z. Li, A. Mallya, D. Hoiem, N. K. Jha, and J. Kautz. Dreaming to distill: Data-free knowledge transfer via deepinversion. In CVPR, pages 8712–8721, 2020

  74. [74]

    V. Zaigrajew, H. Baniecki, and P. Biecek. Interpreting CLIP with hierarchical sparse autoencoders. In ICML, 2025

  75. [75]

    M. E. Zarlenga, P. Barbiero, G. Ciravegna, G. Marra, F. Giannini, M. Diligenti, Z. Shams, F. Precioso, S. Melacci, A. Weller, P. Lió, and M. Jamnik. Concept embedding models: Beyond the accuracy-explainability trade-off. In NeurIPS, 2022

  76. [76]

    Y. Zhang, J. Jia, X. Chen, A. Chen, Y. Zhang, J. Liu, K. Ding, and S. Liu. To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images ... for now. In ECCV, pages 385–403, 2024

  77. [77]

    L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In NeurIPS, 2023

  78. [78]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y. Cao, Y. Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y. He, T. Jiang, J. Luo, Y. Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y. Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. W...