pith. machine review for the scientific record.

arxiv: 2605.09438 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords: crosscoders · sparse autoencoders · cross-layer features · transformer interpretability · factorized models · feature discovery · mechanistic interpretability · LLM evaluation

The pith

Factorized masked crosscoders recover 3-13 times more cross-layer features than standard crosscoders in transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many features in pretrained transformers emerge or persist across multiple layers, yet standard crosscoders trained jointly on all layers mostly learn latents driven by only one or two layers. The paper shows this happens because the encoder and decoder weights can vary freely per layer and because nothing penalizes a latent for ignoring most layers. fmxcoders replace the weights with low-rank tensor factorizations that draw every latent's per-layer components from one shared cross-layer basis, then add stochastic layer masking during training, which penalizes latents whose contribution collapses when a single layer is dropped. Across GPT-2 Small, Pythia-410M, Pythia-1.4B, and Gemma-2-2B, the resulting latents show 10-30 point gains in probing F1, 25-50% lower reconstruction error, roughly double the functional coherence, and 3-13 times more latents judged semantically coherent by an LLM.

Core claim

Standard crosscoders fail to recover cross-layer features because their per-layer weights are unconstrained and cross-layer dependence is unregularized, so most latents collapse to surface patterns in a single layer. fmxcoders address both problems by replacing the encoder and decoder with low-rank tensor factorizations whose per-layer slices come from a shared basis, and by training with stochastic layer masking that penalizes any latent whose activation collapses when one layer is masked. The resulting shared latents act as better concept detectors and yield large gains in probing accuracy, reconstruction quality, and semantic coherence across four different base models.

What carries the argument

Low-rank tensor factorization of the encoder and decoder weights that forces every latent to draw its per-layer contributions from one shared cross-layer basis, combined with stochastic layer masking that penalizes latents for depending on only one or two layers.
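
To make the machinery concrete, here is a minimal sketch of what such a factorized masked crosscoder could look like. It assumes a CP-style factorization and a simple Bernoulli layer mask; the class name FMXCrosscoder, the rank and mask_prob parameters, and the einsum layout are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of a factorized masked crosscoder, assuming a CP-style low-rank
# factorization and per-step stochastic layer masking. All names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FMXCrosscoder(nn.Module):
    def __init__(self, n_layers, d_model, d_latent, rank, mask_prob=0.1):
        super().__init__()
        self.mask_prob = mask_prob
        # Shared cross-layer basis: each latent's per-layer weight slice is a mixture
        # of `rank` basis directions, so per-layer weights cannot vary freely.
        self.enc_layer = nn.Parameter(torch.randn(n_layers, rank) / rank ** 0.5)
        self.enc_in = nn.Parameter(torch.randn(d_model, rank) / d_model ** 0.5)
        self.enc_latent = nn.Parameter(torch.randn(d_latent, rank) / rank ** 0.5)
        self.dec_layer = nn.Parameter(torch.randn(n_layers, rank) / rank ** 0.5)
        self.dec_out = nn.Parameter(torch.randn(d_model, rank) / d_model ** 0.5)
        self.dec_latent = nn.Parameter(torch.randn(d_latent, rank) / rank ** 0.5)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.b_dec = nn.Parameter(torch.zeros(n_layers, d_model))

    def encode(self, x, layer_mask=None):
        # x: (batch, n_layers, d_model) activations collected from every layer.
        if layer_mask is not None:
            x = x * layer_mask[None, :, None]                  # drop masked layers
        z = torch.einsum("bld,dr->blr", x, self.enc_in)        # project onto input basis
        z = torch.einsum("blr,lr->br", z, self.enc_layer)      # weight and sum over layers
        return F.relu(torch.einsum("br,jr->bj", z, self.enc_latent) + self.b_enc)

    def decode(self, h):
        # One shared latent code reconstructs every layer through the decoder factors.
        z = torch.einsum("bj,jr->br", h, self.dec_latent)
        return torch.einsum("br,lr,dr->bld", z, self.dec_layer, self.dec_out) + self.b_dec

    def forward(self, x):
        # Stochastic layer masking: encode a randomly layer-dropped view but reconstruct
        # the full target, so latents that lean on one layer lose signal when it is dropped.
        layer_mask = None
        if self.training:
            layer_mask = (torch.rand(x.shape[1], device=x.device) > self.mask_prob).float()
        h = self.encode(x, layer_mask)
        return self.decode(h), h


# One training step under these assumptions: reconstruction MSE plus an L1 sparsity penalty.
model = FMXCrosscoder(n_layers=12, d_model=768, d_latent=4096, rank=64)
x = torch.randn(8, 12, 768)                    # stand-in residual-stream activations
x_hat, h = model(x)
loss = F.mse_loss(x_hat, x) + 1e-3 * h.abs().mean()
loss.backward()
```

The paper's actual loss, factorization variant, and masking schedule may differ; the point of the sketch is only that the per-layer weight slices are tied through a shared basis and that masking makes single-layer latents costly.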

Load-bearing premise

The functional coherence metric and the LLM-as-a-judge scores truly reflect genuine cross-layer semantic meaning rather than simply rewarding the factorized architecture.
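
The functional coherence metric is referenced throughout but not reproduced on this page. As a point of reference for the objection above, here is one plausible proxy: score each latent by the effective number of layers driving its pre-activation, so a single-layer latent scores near 1/L and an evenly spread one near 1. The function name and the exponentiated-entropy choice are assumptions, not the paper's definition.

```python
# One plausible proxy for a functional-coherence score; illustrative only.
import numpy as np


def functional_coherence(per_layer_contrib: np.ndarray) -> np.ndarray:
    """per_layer_contrib: (n_latents, n_layers) mean |contribution| of each layer
    to each latent's pre-activation over an evaluation batch."""
    n_latents, n_layers = per_layer_contrib.shape
    p = per_layer_contrib / (per_layer_contrib.sum(axis=1, keepdims=True) + 1e-12)
    # Exponentiated entropy = effective number of contributing layers per latent.
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    effective_layers = np.exp(entropy)
    return effective_layers / n_layers   # ~1/L if one layer dominates, ~1 if spread evenly


# A layer-localized latent scores low; a genuinely cross-layer latent scores near 1.
contrib = np.array([[5.0, 0.1, 0.1, 0.1],     # driven almost entirely by layer 0
                    [1.0, 1.1, 0.9, 1.0]])    # draws evenly on all four layers
print(functional_coherence(contrib))           # ≈ [0.33, 1.00]
```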

What would settle it

An ablation that removes either the tensor factorization or the layer masking and finds that functional coherence and probing F1 gains disappear, or a human annotation study in which the new latents do not receive higher semantic coherence ratings than those from standard crosscoders.
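
Probing F1 is the headline quantitative metric in that comparison. A minimal sketch of how such a probe might be scored, assuming a linear probe fit on latent activations against a binary concept label; the paper's actual probing protocol, datasets, and probe are not shown on this page.

```python
# Sketch of a probing-F1 evaluation on crosscoder latent activations (assumptions noted above).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def probing_f1(latent_acts: np.ndarray, concept_labels: np.ndarray, seed: int = 0) -> float:
    """latent_acts: (n_tokens, d_latent) activations; concept_labels: (n_tokens,) binary labels."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        latent_acts, concept_labels, test_size=0.2, random_state=seed, stratify=concept_labels)
    probe = LogisticRegression(max_iter=1000, C=1.0)
    probe.fit(X_tr, y_tr)
    return f1_score(y_te, probe.predict(X_te))


# Synthetic stand-in data: latents that partially separate the concept.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
acts = rng.normal(size=(2000, 64)) + 0.8 * labels[:, None] * rng.normal(size=(1, 64))
print(f"probing F1: {probing_f1(acts, labels):.3f}")
```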

Figures

Figures reproduced from arXiv: 2605.09438 by Andreas D. Demou, James Oldfield, Mihalis A. Nicolaou, Panagiotis Koromilas, Yannis Panagakis.

Figure 1. Standard crosscoders (top) parameterize the encoder and decoder as L independent per-layer matrices and recover latents whose functional dependence collapses onto a few layers, capturing surface-level patterns. fmxcoders (bottom) replace these with low-rank tensor factorizations that share a cross-layer basis across all latents, and add stochastic layer masking as a denoising regularizer along the layer axis…

Figure 2. Norm coherence c_n (blue) and functional coherence c_f (green) latent distributions for crosscoders (top row) and fmxcoder (bottom row) variants trained on Pythia-410M. The average value and standard deviation of each coherence distribution are also shown in red.

Figure 3. Heat maps for MSE, mean F1, and mean functional coherence.

Figure 4. Similar to Figure 3, for CP.
Original abstract

Many features in pretrained Transformers span multiple layers: they emerge through stages of inference, persist in the residual stream, or are built jointly by parallel MLPs. Crosscoders (namely, sparse dictionaries trained jointly across layers) aim to recover these cross-layer features in a single shared latent space. We show that standard crosscoders largely fail at this purpose. Although their decoder weight norms spread evenly across layers, a functional coherence metric we introduce reveals that each latent's activation is effectively driven by only one or two layers on average. While functionally coherent latents act as human-interpretable concept detectors (e.g., US states and cities), the layer-localized latents that crosscoders predominantly learn collapse onto surface-level patterns such as digit detectors. We trace this failure to two structural limitations: unconstrained cross-layer parameterization and unregularized cross-layer dependence. We address both by introducing fmxcoders, which (i) replace the encoder and decoder with low-rank tensor factorizations that draw every latent's per-layer weights from a shared cross-layer basis, and (ii) apply stochastic layer masking, a denoising regularizer along the layer axis that penalizes latents whose contribution collapses when a single layer is masked. Across GPT2-Small, Pythia-410M, Pythia-1.4B, and Gemma2-2B, fmxcoders lift mean probing F1 by 10-30 points, surpassing per-layer SAE baselines that standard crosscoders fail to reach, reduce reconstruction MSE by 25-50%, and roughly double mean functional coherence. An LLM-as-a-judge evaluation further shows that fmxcoders recover 3-13$\times$ more semantically coherent latents than standard crosscoders across all four base LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that standard crosscoders for sparse dictionary learning across transformer layers largely fail to recover truly cross-layer features, as their latents are predominantly driven by only one or two layers despite even decoder weight norms; it introduces fmxcoders using low-rank tensor factorizations for shared cross-layer bases in encoder and decoder plus stochastic layer masking as a regularizer, and reports that this yields 10-30 point gains in mean probing F1 (surpassing per-layer SAE baselines), 25-50% lower reconstruction MSE, roughly doubled mean functional coherence, and 3-13× more semantically coherent latents (via LLM-as-a-judge) across GPT2-Small, Pythia-410M, Pythia-1.4B, and Gemma2-2B.

Significance. If the newly introduced functional coherence metric and LLM judge evaluations prove to be architecture-agnostic measures of genuine cross-layer semantics, the work would meaningfully advance mechanistic interpretability by providing a practical method to extract features that emerge or persist across layers rather than collapsing to surface patterns. The multi-model scale of the evaluation and direct comparison against both standard crosscoders and per-layer SAEs are clear strengths; the introduction of a denoising regularizer along the layer dimension is a technically clean idea.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (functional coherence metric definition): The metric is defined to penalize latents whose activations are driven by few layers on average. Because fmxcoders explicitly optimize against single-layer collapse via stochastic masking, reported doublings of mean functional coherence risk being partly tautological rather than evidence of superior feature discovery; an ablation isolating the masking term from the factorization (or an external validation set of known cross-layer concepts) is needed to establish that the metric adds independent information.
  2. [Abstract and §4] Abstract and §4 (LLM-as-a-judge evaluation): The claim of recovering 3-13× more semantically coherent latents rests on an LLM judge whose prompts and scoring rubric are not shown to be invariant to the structural differences (low-rank shared bases) introduced by fmxcoders. If the judge implicitly rewards more factorized or less surface-level activations, the multiplier cannot be attributed cleanly to better cross-layer discovery; a blinded human evaluation on a held-out subset or comparison against an architecture-agnostic coherence probe would strengthen the result.
  3. [§5] §5 (experimental results): The reported probing F1 and MSE gains are presented without error bars, standard deviations across random seeds, or statistical significance tests. Given that the central empirical claim is consistent improvement across four models and multiple metrics, the absence of variability estimates makes it difficult to assess whether the 10-30 point F1 lift and 25-50% MSE reduction are robust or sensitive to hyperparameter choices such as factorization rank.
minor comments (3)
  1. [Methods] The paper should report the chosen factorization rank for each model and any sensitivity analysis, as this is listed among the free parameters and directly affects the low-rank assumption.
  2. [Figures] Figure captions and axis labels in the results section would benefit from explicit mention of the number of latents and the exact masking probability used, to aid reproducibility.
  3. [§5] A brief discussion of how the per-layer SAE baselines were trained (same sparsity, same dictionary size) would clarify that the comparison is fair.
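
For reference on the judge-bias concern in major comment 2 above, here is a minimal sketch of what an LLM-as-a-judge coherence count could look like. The query_judge callable, the prompt wording, and the YES/NO rubric are hypothetical stand-ins; the paper's actual prompts and scoring rubric are not shown on this page.

```python
# Sketch of an LLM-as-a-judge coherence count; `query_judge` is a hypothetical stand-in
# for whatever chat model is used, and the prompt/rubric are illustrative assumptions.
from typing import Callable, Dict, Sequence


def judge_latent(top_examples: Sequence[str], query_judge: Callable[[str], str]) -> bool:
    """Ask the judge whether one latent's top activating snippets share a coherent concept."""
    prompt = (
        "Below are text snippets on which one dictionary latent activates most strongly.\n"
        "Do they share a single, clearly describable concept? Answer YES or NO.\n\n"
        + "\n".join(f"- {ex}" for ex in top_examples)
    )
    return query_judge(prompt).strip().upper().startswith("YES")


def coherent_fraction(latents_to_examples: Dict[int, Sequence[str]],
                      query_judge: Callable[[str], str]) -> float:
    """Fraction of latents judged semantically coherent; the 3-13x claim compares this
    count for fmxcoders against standard crosscoders on the same base model."""
    verdicts = [judge_latent(exs, query_judge) for exs in latents_to_examples.values()]
    return sum(verdicts) / max(len(verdicts), 1)
```

A blinded human rater could be substituted for query_judge on a held-out subset, which is essentially the check the referee asks for.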

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We have carefully reviewed each major comment and provide detailed point-by-point responses below. We outline the specific revisions we will make to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (functional coherence metric definition): The metric is defined to penalize latents whose activations are driven by few layers on average. Because fmxcoders explicitly optimize against single-layer collapse via stochastic masking, reported doublings of mean functional coherence risk being partly tautological rather than evidence of superior feature discovery; an ablation isolating the masking term from the factorization (or an external validation set of known cross-layer concepts) is needed to establish that the metric adds independent information.

    Authors: We acknowledge the close relationship between the stochastic layer masking regularizer and the functional coherence metric, as the former is explicitly designed to discourage single-layer collapse. However, the metric itself is an independent, post-hoc measure of the average number of layers that meaningfully contribute to each latent's activations, and it is not part of the training loss. To demonstrate that the reported gains reflect genuine improvements in cross-layer feature discovery rather than a tautology, we will add an ablation study in the revised manuscript. This ablation will train factorized models both with and without the masking term and report the resulting differences in functional coherence, probing F1, MSE, and the number of coherent latents. We will also include concrete examples of known cross-layer concepts (e.g., entity tracking and syntactic features) recovered by fmxcoders to provide external validation of the metric. revision: yes

  2. Referee: [Abstract and §4] Abstract and §4 (LLM-as-a-judge evaluation): The claim of recovering 3-13× more semantically coherent latents rests on an LLM judge whose prompts and scoring rubric are not shown to be invariant to the structural differences (low-rank shared bases) introduced by fmxcoders. If the judge implicitly rewards more factorized or less surface-level activations, the multiplier cannot be attributed cleanly to better cross-layer discovery; a blinded human evaluation on a held-out subset or comparison against an architecture-agnostic coherence probe would strengthen the result.

    Authors: We agree that transparency and validation of the LLM-as-a-judge protocol are essential to rule out bias from the low-rank factorization. In the revised manuscript, we will include the complete prompts and scoring rubric in the appendix. To further strengthen the claim, we will add a blinded human evaluation on a held-out subset of 100 latents per model, where human evaluators (unaware of the source method) rate semantic coherence based on top activating examples and tokens. We will report inter-rater agreement and correlation with the LLM scores. While a full-scale human study across all models and latents is not feasible, this targeted evaluation will provide independent corroboration that the 3-13× increase corresponds to improved cross-layer semantic coherence. revision: partial

  3. Referee: [§5] §5 (experimental results): The reported probing F1 and MSE gains are presented without error bars, standard deviations across random seeds, or statistical significance tests. Given that the central empirical claim is consistent improvement across four models and multiple metrics, the absence of variability estimates makes it difficult to assess whether the 10-30 point F1 lift and 25-50% MSE reduction are robust or sensitive to hyperparameter choices such as factorization rank.

    Authors: We thank the referee for highlighting this important presentation issue. In the revised §5, we will report all key metrics (probing F1, MSE, functional coherence) with error bars indicating standard deviation across at least three independent random seeds per configuration. We will also include statistical significance tests (paired t-tests with p-values) for the primary comparisons against standard crosscoders and per-layer SAEs. Additionally, we will add a sensitivity analysis varying the factorization rank and reporting its effect on performance to address robustness to hyperparameter choices. revision: yes
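
On the statistics point, the promised reporting is straightforward; a sketch of a paired comparison across seeds is below. The numbers are placeholders, not results from the paper.

```python
# Sketch of the seed-variability reporting the rebuttal promises: paired t-tests on
# per-seed metrics for fmxcoders vs. standard crosscoders. Values are placeholders.
import numpy as np
from scipy import stats

# Hypothetical probing-F1 scores from three matched seeds per method (same data splits).
f1_crosscoder = np.array([0.58, 0.61, 0.60])
f1_fmxcoder = np.array([0.79, 0.82, 0.80])

t_stat, p_value = stats.ttest_rel(f1_fmxcoder, f1_crosscoder)
gains = f1_fmxcoder - f1_crosscoder
print(f"mean gain: {gains.mean():+.3f} ± {gains.std(ddof=1):.3f}, "
      f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```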

Circularity Check

0 steps flagged

No significant circularity in fmxcoders' empirical claims

Full rationale

The paper introduces a new architecture (low-rank factorized encoder/decoder plus stochastic layer masking) and a new functional coherence metric to diagnose standard crosscoders. Reported gains in probing F1 (10-30 points) and reconstruction MSE (25-50%) are measured on independent tasks across multiple models and are not reducible to quantities defined inside the training objective or metric by construction. The coherence improvement is expected from the explicit regularizer but does not render the other metrics tautological. No self-citations, uniqueness theorems, ansatzes smuggled via prior work, or renamings of known results appear as load-bearing steps. This is the normal non-circular outcome for an empirical architecture paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions from sparse dictionary learning plus two new architectural choices; no new physical entities are postulated.

free parameters (2)
  • factorization rank
    The shared basis dimension in the low-rank tensor factorization is a hyperparameter that must be chosen.
  • masking probability
    The probability used for stochastic layer masking is a design choice that affects regularization strength.
axioms (1)
  • domain assumption: Features in pretrained transformers can be usefully represented as sparse activations of dictionary elements that are consistent across layers.
    This is the core premise that justifies training any crosscoder.

pith-pipeline@v0.9.0 · 5647 in / 1325 out tokens · 59962 ms · 2026-05-12T04:49:41.685858+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 6 internal anchors

  1. [1]

    Toy models of superposition,

    N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah, “Toy models of superposition,”Transformer Circuits Thread,

  2. [2]

    [Online]. Available: https://transformer-circuits.pub/2022/toy_model/index.html

  3. [3]

    Open Problems in Mechanistic Interpretability

    L. Sharkey, B. Chughtai, J. Batson, J. Lindsey, J. Wu, L. Bushnaq, N. Goldowsky-Dill, S. Heimer- sheim, A. Ortega, J. Bloom, S. Biderman, A. Garriga-Alonso, A. Conmy, N. Nanda, J. Rumbe- low, M. Wattenberg, N. Schoots, J. Miller, E. J. Michaud, S. Casper, M. Tegmark, W. Saunders, D. Bau, E. Todd, A. Geiger, M. Geva, J. Hoogland, D. Murfet, and T. McGrath,...

  4. [4]

    Towards monosemanticity: Decomposing language models with dictionary learning,

    T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y . Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah, “Towards monosemanticity: Decomposing language models with dictiona...

  5. [5]

    Sparse autoencoders find highly interpretable features in language models,

    H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey, “Sparse autoencoders find highly interpretable features in language models,” inInternational Conference on Learning Representations (ICLR), 2024

  6. [6]

    Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet,

    A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, H. Cunningham, N. L. Turner, C. McDougall, M. MacDiarmid, C. D. Freeman, T. R. Sumers, E. Rees, J. Batson, A. Jermyn, S. Carter, C. Olah, and T. Henighan, “Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet,”Transfo...

  7. [7]

    Scaling and evaluating sparse autoencoders,

    L. Gao, T. Dupré la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, and J. Wu, “Scaling and evaluating sparse autoencoders,” inInternational Conference on Learning Representations (ICLR), 2025, oral presentation

  8. [8]

    Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders,

    S. Rajamanoharan, T. Lieberum, N. Sonnerat, A. Conmy, V . Varma, J. Kramár, and N. Nanda, “Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders,”arXiv preprint arXiv:2407.14435, 2024

  9. [9]

    BatchTopK sparse autoencoders,

    B. Bussmann, P. Leask, and N. Nanda, “BatchTopK sparse autoencoders,” inNeurIPS 2024 Workshop on Scientific Methods for Understanding Neural Networks, 2024

  10. [10]

    Polysae: Modeling feature interactions in sparse autoencoders via polynomial decoding,

    P. Koromilas, A. D. Demou, J. Oldfield, Y. Panagakis, and M. Nicolaou, “Polysae: Modeling feature interactions in sparse autoencoders via polynomial decoding,” arXiv preprint arXiv:2602.01322, 2026

  11. [11]

    Knowledge neurons in pretrained transformers,

    D. Dai, L. Dong, Y . Hao, Z. Sui, B. Chang, and F. Wei, “Knowledge neurons in pretrained transformers,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 8493–8502

  12. [12]

    Remarkable robustness of LLMs: Stages of inference?

    V . Lad, J. H. Lee, W. Gurnee, and M. Tegmark, “Remarkable robustness of LLMs: Stages of inference?” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  13. [13]

    [Online]. Available: https://openreview.net/forum?id=Wxh5Xz7NpJ

  14. [14]

    Transformer feed-forward layers build pre- dictions by promoting concepts in the vocabulary space,

    M. Geva, A. Caciularu, K. Wang, and Y . Goldberg, “Transformer feed-forward layers build pre- dictions by promoting concepts in the vocabulary space,” inProceedings of the 2022 conference on empirical methods in natural language processing, 2022, pp. 30–45

  15. [15]

    Universal neurons in GPT2 language models,

    W. Gurnee, T. Horsley, Z. C. Guo, T. R. Kheirkhah, Q. Sun, W. Hathaway, N. Nanda, and D. Bertsimas, “Universal neurons in gpt2 language models,”arXiv preprint arXiv:2401.12181, 2024

  16. [16]

    Residual stream analysis with multi-layer SAEs,

    T. Lawson, L. Farnik, C. Houghton, and L. Aitchison, “Residual stream analysis with multi-layer SAEs,” in International Conference on Learning Representations (ICLR), 2025

  17. [17]

    Mechanistic permutability: Match features across layers,

    N. Balagansky, I. Maksimov, and D. Gavrilov, “Mechanistic permutability: Match features across layers,” inInternational Conference on Learning Representations (ICLR), 2025

  18. [18]

    Sparse crosscoders for cross-layer features and model diffing,

    J. Lindsey, A. Templeton, J. Marcus, T. Conerly, J. Batson, and C. Olah, “Sparse crosscoders for cross-layer features and model diffing,”Transformer Circuits Thread, 2024. [Online]. Available: https://transformer-circuits.pub/2024/crosscoders/index.html

  19. [19]

    Circuit tracing: Revealing computational graphs in language models,

    E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N. L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. B. Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, and J. Batson, “Circuit tracing: Revealing computati...

  20. [20]

    Overcoming sparsity artifacts in crosscoders to interpret chat-tuning,

    C. Dumas, J. Minder, C. Juang, B. Chughtai, and N. Nanda, “Overcoming sparsity artifacts in crosscoders to interpret chat-tuning,” inMechanistic Interpretability Workshop at NeurIPS 2025, 2025

  21. [21]

    Tensor methods in computer vision and deep learning,

    Y . Panagakis, J. Kossaifi, G. G. Chrysos, J. Oldfield, M. A. Nicolaou, A. Anandkumar, and S. Zafeiriou, “Tensor methods in computer vision and deep learning,”Proceedings of the IEEE, vol. 109, no. 5, pp. 863–890, 2021

  22. [22]

    Understanding polysemanticity in neural networks through coding theory,

    S. C. Marshall and J. H. Kirchner, “Understanding polysemanticity in neural networks through coding theory,”arXiv preprint arXiv:2401.17975, 2024

  23. [23]

    Dropout: A simple way to prevent neural networks from overfitting,

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,”Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014

  24. [24]

    Towards automated circuit discovery for mechanistic interpretability,

    A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso, “Towards automated circuit discovery for mechanistic interpretability,”Advances in Neural Information Processing Systems, vol. 36, pp. 16 318–16 352, 2023

  25. [25]

    How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model,

    M. Hanna, O. Liu, and A. Variengien, “How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model,”Advances in Neural Information Processing Systems, vol. 36, pp. 76 033–76 060, 2023

  26. [26]

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    S. Marks, C. Rager, E. J. Michaud, Y . Belinkov, D. Bau, and A. Mueller, “Sparse feature circuits: Discovering and editing interpretable causal graphs in language models,”arXiv preprint arXiv:2403.19647, 2024

  27. [27]

    Internal states before wait modulate reasoning patterns,

    D. Troitskii, K. Pal, C. Wendler, C. S. McDougall, and N. Nanda, “Internal states before wait modulate reasoning patterns,”Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, 2025

  28. [28]

    Foundation Models for Discovery and Exploration in Chemical Space

    A. Wadell, A. Bhutani, V . Azumah, A. R. Ellis-Mohr, C. Kelly, H. Zhao, A. K. Nayak, K. Hegazy, A. Brace, and H. Lin, “Foundation models for discovery and exploration in chemical space,” arXiv preprint arXiv:2510.18900, 2025

  29. [29]

    Towards understanding distilled reasoning models: A representational approach,

    D. D. Baek and M. Tegmark, “Towards understanding distilled reasoning models: A representational approach,” arXiv preprint arXiv:2503.03730, 2025

  30. [30]

    Evolution of concepts in language model pre-training,

    X. Ge, W. Shu, J. Wu, Y . Zhou, Z. He, and X. Qiu, “Evolution of concepts in language model pre-training,” inThe Fourteenth International Conference on Learning Representations, 2026

  31. [31]

    Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

    D. Bayazit, A. Mueller, and A. Bosselut, “Crosscoding through time: Tracking emergence & consolidation of linguistic representations throughout LLM pretraining,”arXiv preprint arXiv:2509.05291, 2025

  32. [32]

    The expression of a tensor or a polyadic as a sum of products,

    F. L. Hitchcock, “The expression of a tensor or a polyadic as a sum of products,”Journal of Mathematics and Physics, vol. 6, no. 1-4, pp. 164–189, 1927

  33. [33]

    Tensor Ring Decomposition

    Q. Zhao, G. Zhou, S. Xie, L. Zhang, and A. Cichocki, “Tensor Ring Decomposition,” arXiv preprint arXiv:1606.05535, 2016

  34. [34]

    Multilinear mixture of experts: Scalable expert specialization through factorization,

    J. Oldfield, M. Georgopoulos, G. G. Chrysos, C. Tzelepis, Y . Panagakis, M. A. Nicolaou, J. Deng, and I. Patras, “Multilinear mixture of experts: Scalable expert specialization through factorization,”Advances in Neural Information Processing Systems, vol. 37, pp. 53 022–53 063, 2024

  35. [35]

    Towards interpretability without sacrifice: Faithful dense layer decomposition with mixture of decoders,

    J. Oldfield, S. Im, S. Li, M. A. Nicolaou, I. Patras, and G. G. Chrysos, “Towards interpretability without sacrifice: Faithful dense layer decomposition with mixture of decoders,”arXiv preprint arXiv:2505.21364, 2025

  36. [36]

    Extracting and composing robust features with denoising autoencoders,

    P. Vincent, H. Larochelle, Y . Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” inProceedings of the 25th International Conference on Machine Learning (ICML). ACM, 2008, pp. 1096–1103

  37. [37]

    Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,

    P. Vincent, H. Larochelle, I. Lajoie, Y . Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010

  38. [38]

    BERT: Pre-training of deep bidirec- tional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirec- tional transformers for language understanding,” inProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019, pp. 4171–4186

  39. [39]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y . Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009

  40. [40]

    Matching pursuits with time-frequency dictionaries,

    S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,”IEEE Trans- actions on signal processing, vol. 41, no. 12, pp. 3397–3415, 1993

  41. [41]

    Multilinear multitask learning,

    B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil, “Multilinear multitask learning,” inInternational Conference on Machine Learning. PMLR, 2013, pp. 1444–1452

  42. [42]

    Deep multi-task representation learning: A tensor factorisation approach,

    Y . Yang and T. Hospedales, “Deep multi-task representation learning: A tensor factorisation approach,”arXiv preprint arXiv:1605.06391, 2016

  43. [43]

    Training with noise is equivalent to Tikhonov regularization,

    C. M. Bishop, “Training with noise is equivalent to Tikhonov regularization,”Neural Computa- tion, vol. 7, no. 1, pp. 108–116, 1995

  44. [44]

    SAELens,

    J. Bloom, C. Tigges, A. Duong, and D. Chanin, “SAELens,” https://github.com/jbloomAus/ SAELens, 2024, gitHub repository

  45. [45]

    Language models are unsu- pervised multitask learners,

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsu- pervised multitask learners,” OpenAI, Tech. Rep., 2019. [Online]. Available: https://cdn.openai. com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

  46. [46]

    Pythia: A suite for analyzing large language models across training and scaling,

    S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. Van Der Wal, “Pythia: A suite for analyzing large language models across training and scaling,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machin...

  47. [47]

    Gemma 2: Improving open language models at a practical size,

    Gemma Team, “Gemma 2: Improving open language models at a practical size,” Google DeepMind, Tech. Rep., 2024. [Online]. Available: https://storage.googleapis.com/ deepmind-media/gemma/gemma-2-report.pdf

  48. [48]

    OpenWebText corpus,

    A. Gokaslan, V . Cohen, E. Pavlick, and S. Tellex, “OpenWebText corpus,” Zenodo, 2019. [Online]. Available: https://zenodo.org/records/3834942

  49. [49]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy, “The Pile: An 800GB dataset of diverse text for language modeling,” arXiv preprint arXiv:2101.00027, 2021

  50. [50]

    SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability,

    A. Karvonen, C. Rager, J. Lin, C. Tigges, J. I. Bloom, D. Chanin, Y .-T. Lau, E. Farrell, C. S. McDougall, K. Ayonrinde, D. Till, M. Wearden, A. Conmy, S. Marks, and N. Nanda, “SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability,” inProceedings of the 42nd International Conference on Machine Learning, ser. Procee...

  51. [51]

    Bias in bios: A case study of semantic representation bias in a high-stakes setting,

    M. De-Arteaga, A. Romanov, H. Wallach, J. Chayes, C. Borgs, A. Chouldechova, S. C. Geyik, K. Kenthapadi, and A. T. Kalai, “Bias in bios: A case study of semantic representation bias in a high-stakes setting,” inProceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19). ACM, 2019, pp. 120–128

  52. [52]

    Character-level convolutional networks for text classification,

    X. Zhang, J. Zhao, and Y . LeCun, “Character-level convolutional networks for text classification,” inAdvances in Neural Information Processing Systems, vol. 28, 2015. [Online]. Available: https: //proceedings.neurips.cc/paper/2015/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf

  53. [53]

    Europarl: A parallel corpus for statistical machine translation,

    P. Koehn, “Europarl: A parallel corpus for statistical machine translation,” inProceedings of Machine Translation Summit X: Papers, Phuket, Thailand, Sep. 13-15 2005, pp. 79–86

  54. [54]

    Github code dataset,

    CodeParrot, “Github code dataset,” https://huggingface.co/datasets/codeparrot/github-code, 2022

  55. [55]

    Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

    Y . Hou, J. Li, Z. He, A. Yan, X. Chen, and J. McAuley, “Bridging language and items for retrieval and recommendation,”arXiv preprint arXiv:2403.03952, 2024

  56. [56]

    Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis,

    R. A. Harshman, “Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis,” UCLA working papers in phonetics, vol. 16, no. 1, p. 84, 1970