pith. machine review for the scientific record.

arxiv: 2605.09438 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords: crosscoders · sparse autoencoders · cross-layer features · transformer interpretability · factorized models · feature discovery · mechanistic interpretability · LLM evaluation

The pith

Factorized masked crosscoders recover 3-13 times more cross-layer features than standard crosscoders in transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many features in pretrained transformers emerge or persist across multiple layers, yet standard crosscoders trained jointly on all layers mostly learn latents driven by only one or two layers. The paper shows this happens because the encoder and decoder weights can vary freely per layer and because nothing penalizes a latent for ignoring most layers. fmxcoders replace the weights with low-rank tensor factorizations that draw every latent's per-layer components from one shared cross-layer basis, then add stochastic layer masking during training, which penalizes latents whose contribution collapses when a single layer is dropped. Across GPT-2 Small, Pythia-410M, Pythia-1.4B, and Gemma-2-2B, the resulting latents show 10-30 point gains in probing F1, 25-50% lower reconstruction error, roughly double the functional coherence, and 3-13 times more latents judged semantically coherent by an LLM.

Core claim

Standard crosscoders fail to recover cross-layer features because their per-layer weights are unconstrained and cross-layer dependence is unregularized, so most latents collapse to surface patterns in a single layer. fmxcoders address both problems by replacing the encoder and decoder with low-rank tensor factorizations whose per-layer slices come from a shared basis, and by training with stochastic layer masking that penalizes any latent whose activation collapses when one layer is masked. The resulting shared latents act as better concept detectors and yield large gains in probing accuracy, reconstruction quality, and semantic coherence across four different base models.

What carries the argument

Low-rank tensor factorization of the encoder and decoder weights that forces every latent to draw its per-layer contributions from one shared cross-layer basis, combined with stochastic layer masking that penalizes latents for depending on only one or two layers.
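
To make the machinery concrete, here is a minimal sketch of what such a factorized masked crosscoder could look like. It assumes a CP-style factorization and a simple Bernoulli layer mask; the class name FMXCrosscoder, the rank and mask_prob parameters, and the einsum layout are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of a factorized masked crosscoder, assuming a CP-style low-rank
# factorization and per-step stochastic layer masking. All names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FMXCrosscoder(nn.Module):
    def __init__(self, n_layers, d_model, d_latent, rank, mask_prob=0.1):
        super().__init__()
        self.mask_prob = mask_prob
        # Shared cross-layer basis: each latent's per-layer weight slice is a mixture
        # of `rank` basis directions, so per-layer weights cannot vary freely.
        self.enc_layer = nn.Parameter(torch.randn(n_layers, rank) / rank ** 0.5)
        self.enc_in = nn.Parameter(torch.randn(d_model, rank) / d_model ** 0.5)
        self.enc_latent = nn.Parameter(torch.randn(d_latent, rank) / rank ** 0.5)
        self.dec_layer = nn.Parameter(torch.randn(n_layers, rank) / rank ** 0.5)
        self.dec_out = nn.Parameter(torch.randn(d_model, rank) / d_model ** 0.5)
        self.dec_latent = nn.Parameter(torch.randn(d_latent, rank) / rank ** 0.5)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.b_dec = nn.Parameter(torch.zeros(n_layers, d_model))

    def encode(self, x, layer_mask=None):
        # x: (batch, n_layers, d_model) activations collected from every layer.
        if layer_mask is not None:
            x = x * layer_mask[None, :, None]                  # drop masked layers
        z = torch.einsum("bld,dr->blr", x, self.enc_in)        # project onto input basis
        z = torch.einsum("blr,lr->br", z, self.enc_layer)      # weight and sum over layers
        return F.relu(torch.einsum("br,jr->bj", z, self.enc_latent) + self.b_enc)

    def decode(self, h):
        # One shared latent code reconstructs every layer through the decoder factors.
        z = torch.einsum("bj,jr->br", h, self.dec_latent)
        return torch.einsum("br,lr,dr->bld", z, self.dec_layer, self.dec_out) + self.b_dec

    def forward(self, x):
        # Stochastic layer masking: encode a randomly layer-dropped view but reconstruct
        # the full target, so latents that lean on one layer lose signal when it is dropped.
        layer_mask = None
        if self.training:
            layer_mask = (torch.rand(x.shape[1], device=x.device) > self.mask_prob).float()
        h = self.encode(x, layer_mask)
        return self.decode(h), h


# One training step under these assumptions: reconstruction MSE plus an L1 sparsity penalty.
model = FMXCrosscoder(n_layers=12, d_model=768, d_latent=4096, rank=64)
x = torch.randn(8, 12, 768)                    # stand-in residual-stream activations
x_hat, h = model(x)
loss = F.mse_loss(x_hat, x) + 1e-3 * h.abs().mean()
loss.backward()
```

The paper's actual loss, factorization variant, and masking schedule may differ; the point of the sketch is only that the per-layer weight slices are tied through a shared basis and that masking makes single-layer latents costly.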

Load-bearing premise

The functional coherence metric and the LLM-as-a-judge scores truly reflect genuine cross-layer semantic meaning rather than simply rewarding the factorized architecture.
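
The functional coherence metric is referenced throughout but not reproduced on this page. As a point of reference for the objection above, here is one plausible proxy: score each latent by the effective number of layers driving its pre-activation, so a single-layer latent scores near 1/L and an evenly spread one near 1. The function name and the exponentiated-entropy choice are assumptions, not the paper's definition.

```python
# One plausible proxy for a functional-coherence score; illustrative only.
import numpy as np


def functional_coherence(per_layer_contrib: np.ndarray) -> np.ndarray:
    """per_layer_contrib: (n_latents, n_layers) mean |contribution| of each layer
    to each latent's pre-activation over an evaluation batch."""
    n_latents, n_layers = per_layer_contrib.shape
    p = per_layer_contrib / (per_layer_contrib.sum(axis=1, keepdims=True) + 1e-12)
    # Exponentiated entropy = effective number of contributing layers per latent.
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    effective_layers = np.exp(entropy)
    return effective_layers / n_layers   # ~1/L if one layer dominates, ~1 if spread evenly


# A layer-localized latent scores low; a genuinely cross-layer latent scores near 1.
contrib = np.array([[5.0, 0.1, 0.1, 0.1],     # driven almost entirely by layer 0
                    [1.0, 1.1, 0.9, 1.0]])    # draws evenly on all four layers
print(functional_coherence(contrib))           # ≈ [0.33, 1.00]
```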

What would settle it

An ablation that removes either the tensor factorization or the layer masking and finds that functional coherence and probing F1 gains disappear, or a human annotation study in which the new latents do not receive higher semantic coherence ratings than those from standard crosscoders.
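
Probing F1 is the headline quantitative metric in that comparison. A minimal sketch of how such a probe might be scored, assuming a linear probe fit on latent activations against a binary concept label; the paper's actual probing protocol, datasets, and probe are not shown on this page.

```python
# Sketch of a probing-F1 evaluation on crosscoder latent activations (assumptions noted above).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def probing_f1(latent_acts: np.ndarray, concept_labels: np.ndarray, seed: int = 0) -> float:
    """latent_acts: (n_tokens, d_latent) activations; concept_labels: (n_tokens,) binary labels."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        latent_acts, concept_labels, test_size=0.2, random_state=seed, stratify=concept_labels)
    probe = LogisticRegression(max_iter=1000, C=1.0)
    probe.fit(X_tr, y_tr)
    return f1_score(y_te, probe.predict(X_te))


# Synthetic stand-in data: latents that partially separate the concept.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
acts = rng.normal(size=(2000, 64)) + 0.8 * labels[:, None] * rng.normal(size=(1, 64))
print(f"probing F1: {probing_f1(acts, labels):.3f}")
```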

Figures

Figures reproduced from arXiv: 2605.09438 by Andreas D. Demou, James Oldfield, Mihalis A. Nicolaou, Panagiotis Koromilas, Yannis Panagakis.

Figure 1. Standard crosscoders (top) parameterize the encoder and decoder as L independent per-layer matrices and recover latents whose functional dependence collapses onto a few layers, capturing surface-level patterns. fmxcoders (bottom) replace these with low-rank tensor factorizations that share a cross-layer basis across all latents, and add stochastic layer masking as a denoising regularizer along the layer axis…

Figure 2. Norm coherence c_n (blue) and functional coherence c_f (green) latent distributions for crosscoders (top row) and fmxcoder (bottom row) variants trained on Pythia-410M. The average value and standard deviation of each coherence distribution are also shown in red.

Figure 3. Heat maps for MSE, mean F1, and mean functional coherence.

Figure 4. Similar to Figure 3, for CP.
Original abstract

Many features in pretrained Transformers span multiple layers: they emerge through stages of inference, persist in the residual stream, or are built jointly by parallel MLPs. Crosscoders (namely, sparse dictionaries trained jointly across layers) aim to recover these cross-layer features in a single shared latent space. We show that standard crosscoders largely fail at this purpose. Although their decoder weight norms spread evenly across layers, a functional coherence metric we introduce reveals that each latent's activation is effectively driven by only one or two layers on average. While functionally coherent latents act as human-interpretable concept detectors (e.g., US states and cities), the layer-localized latents that crosscoders predominantly learn collapse onto surface-level patterns such as digit detectors. We trace this failure to two structural limitations: unconstrained cross-layer parameterization and unregularized cross-layer dependence. We address both by introducing fmxcoders, which (i) replace the encoder and decoder with low-rank tensor factorizations that draw every latent's per-layer weights from a shared cross-layer basis, and (ii) apply stochastic layer masking, a denoising regularizer along the layer axis that penalizes latents whose contribution collapses when a single layer is masked. Across GPT2-Small, Pythia-410M, Pythia-1.4B, and Gemma2-2B, fmxcoders lift mean probing F1 by 10-30 points, surpassing per-layer SAE baselines that standard crosscoders fail to reach, reduce reconstruction MSE by 25-50%, and roughly double mean functional coherence. An LLM-as-a-judge evaluation further shows that fmxcoders recover 3-13$\times$ more semantically coherent latents than standard crosscoders across all four base LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that standard crosscoders for sparse dictionary learning across transformer layers largely fail to recover truly cross-layer features, as their latents are predominantly driven by only one or two layers despite even decoder weight norms; it introduces fmxcoders using low-rank tensor factorizations for shared cross-layer bases in encoder and decoder plus stochastic layer masking as a regularizer, and reports that this yields 10-30 point gains in mean probing F1 (surpassing per-layer SAE baselines), 25-50% lower reconstruction MSE, roughly doubled mean functional coherence, and 3-13× more semantically coherent latents (via LLM-as-a-judge) across GPT2-Small, Pythia-410M, Pythia-1.4B, and Gemma2-2B.

Significance. If the newly introduced functional coherence metric and LLM judge evaluations prove to be architecture-agnostic measures of genuine cross-layer semantics, the work would meaningfully advance mechanistic interpretability by providing a practical method to extract features that emerge or persist across layers rather than collapsing to surface patterns. The multi-model scale of the evaluation and direct comparison against both standard crosscoders and per-layer SAEs are clear strengths; the introduction of a denoising regularizer along the layer dimension is a technically clean idea.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (functional coherence metric definition): The metric is defined to penalize latents whose activations are driven by few layers on average. Because fmxcoders explicitly optimize against single-layer collapse via stochastic masking, reported doublings of mean functional coherence risk being partly tautological rather than evidence of superior feature discovery; an ablation isolating the masking term from the factorization (or an external validation set of known cross-layer concepts) is needed to establish that the metric adds independent information.
  2. [Abstract and §4] Abstract and §4 (LLM-as-a-judge evaluation): The claim of recovering 3-13× more semantically coherent latents rests on an LLM judge whose prompts and scoring rubric are not shown to be invariant to the structural differences (low-rank shared bases) introduced by fmxcoders. If the judge implicitly rewards more factorized or less surface-level activations, the multiplier cannot be attributed cleanly to better cross-layer discovery; a blinded human evaluation on a held-out subset or comparison against an architecture-agnostic coherence probe would strengthen the result.
  3. [§5] §5 (experimental results): The reported probing F1 and MSE gains are presented without error bars, standard deviations across random seeds, or statistical significance tests. Given that the central empirical claim is consistent improvement across four models and multiple metrics, the absence of variability estimates makes it difficult to assess whether the 10-30 point F1 lift and 25-50% MSE reduction are robust or sensitive to hyperparameter choices such as factorization rank.
minor comments (3)
  1. [Methods] The paper should report the chosen factorization rank for each model and any sensitivity analysis, as this is listed among the free parameters and directly affects the low-rank assumption.
  2. [Figures] Figure captions and axis labels in the results section would benefit from explicit mention of the number of latents and the exact masking probability used, to aid reproducibility.
  3. [§5] A brief discussion of how the per-layer SAE baselines were trained (same sparsity, same dictionary size) would clarify that the comparison is fair.
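
For reference on the judge-bias concern in major comment 2 above, here is a minimal sketch of what an LLM-as-a-judge coherence count could look like. The query_judge callable, the prompt wording, and the YES/NO rubric are hypothetical stand-ins; the paper's actual prompts and scoring rubric are not shown on this page.

```python
# Sketch of an LLM-as-a-judge coherence count; `query_judge` is a hypothetical stand-in
# for whatever chat model is used, and the prompt/rubric are illustrative assumptions.
from typing import Callable, Dict, Sequence


def judge_latent(top_examples: Sequence[str], query_judge: Callable[[str], str]) -> bool:
    """Ask the judge whether one latent's top activating snippets share a coherent concept."""
    prompt = (
        "Below are text snippets on which one dictionary latent activates most strongly.\n"
        "Do they share a single, clearly describable concept? Answer YES or NO.\n\n"
        + "\n".join(f"- {ex}" for ex in top_examples)
    )
    return query_judge(prompt).strip().upper().startswith("YES")


def coherent_fraction(latents_to_examples: Dict[int, Sequence[str]],
                      query_judge: Callable[[str], str]) -> float:
    """Fraction of latents judged semantically coherent; the 3-13x claim compares this
    count for fmxcoders against standard crosscoders on the same base model."""
    verdicts = [judge_latent(exs, query_judge) for exs in latents_to_examples.values()]
    return sum(verdicts) / max(len(verdicts), 1)
```

A blinded human rater could be substituted for query_judge on a held-out subset, which is essentially the check the referee asks for.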

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We have carefully reviewed each major comment and provide detailed point-by-point responses below. We outline the specific revisions we will make to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (functional coherence metric definition): The metric is defined to penalize latents whose activations are driven by few layers on average. Because fmxcoders explicitly optimize against single-layer collapse via stochastic masking, reported doublings of mean functional coherence risk being partly tautological rather than evidence of superior feature discovery; an ablation isolating the masking term from the factorization (or an external validation set of known cross-layer concepts) is needed to establish that the metric adds independent information.

    Authors: We acknowledge the close relationship between the stochastic layer masking regularizer and the functional coherence metric, as the former is explicitly designed to discourage single-layer collapse. However, the metric itself is an independent, post-hoc measure of the average number of layers that meaningfully contribute to each latent's activations, and it is not part of the training loss. To demonstrate that the reported gains reflect genuine improvements in cross-layer feature discovery rather than a tautology, we will add an ablation study in the revised manuscript. This ablation will train factorized models both with and without the masking term and report the resulting differences in functional coherence, probing F1, MSE, and the number of coherent latents. We will also include concrete examples of known cross-layer concepts (e.g., entity tracking and syntactic features) recovered by fmxcoders to provide external validation of the metric. revision: yes

  2. Referee: [Abstract and §4] Abstract and §4 (LLM-as-a-judge evaluation): The claim of recovering 3-13× more semantically coherent latents rests on an LLM judge whose prompts and scoring rubric are not shown to be invariant to the structural differences (low-rank shared bases) introduced by fmxcoders. If the judge implicitly rewards more factorized or less surface-level activations, the multiplier cannot be attributed cleanly to better cross-layer discovery; a blinded human evaluation on a held-out subset or comparison against an architecture-agnostic coherence probe would strengthen the result.

    Authors: We agree that transparency and validation of the LLM-as-a-judge protocol are essential to rule out bias from the low-rank factorization. In the revised manuscript, we will include the complete prompts and scoring rubric in the appendix. To further strengthen the claim, we will add a blinded human evaluation on a held-out subset of 100 latents per model, where human evaluators (unaware of the source method) rate semantic coherence based on top activating examples and tokens. We will report inter-rater agreement and correlation with the LLM scores. While a full-scale human study across all models and latents is not feasible, this targeted evaluation will provide independent corroboration that the 3-13× increase corresponds to improved cross-layer semantic coherence. revision: partial

  3. Referee: [§5] §5 (experimental results): The reported probing F1 and MSE gains are presented without error bars, standard deviations across random seeds, or statistical significance tests. Given that the central empirical claim is consistent improvement across four models and multiple metrics, the absence of variability estimates makes it difficult to assess whether the 10-30 point F1 lift and 25-50% MSE reduction are robust or sensitive to hyperparameter choices such as factorization rank.

    Authors: We thank the referee for highlighting this important presentation issue. In the revised §5, we will report all key metrics (probing F1, MSE, functional coherence) with error bars indicating standard deviation across at least three independent random seeds per configuration. We will also include statistical significance tests (paired t-tests with p-values) for the primary comparisons against standard crosscoders and per-layer SAEs. Additionally, we will add a sensitivity analysis varying the factorization rank and reporting its effect on performance to address robustness to hyperparameter choices. revision: yes
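
On the statistics point, the promised reporting is straightforward; a sketch of a paired comparison across seeds is below. The numbers are placeholders, not results from the paper.

```python
# Sketch of the seed-variability reporting the rebuttal promises: paired t-tests on
# per-seed metrics for fmxcoders vs. standard crosscoders. Values are placeholders.
import numpy as np
from scipy import stats

# Hypothetical probing-F1 scores from three matched seeds per method (same data splits).
f1_crosscoder = np.array([0.58, 0.61, 0.60])
f1_fmxcoder = np.array([0.79, 0.82, 0.80])

t_stat, p_value = stats.ttest_rel(f1_fmxcoder, f1_crosscoder)
gains = f1_fmxcoder - f1_crosscoder
print(f"mean gain: {gains.mean():+.3f} ± {gains.std(ddof=1):.3f}, "
      f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```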

Circularity Check

0 steps flagged

No significant circularity in fmxcoders' empirical claims

Full rationale

The paper introduces a new architecture (low-rank factorized encoder/decoder plus stochastic layer masking) and a new functional coherence metric to diagnose standard crosscoders. Reported gains in probing F1 (10-30 points) and reconstruction MSE (25-50%) are measured on independent tasks across multiple models and are not reducible to quantities defined inside the training objective or metric by construction. The coherence improvement is expected from the explicit regularizer but does not render the other metrics tautological. No self-citations, uniqueness theorems, ansatzes smuggled via prior work, or renamings of known results appear as load-bearing steps. This is the normal non-circular outcome for an empirical architecture paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions from sparse dictionary learning plus two new architectural choices; no new physical entities are postulated.

free parameters (2)
  • factorization rank
    The shared basis dimension in the low-rank tensor factorization is a hyperparameter that must be chosen.
  • masking probability
    The probability used for stochastic layer masking is a design choice that affects regularization strength.
axioms (1)
  • domain assumption: Features in pretrained transformers can be usefully represented as sparse activations of dictionary elements that are consistent across layers.
    This is the core premise that justifies training any crosscoder.

pith-pipeline@v0.9.0 · 5647 in / 1325 out tokens · 59962 ms · 2026-05-12T04:49:41.685858+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 6 internal anchors

  1. [1]

    Toy models of superposition,

    N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah, “Toy models of superposition,”Transformer Circuits Thread,

  2. [2]

    [Online]. Available: https://transformer-circuits.pub/2022/toy_model/index.html

  3. [3]

    Open Problems in Mechanistic Interpretability

    L. Sharkey, B. Chughtai, J. Batson, J. Lindsey, J. Wu, L. Bushnaq, N. Goldowsky-Dill, S. Heimer- sheim, A. Ortega, J. Bloom, S. Biderman, A. Garriga-Alonso, A. Conmy, N. Nanda, J. Rumbe- low, M. Wattenberg, N. Schoots, J. Miller, E. J. Michaud, S. Casper, M. Tegmark, W. Saunders, D. Bau, E. Todd, A. Geiger, M. Geva, J. Hoogland, D. Murfet, and T. McGrath,...

  4. [4]

    Towards monosemanticity: Decomposing language models with dictionary learning,

    T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y . Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah, “Towards monosemanticity: Decomposing language models with dictiona...

  5. [5]

    Sparse autoencoders find highly interpretable features in language models,

    H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey, “Sparse autoencoders find highly interpretable features in language models,” inInternational Conference on Learning Representations (ICLR), 2024

  6. [6]

    Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet,

    A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, H. Cunningham, N. L. Turner, C. McDougall, M. MacDiarmid, C. D. Freeman, T. R. Sumers, E. Rees, J. Batson, A. Jermyn, S. Carter, C. Olah, and T. Henighan, “Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet,”Transfo...

  7. [7]

    Scaling and evaluating sparse autoencoders,

    L. Gao, T. Dupré la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, and J. Wu, “Scaling and evaluating sparse autoencoders,” inInternational Conference on Learning Representations (ICLR), 2025, oral presentation

  8. [8]

    Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders,

    S. Rajamanoharan, T. Lieberum, N. Sonnerat, A. Conmy, V . Varma, J. Kramár, and N. Nanda, “Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders,”arXiv preprint arXiv:2407.14435, 2024

  9. [9]

    BatchTopK sparse autoencoders,

    B. Bussmann, P. Leask, and N. Nanda, “BatchTopK sparse autoencoders,” inNeurIPS 2024 Workshop on Scientific Methods for Understanding Neural Networks, 2024

  10. [10]

    Polysae: Modeling feature interactions in sparse autoencoders via polynomial decoding,

    P. Koromilas, A. D. Demou, J. Oldfield, Y. Panagakis, and M. Nicolaou, “Polysae: Modeling feature interactions in sparse autoencoders via polynomial decoding,” arXiv preprint arXiv:2602.01322, 2026

  11. [11]

    Knowledge neurons in pretrained transformers,

    D. Dai, L. Dong, Y . Hao, Z. Sui, B. Chang, and F. Wei, “Knowledge neurons in pretrained transformers,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 8493–8502

  12. [12]

    Remarkable robustness of LLMs: Stages of inference?

    V . Lad, J. H. Lee, W. Gurnee, and M. Tegmark, “Remarkable robustness of LLMs: Stages of inference?” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  13. [13]

    [Online]. Available: https://openreview.net/forum?id=Wxh5Xz7NpJ

  14. [14]

    Transformer feed-forward layers build pre- dictions by promoting concepts in the vocabulary space,

    M. Geva, A. Caciularu, K. Wang, and Y . Goldberg, “Transformer feed-forward layers build pre- dictions by promoting concepts in the vocabulary space,” inProceedings of the 2022 conference on empirical methods in natural language processing, 2022, pp. 30–45

  15. [15]

    Universal neurons in GPT2 language models,

    W. Gurnee, T. Horsley, Z. C. Guo, T. R. Kheirkhah, Q. Sun, W. Hathaway, N. Nanda, and D. Bertsimas, “Universal neurons in gpt2 language models,”arXiv preprint arXiv:2401.12181, 2024

  16. [16]

    Residual stream analysis with multi-layer SAEs,

    T. Lawson, L. Farnik, C. Houghton, and L. Aitchison, “Residual stream analysis with multi-layer SAEs,” in International Conference on Learning Representations (ICLR), 2025

  17. [17]

    Mechanistic permutability: Match features across layers,

    N. Balagansky, I. Maksimov, and D. Gavrilov, “Mechanistic permutability: Match features across layers,” inInternational Conference on Learning Representations (ICLR), 2025

  18. [18]

    Sparse crosscoders for cross-layer features and model diffing,

    J. Lindsey, A. Templeton, J. Marcus, T. Conerly, J. Batson, and C. Olah, “Sparse crosscoders for cross-layer features and model diffing,”Transformer Circuits Thread, 2024. [Online]. Available: https://transformer-circuits.pub/2024/crosscoders/index.html

  19. [19]

    Circuit tracing: Revealing computational graphs in language models,

    E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N. L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. B. Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, and J. Batson, “Circuit tracing: Revealing computati...

  20. [20]

    Overcoming sparsity artifacts in crosscoders to interpret chat-tuning,

    C. Dumas, J. Minder, C. Juang, B. Chughtai, and N. Nanda, “Overcoming sparsity artifacts in crosscoders to interpret chat-tuning,” inMechanistic Interpretability Workshop at NeurIPS 2025, 2025

  21. [21]

    Tensor methods in computer vision and deep learning,

    Y . Panagakis, J. Kossaifi, G. G. Chrysos, J. Oldfield, M. A. Nicolaou, A. Anandkumar, and S. Zafeiriou, “Tensor methods in computer vision and deep learning,”Proceedings of the IEEE, vol. 109, no. 5, pp. 863–890, 2021

  22. [22]

    Understanding polysemanticity in neural networks through coding theory,

    S. C. Marshall and J. H. Kirchner, “Understanding polysemanticity in neural networks through coding theory,”arXiv preprint arXiv:2401.17975, 2024

  23. [23]

    Dropout: A simple way to prevent neural networks from overfitting,

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,”Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014

  24. [24]

    Towards automated circuit discovery for mechanistic interpretability,

    A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso, “Towards automated circuit discovery for mechanistic interpretability,”Advances in Neural Information Processing Systems, vol. 36, pp. 16 318–16 352, 2023

  25. [25]

    How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model,

    M. Hanna, O. Liu, and A. Variengien, “How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model,”Advances in Neural Information Processing Systems, vol. 36, pp. 76 033–76 060, 2023

  26. [26]

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    S. Marks, C. Rager, E. J. Michaud, Y . Belinkov, D. Bau, and A. Mueller, “Sparse feature circuits: Discovering and editing interpretable causal graphs in language models,”arXiv preprint arXiv:2403.19647, 2024

  27. [27]

    Internal states before wait modulate reasoning patterns,

    D. Troitskii, K. Pal, C. Wendler, C. S. McDougall, and N. Nanda, “Internal states before wait modulate reasoning patterns,”Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, 2025

  28. [28]

    Foundation Models for Discovery and Exploration in Chemical Space

    A. Wadell, A. Bhutani, V . Azumah, A. R. Ellis-Mohr, C. Kelly, H. Zhao, A. K. Nayak, K. Hegazy, A. Brace, and H. Lin, “Foundation models for discovery and exploration in chemical space,” arXiv preprint arXiv:2510.18900, 2025

  29. [29]

    Towards understanding distilled reasoning models: A representational approach,

    D. D. Baek and M. Tegmark, “Towards understanding distilled reasoning models: A representational approach,” arXiv preprint arXiv:2503.03730, 2025

  30. [30]

    Evolution of concepts in language model pre-training,

    X. Ge, W. Shu, J. Wu, Y . Zhou, Z. He, and X. Qiu, “Evolution of concepts in language model pre-training,” inThe Fourteenth International Conference on Learning Representations, 2026

  31. [31]

    Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

    D. Bayazit, A. Mueller, and A. Bosselut, “Crosscoding through time: Tracking emergence & consolidation of linguistic representations throughout LLM pretraining,”arXiv preprint arXiv:2509.05291, 2025

  32. [32]

    The expression of a tensor or a polyadic as a sum of products,

    F. L. Hitchcock, “The expression of a tensor or a polyadic as a sum of products,”Journal of Mathematics and Physics, vol. 6, no. 1-4, pp. 164–189, 1927

  33. [33]

    Tensor Ring Decomposition

    Q. Zhao, G. Zhou, S. Xie, L. Zhang, and A. Cichocki, “Tensor Ring Decomposition,” arXiv preprint arXiv:1606.05535, 2016

  34. [34]

    Multilinear mixture of experts: Scalable expert specialization through factorization,

    J. Oldfield, M. Georgopoulos, G. G. Chrysos, C. Tzelepis, Y . Panagakis, M. A. Nicolaou, J. Deng, and I. Patras, “Multilinear mixture of experts: Scalable expert specialization through factorization,”Advances in Neural Information Processing Systems, vol. 37, pp. 53 022–53 063, 2024

  35. [35]

    Towards interpretability without sacrifice: Faithful dense layer decomposition with mixture of decoders,

    J. Oldfield, S. Im, S. Li, M. A. Nicolaou, I. Patras, and G. G. Chrysos, “Towards interpretability without sacrifice: Faithful dense layer decomposition with mixture of decoders,”arXiv preprint arXiv:2505.21364, 2025

  36. [36]

    Extracting and composing robust features with denoising autoencoders,

    P. Vincent, H. Larochelle, Y . Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” inProceedings of the 25th International Conference on Machine Learning (ICML). ACM, 2008, pp. 1096–1103

  37. [37]

    Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,

    P. Vincent, H. Larochelle, I. Lajoie, Y . Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010

  38. [38]

    BERT: Pre-training of deep bidirec- tional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirec- tional transformers for language understanding,” inProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019, pp. 4171–4186

  39. [39]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y . Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009

  40. [40]

    Matching pursuits with time-frequency dictionaries,

    S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,”IEEE Trans- actions on signal processing, vol. 41, no. 12, pp. 3397–3415, 1993

  41. [41]

    Multilinear multitask learning,

    B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil, “Multilinear multitask learning,” inInternational Conference on Machine Learning. PMLR, 2013, pp. 1444–1452

  42. [42]

    Deep multi-task representation learning: A tensor factorisation approach,

    Y . Yang and T. Hospedales, “Deep multi-task representation learning: A tensor factorisation approach,”arXiv preprint arXiv:1605.06391, 2016

  43. [43]

    Training with noise is equivalent to Tikhonov regularization,

    C. M. Bishop, “Training with noise is equivalent to Tikhonov regularization,”Neural Computa- tion, vol. 7, no. 1, pp. 108–116, 1995

  44. [44]

    SAELens,

    J. Bloom, C. Tigges, A. Duong, and D. Chanin, “SAELens,” https://github.com/jbloomAus/ SAELens, 2024, gitHub repository

  45. [45]

    Language models are unsu- pervised multitask learners,

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsu- pervised multitask learners,” OpenAI, Tech. Rep., 2019. [Online]. Available: https://cdn.openai. com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

  46. [46]

    Pythia: A suite for analyzing large language models across training and scaling,

    S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. Van Der Wal, “Pythia: A suite for analyzing large language models across training and scaling,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machin...

  47. [47]

    Gemma 2: Improving open language models at a practical size,

    Gemma Team, “Gemma 2: Improving open language models at a practical size,” Google DeepMind, Tech. Rep., 2024. [Online]. Available: https://storage.googleapis.com/ deepmind-media/gemma/gemma-2-report.pdf

  48. [48]

    OpenWebText corpus,

    A. Gokaslan, V . Cohen, E. Pavlick, and S. Tellex, “OpenWebText corpus,” Zenodo, 2019. [Online]. Available: https://zenodo.org/records/3834942

  49. [49]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy, “The Pile: An 800GB dataset of diverse text for language modeling,” arXiv preprint arXiv:2101.00027, 2021

  50. [50]

    SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability,

    A. Karvonen, C. Rager, J. Lin, C. Tigges, J. I. Bloom, D. Chanin, Y .-T. Lau, E. Farrell, C. S. McDougall, K. Ayonrinde, D. Till, M. Wearden, A. Conmy, S. Marks, and N. Nanda, “SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability,” inProceedings of the 42nd International Conference on Machine Learning, ser. Procee...

  51. [51]

    Bias in bios: A case study of semantic representation bias in a high-stakes setting,

    M. De-Arteaga, A. Romanov, H. Wallach, J. Chayes, C. Borgs, A. Chouldechova, S. C. Geyik, K. Kenthapadi, and A. T. Kalai, “Bias in bios: A case study of semantic representation bias in a high-stakes setting,” inProceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19). ACM, 2019, pp. 120–128

  52. [52]

    Character-level convolutional networks for text classification,

    X. Zhang, J. Zhao, and Y . LeCun, “Character-level convolutional networks for text classification,” inAdvances in Neural Information Processing Systems, vol. 28, 2015. [Online]. Available: https: //proceedings.neurips.cc/paper/2015/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf

  53. [53]

    Europarl: A parallel corpus for statistical machine translation,

    P. Koehn, “Europarl: A parallel corpus for statistical machine translation,” inProceedings of Machine Translation Summit X: Papers, Phuket, Thailand, Sep. 13-15 2005, pp. 79–86

  54. [54]

    Github code dataset,

    CodeParrot, “Github code dataset,” https://huggingface.co/datasets/codeparrot/github-code, 2022

  55. [55]

    Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

    Y . Hou, J. Li, Z. He, A. Yan, X. Chen, and J. McAuley, “Bridging language and items for retrieval and recommendation,”arXiv preprint arXiv:2403.03952, 2024

  56. [56]

    Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis,

    R. A. Harshman, “Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis,” UCLA working papers in phonetics, vol. 16, no. 1, p. 84, 1970