Features have life history. And we should care

Philipp Stecher; Reinhard Kahle; Sandro Radovanović; Vlasta Sikimić

arxiv: 2605.18789 · v1 · pith:PIMRL5A3new · submitted 2026-05-07 · 🧬 q-bio.NC · cs.AI

Pith reviewed 2026-05-20 23:28 UTC · model grok-4.3

keywords: feature life history · representational backbone · carrier scaffold · training dynamics · sparse features · Pythia models · two-phase training

The pith

Language models form a stable scaffold of about 50 sparse features early in training that organizes the rest of their representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Features in language models emerge, persist, and die as training proceeds, yet their histories show a persistent backbone. In Pythia-160M and -410M this backbone appears as roughly 50 sparse features with stable life histories around which the model's structure organizes. The scaffold assembles in the first 1 percent of training, proves load-bearing under joint ablation, is predictable from initial firing patterns alone, and later recruits 64 percent of active features into its hierarchy. These observations support a two-phase account in which early selection sets the scaffold while the remaining training calibrates geometry around it.

Core claim

The paper identifies a carrier scaffold of approximately 50 sparse features with stable life histories that functions as the persistent representational backbone in Pythia models. This scaffold assembles early, is load-bearing under joint cross-layer ablation, has its membership predictable from training-onset firing patterns before geometry settles, and seeds later development by recruiting most active features into the scaffold hierarchy by the end of training.

What carries the argument

The carrier scaffold: a small population of sparse features with stable life histories identified through joint cross-layer ablation and life-history tracking that serves as the organizing backbone for the model's representations.

If this is right

The scaffold is largely fixed after the first 1 percent of training, so later steps mainly adjust geometry around an already chosen substrate.
Joint ablation of scaffold carriers reveals outsized impact on model behavior compared with non-scaffold populations of equal size.
Onset firing patterns alone distinguish future carriers from non-carriers in four out of five cases.
By training's end the scaffold has incorporated 64 percent of all active features into its hierarchy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the scaffold is set so early, interventions confined to the first percent of training could steer final model organization more efficiently than later adjustments.
The two-phase pattern may extend to other model families and scales, implying that representational structure is similarly fixed early across architectures.
Life-history tracking of features could serve as a diagnostic for when a model has completed its structural selection phase.

Load-bearing premise

Joint cross-layer ablation correctly measures load-bearing importance without being confounded by feature interactions or correlations.

What would settle it

Ablating the identified scaffold features produces no larger performance drop than ablating a count-matched set of non-scaffold features.
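
The comparison this test calls for can be sketched as follows. This is a minimal illustration, not the paper's procedure: `loss_with_ablation` is a hypothetical callable standing in for a real model evaluation with a feature set zeroed out, and `toy_loss` is an invented stand-in in which features 0 to 4 matter and the rest barely do. The sketch draws random non-scaffold sets of the same size and reports how often a control set hurts at least as much as the scaffold.

```python
import random


def ablation_gap(loss_with_ablation, scaffold, pool, n_controls=200, seed=0):
    """Compare ablating `scaffold` against count-matched random control sets.

    loss_with_ablation: hypothetical callable, set of feature ids -> loss.
    pool: feature ids from which non-scaffold controls are drawn.
    Returns (scaffold_loss, mean_control_loss, fraction of controls whose
    loss is at least as large as the scaffold loss).
    """
    rng = random.Random(seed)
    scaffold = set(scaffold)
    candidates = [f for f in pool if f not in scaffold]
    scaffold_loss = loss_with_ablation(scaffold)
    control_losses = []
    for _ in range(n_controls):
        # Each control has exactly as many features as the scaffold.
        control = set(rng.sample(candidates, len(scaffold)))
        control_losses.append(loss_with_ablation(control))
    mean_control = sum(control_losses) / len(control_losses)
    frac_at_least = sum(l >= scaffold_loss for l in control_losses) / len(control_losses)
    return scaffold_loss, mean_control, frac_at_least


def toy_loss(ablated):
    """Invented stand-in for a model: features 0..4 are important, others are not."""
    return 2.0 + sum(1.0 if f < 5 else 0.01 for f in ablated)


scaffold_loss, mean_control, frac = ablation_gap(toy_loss, range(5), range(100))
print(scaffold_loss, round(mean_control, 3), frac)
```

Under this framing, the claim would be in trouble if the fraction of control sets matching or exceeding the scaffold's loss were not close to zero. Count matching alone does not rule out the correlation confound named in the load-bearing premise.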

Figures

Figures reproduced from arXiv: 2605.18789 by Philipp Stecher, Reinhard Kahle, Sandro Radovanović, Vlasta Sikimić.

**Figure 2.** Joint cross-layer ablation quantifies the scaffold’s load.

**Figure 3.** Function settles early; direction calibrates late; the gap makes the scaffold legible at step ...

**Figure 4.** The scaffold organises two-thirds of the mature network through hierarchically scaffolded ...

**Figure 5.** Kaplan–Meier survival by CI tier at Pythia-160M (left) and Pythia-410M (right), with ...

**Figure 6.** Firing breadth (fraction of validation tokens on which a feature activates) by CI tier at step ...

**Figure 7.** Pythia-410M per-layer cross-checks for the unembedding-cosine and within-layer DAG ...
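
Figure 5 reports Kaplan–Meier survival curves for features. As a reading aid, here is a minimal sketch of the standard Kaplan–Meier estimator applied to feature lifetimes. The lifetimes below are invented for illustration, and treating a feature that is still alive at the last checkpoint as right-censored is an assumption of this sketch, not a detail taken from the paper.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate.

    durations: how long each feature was tracked (e.g. in checkpoints).
    observed: True if the feature died at that time, False if it was still
              alive when tracking stopped (right-censored).
    Returns a list of (time, survival_probability), one entry per time at
    which at least one death occurred.
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = 0
        leaving = 0
        # Group all features whose tracking ends at time t.
        while i < len(events) and events[i][0] == t:
            deaths += events[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both dead and censored features leave the risk set after time t.
        at_risk -= leaving
    return curve


# Invented lifetimes for six features; three are censored.
durations = [1, 2, 2, 3, 5, 5]
observed = [True, True, False, True, False, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 4))
```

Computing one such curve per tier and plotting them together would give a figure of the kind described in the caption.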

Original abstract

Features in language models have life history: they emerge, persist, and die during training, yet the importance of that history remains largely unexplored. We find evidence of a persistent representational backbone, which we identify in Pythia-160M and -410M as the carrier scaffold: ~50 sparse features with stable life histories, around which the model's representational structure organises. It has four properties. (i) It assembles early: features emerge, die, and reorganise ~40× faster in the first 1% of training than afterwards, and the scaffold is already largely fixed by then. (ii) It is load-bearing: joint cross-layer ablation identifies the carriers as far more load-bearing than any count-matched non-scaffold population, a gap invisible to per-firing single-feature methods. (iii) Function precedes direction: which features will become carriers is already predictable from training-onset firing patterns alone, correctly distinguishing future carriers from non-carriers in 4 of 5 cases, before the geometry has settled. (iv) It seeds subsequent development: by the end of training, scaffold carriers have recruited 64% of all active features into the scaffold hierarchy. Life history is consistent with a two-phase account of training: selection appears to largely determine the scaffold in the first 1%; the remaining 99% appears to calibrate geometry around a substrate already set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper tracks sparse feature emergence in Pythia models and flags a small early group that appears to anchor later representations, with the ablation results needing checks for correlations.

The main point is that the authors follow how sparse features come and go during training and single out roughly 50 that stay stable and seem to shape the rest of the model's structure in Pythia-160M and -410M. They call this the carrier scaffold and list four properties: it forms early, with rapid reorganization in the first percent of training; joint cross-layer ablation shows it matters more than matched non-scaffold sets; onset firing patterns already predict which features will end up as carriers; and it pulls in most other active features by the end. This leads to their two-phase view of training, where selection happens fast and the rest is mostly calibration around that base.

The observations line up across the two model sizes, and the timing data plus the single-versus-joint ablation contrast are concrete enough to be useful. The work is honest about resting on empirical patterns rather than new equations.

The softer part is the load-bearing claim. Joint ablation can produce larger drops simply because the carriers are correlated with each other or share downstream effects, so the gap might reflect multicollinearity more than a true organizing backbone. Without variance inflation checks or orthogonalized tests, the structural interpretation stays provisional. The 4-out-of-5 prediction from onset patterns is worth following up but needs clearer reporting on how the classifier was trained and validated.

This is for people working on sparse autoencoders and mechanistic interpretability who want to think about training dynamics instead of just final representations. Readers who run feature analyses on checkpoints will get practical ideas from the life-history framing. It deserves a serious referee because the experiments are on real models and the claims are falsifiable with more controls. I would send it for review and ask the authors to add robustness checks on the ablation results.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes the life histories of sparse features during training in language models, identifying a persistent 'carrier scaffold' of ~50 features in Pythia-160M and -410M that organizes representational structure. It reports four properties: (i) early assembly, with features emerging, dying, and reorganizing ~40× faster in the first 1% of training than afterwards, and the scaffold largely fixed by then; (ii) load-bearing status, with joint cross-layer ablation yielding larger performance gaps than count-matched non-scaffold sets (a gap invisible to single-feature ablations); (iii) function preceding direction, with carrier status predictable from onset firing patterns alone (correct in 4 of 5 cases); (iv) seeding of later development, with the scaffold recruiting 64% of active features by the end of training. The authors propose a two-phase training account with early selection and later calibration.

Significance. If the ablation results and predictions hold after addressing potential confounds, the work would usefully draw attention to temporal feature dynamics that are often ignored in interpretability studies, potentially informing staged training protocols or early intervention strategies.

major comments (2)

[Abstract, property (ii)] The load-bearing claim for the carrier scaffold rests on joint cross-layer ablation producing a performance gap versus count-matched controls. Without explicit checks for multicollinearity (e.g., variance inflation factors) or orthogonalized interventions, the gap could arise from feature correlations or shared downstream effects rather than an organizing backbone role.
[Abstract, property (iii)] The 4-of-5 accuracy claim that onset firing patterns predict future carriers requires details on the classifier, cross-validation procedure, and comparison to chance or non-onset baselines to confirm it is not inflated by correlations in the early data.
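
The multicollinearity check named in the first major comment can be sketched as below. The variance inflation factor of column j is 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on all other columns plus an intercept. This is a minimal stdlib sketch under stated assumptions: the three activation columns are invented, and a real analysis would more likely use an established statistics library than the small least-squares solver written out here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def vif(columns):
    """Variance inflation factor for each column (equal-length lists of floats).

    Assumes no column is constant or an exact linear combination of the others.
    """
    n = len(columns[0])
    result = []
    for j, y in enumerate(columns):
        # Regressors: an intercept plus every column except column j.
        xs = [[1.0] * n] + [c for k, c in enumerate(columns) if k != j]
        gram = [[sum(p * q for p, q in zip(u, v)) for v in xs] for u in xs]
        rhs = [sum(p * q for p, q in zip(u, y)) for u in xs]
        beta = solve(gram, rhs)
        fitted = [sum(b * u[i] for b, u in zip(beta, xs)) for i in range(n)]
        mean_y = sum(y) / n
        ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
        ss_tot = sum((yi - mean_y) ** 2 for yi in y)
        r2 = 1.0 - ss_res / ss_tot
        result.append(1.0 / (1.0 - r2))
    return result


# Invented activations: feature_b nearly duplicates feature_a, feature_c does not.
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_b = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
feature_c = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]
vifs = vif([feature_a, feature_b, feature_c])
print([round(v, 2) for v in vifs])
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of strong collinearity. In this toy data the two near-duplicate columns land far above that range, while the unrelated column stays close to 1.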

minor comments (2)

[Abstract] The abstract supplies no methods details, error bars, or statistical tests, which hinders evaluation of the reported percentages and accuracies.
[Abstract] Define 'sparse features' and the exact selection criteria for the ~50 carriers more explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive suggestions, which help clarify the methodological foundations of our claims about the carrier scaffold. We respond to each major comment below and indicate the revisions we will make to address the concerns raised.

Referee: [Abstract, property (ii)] The load-bearing claim for the carrier scaffold rests on joint cross-layer ablation producing a performance gap versus count-matched controls. Without explicit checks for multicollinearity (e.g., variance inflation factors) or orthogonalized interventions, the gap could arise from feature correlations or shared downstream effects rather than an organizing backbone role.

Authors: We agree that multicollinearity among features could contribute to the observed performance gap and that additional controls would strengthen the load-bearing interpretation. In the revised manuscript we will add a variance inflation factor analysis computed on the joint activation matrix of scaffold features across layers. We will also report an orthogonalized ablation variant in which we first project activations onto the subspace orthogonal to the top principal components of the non-scaffold population before performing the joint ablation. Our existing controls already match on feature count, mean activation magnitude, and layer distribution; the gap remains large under these constraints. We will incorporate the new VIF and orthogonalized results to directly address the referee’s concern.
Revision: yes

Referee: [Abstract, property (iii)] The 4-of-5 accuracy claim that onset firing patterns predict future carriers requires details on the classifier, cross-validation procedure, and comparison to chance or non-onset baselines to confirm it is not inflated by correlations in the early data.

Authors: We will expand the Methods section to specify that a logistic regression classifier with L2 regularization was trained on per-feature onset firing rates (first 1% of training steps) to predict end-of-training carrier status. We employed 5-fold stratified cross-validation to preserve class balance and report a mean accuracy of 80% (4/5). This is compared against (i) a random-label baseline (50%) and (ii) a non-onset baseline using firing rates from the middle of training; onset firing alone yields significantly higher accuracy (permutation test, p < 0.01). A supplementary figure showing the ROC curve and feature-importance weights will be added. These details will be included in the revised version.
Revision: yes
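
The validation scheme described in the simulated response (stratified 5-fold cross-validation plus a permutation test against shuffled labels) can be sketched as follows. Two things here are assumptions rather than anything from the paper: the onset firing rates are invented, and a simple midpoint-of-class-means threshold rule is swapped in for the logistic regression the response mentions, to keep the sketch dependency-free.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Assign each index to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def threshold_classifier(train_x, train_y):
    """Midpoint-of-class-means rule; a stand-in for logistic regression."""
    mean1 = sum(x for x, y in zip(train_x, train_y) if y == 1) / train_y.count(1)
    mean0 = sum(x for x, y in zip(train_x, train_y) if y == 0) / train_y.count(0)
    cut = (mean0 + mean1) / 2
    sign = 1 if mean1 >= mean0 else -1
    return lambda x: 1 if sign * (x - cut) > 0 else 0


def cv_accuracy(xs, ys, k=5, seed=0):
    """Pooled held-out accuracy over k stratified folds."""
    correct = 0
    for fold in stratified_folds(ys, k, seed):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        predict = threshold_classifier([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(predict(xs[i]) == ys[i] for i in fold)
    return correct / len(xs)


def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Share of label shuffles whose CV accuracy matches or beats the real one."""
    rng = random.Random(seed)
    observed = cv_accuracy(xs, ys)
    hits = 0
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        hits += cv_accuracy(xs, shuffled) >= observed
    return observed, (hits + 1) / (n_perm + 1)


# Invented onset firing rates: 10 future carriers (1) and 10 non-carriers (0).
carriers = [0.62, 0.70, 0.75, 0.81, 0.66, 0.90, 0.73, 0.85, 0.68, 0.77]
others = [0.05, 0.12, 0.20, 0.08, 0.31, 0.15, 0.02, 0.25, 0.10, 0.18]
xs = carriers + others
ys = [1] * len(carriers) + [0] * len(others)
observed_acc, p_value = permutation_p_value(xs, ys)
print(observed_acc, p_value)
```

The toy data are cleanly separable, so the sketch says nothing about the reported 80% figure. It only shows the shape of the check: accuracy is always measured on held-out folds, and the shuffled-label runs estimate what chance looks like for this data size and class balance.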

Circularity Check

0 steps flagged

No significant circularity; claims are observational

The paper reports empirical observations of feature life histories during model training, identifying a carrier scaffold via patterns in emergence, persistence, ablation effects, and predictability from onset firing. No equations, derivations, or self-referential definitions appear in the provided text. Properties (i)-(iv) are presented as data-driven findings rather than quantities constructed from fitted parameters or prior self-citations within the same chain. The analysis relies on external benchmarks like cross-layer ablation and temporal prediction without reducing the central claims to tautological inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The analysis rests on domain assumptions about feature tracking and sparsity rather than new free parameters or invented physical entities.

axioms (2)

domain assumption: Individual features in language models can be tracked across training steps for emergence, persistence, and death.
Required to define life histories and the scaffold.
domain assumption: Sparse features extracted via standard methods form the right granularity for representational analysis.
Underlies identification of the ~50 carrier features.

invented entities (1)

carrier scaffold (no independent evidence)
purpose: Label for the persistent backbone of ~50 stable sparse features that organizes model representations.
New organizing concept introduced to describe the observed stable features and their properties.

pith-pipeline@v0.9.0 · 5820 in / 1429 out tokens · 61119 ms · 2026-05-20T23:28:04.048133+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: persistent representational backbone... carrier scaffold: ∼50 sparse features with stable life histories... two-phase account of training: selection... first 1%; ... calibrate geometry around a substrate already set

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: joint cross-layer ablation... load-bearing... Function precedes direction

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 7 internal anchors

[1] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L. Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Ol...

[2] Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.

[3] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, Usvsn Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conferen...

[4] Yang Xu, Yi Wang, Hengguan Huang, and Hao Wang. Tracking the feature dynamics in LLM training: A mechanistic study. arXiv preprint arXiv:2412.17626, 2024. URL https://arxiv.org/abs/2412.17626

[5] Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, and Xipeng Qiu. Evolution of concepts in language model pre-training. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://arxiv.org/abs/2509.17196

[6] Deniz Bayazit, Aaron Mueller, and Antoine Bosselut. Crosscoding through time: Tracking emergence and consolidation of linguistic representations throughout LLM pretraining. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL), 2026. URL https://arxiv.org/abs/2509.05291

[7] Tatsuro Inaba, Go Kamoda, Kentaro Inui, Masaru Isonuma, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, and Yu Takagi. How a bilingual LM becomes bilingual: Tracing internal representations with sparse autoencoders. In Findings of EMNLP, 2025. URL https://arxiv.org/abs/2503.06394

[8] Tatsuya Aoyama, Ethan Wilcox, and Nathan Schneider. Predicting the formation of induction heads. arXiv preprint arXiv:2511.16893, 2025. URL https://arxiv.org/abs/2511.16893

[9] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshnik, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2021.

[10] Lee Sharkey, Dan Braun, and Beren Millidge. Taking features out of superposition with sparse autoencoders. AI Alignment Forum, 2022.

[11] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97, 1955. doi: 10.1002/nav.3800020109

[12] Philipp Stecher. From birth to loss of representations in artificial neural networks. In Software Engineering and Formal Methods. SEFM 2024 Collocated Workshops, Lecture Notes in Computer Science, pages 271–289. Springer, Cham, 2026. ISBN 978-3-031-94748-3. doi: 10.1007/978-3-031-94748-3_20

[13] Aleksandar Makelov, George Lange, and Neel Nanda. Towards principled evaluations of sparse autoencoders for interpretability and control. arXiv preprint arXiv:2405.08366, 2024.

[14] Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2403.19647

[15] Nicholas Shea. Representation in Cognitive Science. Oxford University Press, 2018.

[16] Ben Baker, Benjamin Lansdell, and Konrad P. Kording. Three aspects of representation in neuroscience. Trends in Cognitive Sciences, 26(11):942–958, 2022.

[17] Philipp Stecher, Sandro Radovanović, Vlasta Sikimić, and Reinhard Kahle. Scaffolded representation learning in deep networks. Research Square preprint, under review, 2026. doi: 10.21203/rs.3.rs-9269961/v1

[18] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[19] Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. In International Conference on Learning Representations, 2019. URL https://arxiv.org/abs/1711.08856

[20] Michael Kleinman, Alessandro Achille, and Stefano Soatto. Critical learning periods emerge even in deep linear networks. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Aq35gl2c1k

[21] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://arxiv.org/abs/1803.03635

[22] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. OpenAI, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

[23] Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024.

[24] David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Satvik Golechha, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders. arXiv preprint arXiv:2409.14507, 2024. URL https://arxiv.org/abs/2409.14507. Accepted at NeurIPS 2025 (Oral).
