pith. sign in

arxiv: 2605.18789 · v1 · pith:PIMRL5A3new · submitted 2026-05-07 · 🧬 q-bio.NC · cs.AI

Features have life history. And we should care

Pith reviewed 2026-05-20 23:28 UTC · model grok-4.3

classification 🧬 q-bio.NC cs.AI
keywords feature life historyrepresentational backbonecarrier scaffoldtraining dynamicssparse featuresPythia modelstwo-phase training
0
0 comments X

The pith

Language models form a stable scaffold of about 50 sparse features early in training that organizes the rest of their representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Features in language models emerge, persist, and die as training proceeds, yet their histories show a persistent backbone. In Pythia-160M and -410M this backbone appears as roughly 50 sparse features with stable life histories around which the model's structure organizes. The scaffold assembles in the first 1 percent of training, proves load-bearing under joint ablation, is predictable from initial firing patterns alone, and later recruits 64 percent of active features into its hierarchy. These observations support a two-phase account in which early selection sets the scaffold while the remaining training calibrates geometry around it.

Core claim

The paper identifies a carrier scaffold of approximately 50 sparse features with stable life histories that functions as the persistent representational backbone in Pythia models. This scaffold assembles early, is load-bearing under joint cross-layer ablation, has its membership predictable from training-onset firing patterns before geometry settles, and seeds later development by recruiting most active features into the scaffold hierarchy by the end of training.

What carries the argument

The carrier scaffold: a small population of sparse features with stable life histories identified through joint cross-layer ablation and life-history tracking that serves as the organizing backbone for the model's representations.

If this is right

  • The scaffold is largely fixed after the first 1 percent of training, so later steps mainly adjust geometry around an already chosen substrate.
  • Joint ablation of scaffold carriers reveals outsized impact on model behavior compared with non-scaffold populations of equal size.
  • Onset firing patterns alone distinguish future carriers from non-carriers in four out of five cases.
  • By training's end the scaffold has incorporated 64 percent of all active features into its hierarchy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the scaffold is set so early, interventions confined to the first percent of training could steer final model organization more efficiently than later adjustments.
  • The two-phase pattern may extend to other model families and scales, implying that representational structure is similarly fixed early across architectures.
  • Life-history tracking of features could serve as a diagnostic for when a model has completed its structural selection phase.

Load-bearing premise

Joint cross-layer ablation correctly measures load-bearing importance without being confounded by feature interactions or correlations.

What would settle it

Ablating the identified scaffold features produces no larger performance drop than ablating a count-matched set of non-scaffold features.

Figures

Figures reproduced from arXiv: 2605.18789 by Philipp Stecher, Reinhard Kahle, Sandro Radovanovi\'c, Vlasta Sikimi\'c.

Figure 1
Figure 1. Figure 1: The scaffold’s assembly dynamics and its organisational reach. [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Joint cross-layer ablation quantifies the scaffold’s load. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Function settles early; direction calibrates late; the gap makes the scaffold legible at step [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The scaffold organises two-thirds of the mature network through hierarchically scaffolded [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Kaplan–Meier survival by CI tier at Pythia-160M (left) and Pythia-410M (right), with [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Firing breadth (fraction of validation tokens on which a feature activates) by CI tier at step [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Pythia-410M per-layer cross-checks for the unembedding-cosine and within-layer DAG [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
read the original abstract

Features in language models have life history: they emerge, persist, and die during training, yet the importance of that history remains largely unexplored. We find evidence of a persistent representational backbone, which we identify in Pythia-160M and -410M as the carrier scaffold: ${\sim}50$ sparse features with stable life histories, around which the model's representational structure organises. It has four properties. \emph{(i)}~\emph{It assembles early:} features emerge, die, and reorganise ${\sim}40\!\times$ faster in the first $1\%$ of training than afterwards, and the scaffold is already largely fixed by then. \emph{(ii)}~\emph{It is load-bearing:} joint cross-layer ablation identifies the carriers as far more load-bearing than any count-matched non-scaffold population, a gap invisible to per-firing single-feature methods. \emph{(iii)}~\emph{Function precedes direction:} which features will become carriers is already predictable from training-onset firing patterns alone, correctly distinguishing future carriers from non-carriers in $4$ of $5$ cases, before the geometry has settled. \emph{(iv)}~\emph{It seeds subsequent development:} by the end of training, scaffold carriers have recruited $64\%$ of all active features into the scaffold hierarchy. Life history is consistent with a two-phase account of training: selection appears to largely determine the scaffold in the first $1\%$; the remaining $99\%$ appears to calibrate geometry around a substrate already set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes the life histories of sparse features during training in language models, identifying a persistent 'carrier scaffold' of ~50 features in Pythia-160M and -410M that organizes representational structure. It reports four properties: (i) early assembly, with ~40x faster emergence/die/reorganization in the first 1% of training and the scaffold largely fixed by then; (ii) load-bearing, via joint cross-layer ablation yielding larger performance gaps than count-matched non-scaffold sets (invisible to single-feature ablations); (iii) function precedes direction, with carrier status predictable from onset firing patterns alone (4/5 accuracy); (iv) seeds development by recruiting 64% of active features. The authors propose a two-phase training account with early selection and later calibration.

Significance. If the ablation results and predictions hold after addressing potential confounds, the work would usefully draw attention to temporal feature dynamics that are often ignored in interpretability studies, potentially informing staged training protocols or early intervention strategies.

major comments (2)
  1. [Abstract, property (ii)] Abstract, property (ii): The load-bearing claim for the carrier scaffold rests on joint cross-layer ablation producing a performance gap versus count-matched controls. Without explicit checks for multicollinearity (e.g., variance inflation factors) or orthogonalized interventions, the gap could arise from feature correlations or shared downstream effects rather than an organizing backbone role.
  2. [Abstract, property (iii)] Abstract, property (iii): The 4-of-5 accuracy claim that onset firing patterns predict future carriers requires details on the classifier, cross-validation procedure, and comparison to chance or non-onset baselines to confirm it is not inflated by correlations in the early data.
minor comments (2)
  1. [Abstract] The abstract supplies no methods details, error bars, or statistical tests, which hinders evaluation of the reported percentages and accuracies.
  2. [Abstract] Define 'sparse features' and the exact selection criteria for the ~50 carriers more explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive suggestions, which help clarify the methodological foundations of our claims about the carrier scaffold. We respond to each major comment below and indicate the revisions we will make to address the concerns raised.

read point-by-point responses
  1. Referee: [Abstract, property (ii)] Abstract, property (ii): The load-bearing claim for the carrier scaffold rests on joint cross-layer ablation producing a performance gap versus count-matched controls. Without explicit checks for multicollinearity (e.g., variance inflation factors) or orthogonalized interventions, the gap could arise from feature correlations or shared downstream effects rather than an organizing backbone role.

    Authors: We agree that multicollinearity among features could contribute to the observed performance gap and that additional controls would strengthen the load-bearing interpretation. In the revised manuscript we will add a variance inflation factor analysis computed on the joint activation matrix of scaffold features across layers. We will also report an orthogonalized ablation variant in which we first project activations onto the subspace orthogonal to the top principal components of the non-scaffold population before performing the joint ablation. Our existing controls already match on feature count, mean activation magnitude, and layer distribution; the gap remains large under these constraints. We will incorporate the new VIF and orthogonalized results to directly address the referee’s concern. revision: yes

  2. Referee: [Abstract, property (iii)] Abstract, property (iii): The 4-of-5 accuracy claim that onset firing patterns predict future carriers requires details on the classifier, cross-validation procedure, and comparison to chance or non-onset baselines to confirm it is not inflated by correlations in the early data.

    Authors: We will expand the Methods section to specify that a logistic regression classifier with L2 regularization was trained on per-feature onset firing rates (first 1 % of training steps) to predict end-of-training carrier status. We employed 5-fold stratified cross-validation to preserve class balance and report mean accuracy of 80 % (4/5). This is compared against (i) a random-label baseline (50 %) and (ii) a non-onset baseline using firing rates from the middle of training; onset firing alone yields significantly higher accuracy (permutation test, p < 0.01). A supplementary figure showing the ROC curve and feature-importance weights will be added. These details will be included in the revised version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims are observational

full rationale

The paper reports empirical observations of feature life histories during model training, identifying a carrier scaffold via patterns in emergence, persistence, ablation effects, and predictability from onset firing. No equations, derivations, or self-referential definitions appear in the provided text. Properties (i)-(iv) are presented as data-driven findings rather than quantities constructed from fitted parameters or prior self-citations within the same chain. The analysis relies on external benchmarks like cross-layer ablation and temporal prediction without reducing the central claims to tautological inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The analysis rests on domain assumptions about feature tracking and sparsity rather than new free parameters or invented physical entities.

axioms (2)
  • domain assumption Individual features in language models can be tracked across training steps for emergence, persistence, and death.
    Required to define life histories and the scaffold.
  • domain assumption Sparse features extracted via standard methods form the right granularity for representational analysis.
    Underlies identification of the ~50 carrier features.
invented entities (1)
  • carrier scaffold no independent evidence
    purpose: Label for the persistent backbone of ~50 stable sparse features that organizes model representations.
    New organizing concept introduced to describe the observed stable features and their properties.

pith-pipeline@v0.9.0 · 5820 in / 1429 out tokens · 61119 ms · 2026-05-20T23:28:04.048133+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 7 internal anchors

  1. [1]

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L. Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Ol...

  2. [2]

    Scaling and evaluating sparse autoencoders

    Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093, 2024

  3. [3]

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, Usvsn Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: a suite for analyzing large language models across training and scaling. InProceedings of the 40th International Conferen...

  4. [4]

    Tracking the feature dynamics in LLM training: A mechanistic study.arXiv preprint arXiv:2412.17626, 2024

    Yang Xu, Yi Wang, Hengguan Huang, and Hao Wang. Tracking the feature dynamics in LLM training: A mechanistic study.arXiv preprint arXiv:2412.17626, 2024. URL https: //arxiv.org/abs/2412.17626

  5. [5]

    Evolution of concepts in language model pre-training

    Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, and Xipeng Qiu. Evolution of concepts in language model pre-training. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026. URLhttps://arxiv.org/abs/2509.17196

  6. [6]

    Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

    Deniz Bayazit, Aaron Mueller, and Antoine Bosselut. Crosscoding through time: Tracking emergence and consolidation of linguistic representations throughout llm pretraining. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL), 2026. URLhttps://arxiv.org/abs/2509.05291

  7. [7]

    How a bilingual LM becomes bilingual: Tracing internal representations with sparse autoencoders

    Tatsuro Inaba, Go Kamoda, Kentaro Inui, Masaru Isonuma, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, and Yu Takagi. How a bilingual LM becomes bilingual: Tracing internal representations with sparse autoencoders. InFindings of EMNLP, 2025. URL https: //arxiv.org/abs/2503.06394

  8. [8]

    Predicting the formation of induc- tion heads.arXiv preprint arXiv:2511.16893, 2025

    Tatsuya Aoyama, Ethan Wilcox, and Nathan Schneider. Predicting the formation of induc- tion heads.arXiv preprint arXiv:2511.16893, 2025. URL https://arxiv.org/abs/2511. 16893

  9. [9]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshnik, Shawn Presser, and Connor Leahy. The pile: An 800GB dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2021

  10. [10]

    Taking features out of superposition with sparse autoencoders

    Lee Sharkey, Dan Braun, and Beren Millidge. Taking features out of superposition with sparse autoencoders. AI Alignment Forum, 2022

  11. [11]

    Harold W. Kuhn. The Hungarian method for the assignment problem.Naval Research Logistics Quarterly, 2(1–2):83–97, 1955. doi: 10.1002/nav.3800020109

  12. [12]

    From birth to loss of representations in artificial neural networks

    Philipp Stecher. From birth to loss of representations in artificial neural networks. InSoft- ware Engineering and Formal Methods. SEFM 2024 Collocated Workshops, Lecture Notes in Computer Science, pages 271–289. Springer, Cham, 2026. ISBN 978-3-031-94748-3. doi: 10.1007/978-3-031-94748-3\_20

  13. [13]

    Towards principled evaluations of sparse autoencoders for interpretability and control.arXiv preprint arXiv:2405.08366, 2024

    Aleksandar Makelov, George Lange, and Neel Nanda. Towards principled evaluations of sparse autoencoders for interpretability and control.arXiv preprint arXiv:2405.08366, 2024

  14. [14]

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: discovering and editing interpretable causal graphs in language models. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2403.19647. 10

  15. [15]

    Oxford University Press, 2018

    Nicholas Shea.Representation in Cognitive Science. Oxford University Press, 2018

  16. [16]

    Ben Baker, Benjamin Lansdell, and Konrad P. Kording. Three aspects of representation in neuroscience.Trends in Cognitive Sciences, 26(11):942–958, 2022

  17. [17]

    & Domenico, M

    Philipp Stecher, Sandro Radovanovi´c, Vlasta Sikimi ´c, and Reinhard Kahle. Scaffolded rep- resentation learning in deep networks.Research Square preprint, 2026. doi: 10.21203/rs.3. rs-9269961/v1. Research Square preprint, under review

  18. [18]

    Locating and editing factual associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  19. [19]

    Critical Learning Periods in Deep Neural Networks

    Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. InInternational Conference on Learning Representations, 2019. URL https://arxiv.org/abs/1711.08856

  20. [20]

    Critical learning periods emerge even in deep linear networks

    Michael Kleinman, Alessandro Achille, and Stefano Soatto. Critical learning periods emerge even in deep linear networks. InInternational Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=Aq35gl2c1k

  21. [21]

    The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. InInternational Conference on Learning Representations, 2019. URL https://arxiv.org/abs/1803.03635

  22. [22]

    Language models can explain neurons in language models

    Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. OpenAI, 2023. URL https://openaipublic.blob.core.windows. net/neuron-explainer/paper/index.html

  23. [23]

    Automatically interpreting millions of features in large language models.arXiv preprint arXiv:2410.13928, 2024

    Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models.arXiv preprint arXiv:2410.13928, 2024

  24. [24]

    A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders , journal =

    David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Satvik Golechha, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse au- toencoders.arXiv preprint arXiv:2409.14507, 2024. URL https://arxiv.org/abs/2409. 14507. Accepted at NeurIPS 2025 (Oral). 11 A Semantic Anatomy of the Carrier Scaffold Of the 51 pe...