When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

Orion Reblitz-Richardson

arXiv: 2606.11375 · v1 · pith:SRU3JXLPnew · submitted 2026-06-09 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-27 13:23 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: fragility metric, linear probing, LLM pre-training, probing accuracy, representation robustness, moral representations, fine-tuning analysis

The pith

Fragility metric tracks representation changes in language models long after probing accuracy saturates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard linear probing on hidden states reaches high accuracy early in pre-training and then stays flat, hiding most of the training dynamics. Fragility is introduced as the level of added activation noise at which that accuracy collapses, making it sensitive to separability margins and representational redundancy that keep shifting afterward. When applied to open checkpoints, it exposes a shift from lexical to compositional encoding of moral properties, a steady increase in robustness from early to late layers, and different robustness signatures from fine-tuning sets that match exactly on accuracy. A reader would care because it turns a flat instrument into one that continues to resolve structure throughout training.

Core claim

Fragility, defined as the activation-noise level at which probe accuracy collapses, recovers structure that accuracy alone cannot see. Moralized representations emerge along a lexical-to-compositional gradient, with lexical detection appearing first and compositional encoding later; the compositional encoding is established by transfer across construction types that share no contrast tokens. A monotonic layer-depth robustness gradient develops across training while accuracy remains flat. Matched fine-tuning corpora that yield identical probing accuracy produce distinct fragility fingerprints, showing that data curation reshapes probe robustness without changing accuracy.

What carries the argument

Fragility: the activation-noise level at which probe accuracy collapses. It is sensitive to both the margin of separability and the redundancy of the representation.
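
As a rough illustration of this definition, the sketch below fits a least-squares linear probe on synthetic two-class activations, adds isotropic Gaussian noise of increasing scale to held-out activations, and reports the smallest scale at which mean accuracy falls below a cutoff. The synthetic data, the least-squares probe, the Gaussian noise, the 0.75 cutoff, the noise grid, and every function name are assumptions made for illustration; this summary does not specify the paper's actual probe, noise distribution, or collapse criterion.

```python
import numpy as np

def fit_linear_probe(x, y):
    """Least-squares linear probe for labels y in {0, 1}; returns weights plus bias."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    w, *_ = np.linalg.lstsq(xb, 2.0 * y - 1.0, rcond=None)
    return w

def probe_accuracy(w, x, y):
    """Fraction of examples whose sign under the probe matches the label."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    return float(np.mean((xb @ w > 0).astype(int) == y))

def critical_noise(w, x, y, sigmas, cutoff=0.75, n_draws=20, seed=0):
    """Smallest sigma on the grid whose mean noisy accuracy is below `cutoff`.

    Returns None if accuracy never drops below the cutoff on this grid.
    The cutoff and the Gaussian noise model are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    for sigma in sigmas:
        accs = [probe_accuracy(w, x + rng.normal(0.0, sigma, x.shape), y)
                for _ in range(n_draws)]
        if np.mean(accs) < cutoff:
            return float(sigma)
    return None

def make_activations(margin, n=400, dim=16, seed=0):
    """Hypothetical stand-in for hidden states: two blobs separated along axis 0."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, n)
    x = rng.normal(0.0, 1.0, (n, dim))
    x[:, 0] += margin * (2 * y - 1)
    return x, y

sigmas = np.linspace(0.5, 20.0, 40)
results = {}
for margin in (4.0, 8.0):
    x, y = make_activations(margin)
    w = fit_linear_probe(x[:200], y[:200])          # train on the first half
    results[margin] = (probe_accuracy(w, x[200:], y[200:]),
                       critical_noise(w, x[200:], y[200:], sigmas))
```

In this toy setting both margins should give near-ceiling clean accuracy, while the larger margin should tolerate more noise before collapse. That is the kind of difference the metric is meant to expose once accuracy has saturated.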

If this is right

Moralized representations develop first through lexical cues and later through compositional structure, shown by cross-construction transfer.
Robustness to noise increases monotonically from early to late layers during pre-training.
Different fine-tuning data sets can produce the same probing accuracy yet leave measurably different robustness profiles.
Where accuracy returns flat results across training steps or data conditions, fragility returns structured differences.
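
The cross-construction transfer in the first point can be pictured as leave-one-construction-out evaluation: fit a probe on all construction types but one, then score it on the held-out type. The sketch below is a minimal version on synthetic data, assuming a least-squares linear probe and a label direction shared across constructions. The data generator, dimensions, and function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_probe(x, y):
    """Least-squares linear probe for labels y in {0, 1}; returns weights plus bias."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    w, *_ = np.linalg.lstsq(xb, 2.0 * y - 1.0, rcond=None)
    return w

def accuracy(w, x, y):
    xb = np.hstack([x, np.ones((len(x), 1))])
    return float(np.mean((xb @ w > 0).astype(int) == y))

def leave_construction_out(x, y, construction):
    """Train on every construction type except one; test on the held-out type."""
    scores = {}
    for held_out in np.unique(construction):
        test = construction == held_out
        w = fit_probe(x[~test], y[~test])
        scores[int(held_out)] = accuracy(w, x[test], y[test])
    return scores

# Synthetic stand-in: the label lives on axis 0 for every construction,
# while each construction adds its own label-independent offset.
rng = np.random.default_rng(0)
n, dim, n_constructions = 600, 12, 3
y = rng.integers(0, 2, n)
construction = rng.integers(0, n_constructions, n)
x = rng.normal(0.0, 1.0, (n, dim))
x[:, 0] += 2.0 * (2 * y - 1)
x[np.arange(n), 1 + construction] += 3.0

scores = leave_construction_out(x, y, construction)
```

High held-out accuracy here only reflects that the toy data share one label direction by design. On real data, the comparison of interest is against a lexical baseline such as the bag-of-words transfer floor mentioned in the Figure 2 caption.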

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Fragility could be used to monitor when specific capabilities stabilize during large-scale pre-training runs.
The same noise-threshold approach might distinguish representation quality in other probing tasks beyond moral detection.
Data curation choices may affect model internals more than accuracy-based evaluations currently reveal.

Load-bearing premise

The fragility metric captures margin of separability and redundancy of representation without being confounded by probe architecture, noise distribution, or dataset lexical properties.

What would settle it

Measuring fragility across successive pre-training checkpoints after accuracy has plateaued and finding no further change, or finding identical fragility values for matched fine-tuning corpora claimed to differ, would falsify the claim that fragility continues to resolve structure where accuracy does not.

Figures

Figures reproduced from arXiv: 2606.11375 by Orion Reblitz-Richardson.

**Figure 1.** Lexical → compositional emergence gradient on OLMo-2 1B early-training. Standard moral and sentiment probes (single-token swap) plateau near 0.97; compositional moral and syntax (multi-token integration) plateau near 0.77. Compositional curve is 4-seed mean ± std (split seeds 42/43/44/45). Onsets: standard moral 1K, sentiment 2K, compositional moral 5K (per-seed range 4K–7K), syntax 6K.

**Figure 2.** Compositional moral encoding emerges in early pre-training and then holds (OLMo-2 1B early-training, 37 checkpoints). (a) When: leave-construction-out transfer accuracy rises from chance (∼0.55) across steps 2K–9K, crosses the bag-of-words transfer floor (∼0.60) by step 2K, and plateaus near 0.82 (lift ∼+0.20 over the lexical floor) through step 36K; the role_reversal curve (lexical cues scrambled) tracks …

**Figure 3.** Probing accuracy saturates; fragility resolves. OLMo-2 1B early-training, 37 checkpoints. Top: mean probing accuracy across all 16 layers, which saturates near 0.95 by step 4K and stays flat for the remaining 32K steps. Bottom: mean critical noise, which continues evolving long after accuracy plateaus, drifting from ∼10 down toward ∼6 between steps 4K and 36K. Top panel reaches a ceiling and stops; bottom panel keep…

**Figure 4.** Layer-depth structure over training (OLMo-2 1B early-training). (a) Probing accuracy: uniformly high across layers after step 4K. (b) Critical noise: a layer-depth gradient develops, with late layers holding maximum noise tolerance while early layers grow progressively more brittle. Same data, same model; structure visible only under the fragility metric.

**Figure 5.** Data curation reshapes probe robustness, not probe accuracy (OLMo-2 1B; LoRA fine-tuning from step 1000 on three matched corpora). (a) Final probing accuracy is near-identical across narrative-moral, declarative-moral, and general-text control conditions (0.740 / 0.750 / 0.750). (b) Per-layer critical noise is condition-specific: declarative-moral training produces diffuse fragility across 10 of 16 layers …

Original abstract

Standard linear probing declares a property "encoded" when a classifier on hidden states achieves high accuracy. The protocol works well on a snapshot but breaks across pre-training: probe accuracy saturates within the first few thousand steps, leaving most of training invisible to the instrument. We introduce fragility, a complementary per-layer metric defined as the activation-noise level at which probe accuracy collapses. Fragility is sensitive to both the margin of separability and the redundancy of representation, both of which keep evolving long after accuracy plateaus. Applied to open-checkpoint language models, fragility recovers structure that accuracy alone cannot see. Moralized representations emerge along a lexical $\to$ compositional gradient: lexical moral detection first, compositional moral encoding later. Because probe accuracy on its own tracks how lexically separable a dataset is, we establish the compositional encoding directly, by showing it transfers across construction types that share no contrast tokens. A layer-depth robustness gradient develops monotonically across training while accuracy stays flat. And matched fine-tuning corpora that produce identical probing accuracy leave distinct fragility fingerprints, showing that data curation reshapes probe robustness without changing probe accuracy. In every comparison we test, where probing accuracy returns a flat answer, fragility returns a structured one.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Fragility tracks post-saturation changes in representations via noise collapse threshold, but needs checks that it is not just probe- or noise-dependent.

The paper's main point is that linear-probe accuracy saturates early in pre-training, so it defines fragility as the noise amplitude at which that accuracy collapses and uses it to keep seeing structure in the hidden states.

What is new is the specific claim that this threshold picks up the margin of separability and the redundancy that accuracy misses. The paper reports three patterns on open checkpoints: moral detection starts lexical and then becomes compositional (shown by transfer across construction types with no shared contrast tokens), layer robustness increases monotonically with depth while accuracy stays flat, and fine-tuning runs on matched corpora that agree on accuracy still differ on fragility.

Those examples are concrete and the transfer test is a reasonable control for lexical confounds. The idea of a complementary metric that stays informative later in training is straightforward and addresses a real limitation of the standard probing protocol.

The soft spot is that fragility is defined only as the collapse point under added noise, with no shown invariance to probe architecture, noise distribution, or dataset properties. If the reported gradients shift when you swap a linear probe for a nonlinear one or change the noise, the patterns could be measurement artifacts rather than intrinsic to the representations. The abstract gives no equations or controls, so the full paper has to demonstrate that the metric is stable enough to support the conclusions.

This is for people who run long pre-training runs and want instruments that do not go blind after a few thousand steps. It is worth sending to referees because the saturation problem is common and a working post-saturation metric would be practical, even if the current version needs tighter validation on the measurement choices.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard linear probing accuracy saturates early in LLM pre-training (within the first few thousand steps), leaving most of training invisible to the probe. It introduces fragility, defined as the activation-noise level at which probe accuracy collapses, as a complementary per-layer metric sensitive to the margin of separability and the redundancy of the representation. Applied to open-checkpoint models, fragility reveals moralized representations emerging along a lexical-to-compositional gradient (established via transfer across construction types sharing no contrast tokens), a monotonic layer-depth robustness gradient, and distinct fragility fingerprints from matched fine-tuning corpora that produce identical probing accuracy.

Significance. If the central claim holds, the work supplies a useful complementary instrument for tracking representation evolution after accuracy saturation, with concrete structure recovered in moral encoding gradients and fine-tuning effects. Credit is due for the transfer-based argument establishing compositional encoding and for applying the metric across open checkpoints to produce falsifiable gradients where accuracy is flat.

major comments (3)

[Method / fragility definition] The definition of fragility (noise amplitude at accuracy collapse) is presented as capturing intrinsic margin/redundancy, but the manuscript must demonstrate invariance to probe architecture (linear vs. nonlinear) and noise distribution; without such controls the reported lexical-to-compositional gradient and layer-depth robustness could be artifacts of the measurement procedure rather than representation properties.
[Moralized representations experiments] The compositional moral encoding claim rests on transfer across construction types sharing no contrast tokens; the paper should report quantitative lexical overlap statistics or ablation controls on the datasets to confirm the transfer isolates compositional structure rather than residual lexical cues.
[Fine-tuning corpus comparisons] The distinct fragility fingerprints from matched fine-tuning corpora (identical accuracy) are load-bearing for the data-curation claim; statistical significance tests, error bars across runs, and details on corpus matching criteria are required to establish that the fragility differences are robust and not due to unaccounted variance.
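
The lexical-overlap statistic requested in the second major comment could be as simple as a Jaccard index over token types. The sketch below uses whitespace tokenization and two made-up sentence sets. Both are assumptions; a real check would use the model's tokenizer and the paper's construction-type datasets.

```python
def token_types(sentences):
    """Set of lowercased whitespace tokens across a list of sentences."""
    return {tok for s in sentences for tok in s.lower().split()}

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical examples, not drawn from the paper's data.
construction_a = ["she returned the lost wallet", "she kept the lost wallet"]
construction_b = ["the wallet was returned by him", "the wallet was kept by him"]

types_a = token_types(construction_a)
types_b = token_types(construction_b)
overlap = jaccard(types_a, types_b)      # 4 shared types out of 9 in total
shared = sorted(types_a & types_b)       # ['kept', 'returned', 'the', 'wallet']
```

In this made-up pair the contrast tokens "returned" and "kept" appear in both sets, which is exactly the kind of residual lexical cue such a check is meant to catch before transfer is read as compositional.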

minor comments (2)

[Abstract / Introduction] The abstract and introduction should explicitly name the open-checkpoint models and layer indices used, to allow immediate replication of the reported gradients.
[Method] Formalize the fragility threshold with an equation (e.g., the minimal noise amplitude σ such that accuracy drops below threshold) rather than prose description alone.
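
One possible formalization of the threshold, offered as an assumption and not as the paper's own notation: let $h_\ell(x)$ be the layer-$\ell$ hidden state, $f_\ell$ the trained probe, $\epsilon \sim \mathcal{N}(0, I)$ the injected noise direction, and $\tau$ a chosen accuracy cutoff.

```latex
\mathrm{Acc}_\ell(\sigma)
  = \mathbb{E}_{(x,y),\,\epsilon}\!\left[\mathbb{1}\{\, f_\ell\big(h_\ell(x) + \sigma\epsilon\big) = y \,\}\right],
\qquad
\sigma^{*}_\ell = \inf\{\, \sigma \ge 0 : \mathrm{Acc}_\ell(\sigma) < \tau \,\}.
```

Under this reading, a larger $\sigma^{*}_\ell$ means layer $\ell$ tolerates more noise and is therefore less fragile, which matches the "critical noise" quantity named in the figure captions.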

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The suggestions strengthen the methodological rigor and empirical support for the fragility metric. We address each major comment below and will revise the manuscript to incorporate the requested controls and analyses.

Referee: [Method / fragility definition] The definition of fragility (noise amplitude at accuracy collapse) is presented as capturing intrinsic margin/redundancy, but the manuscript must demonstrate invariance to probe architecture (linear vs. nonlinear) and noise distribution; without such controls the reported lexical-to-compositional gradient and layer-depth robustness could be artifacts of the measurement procedure rather than representation properties.

Authors: We agree that invariance checks are required to rule out measurement artifacts. In the revised manuscript we will add experiments comparing linear probes to nonlinear probes (two-layer MLPs with ReLU) and will test both uniform and Gaussian noise distributions, confirming that the lexical-to-compositional gradient and layer-depth robustness pattern remain stable across these choices. revision: yes
Referee: [Moralized representations experiments] The compositional moral encoding claim rests on transfer across construction types sharing no contrast tokens; the paper should report quantitative lexical overlap statistics or ablation controls on the datasets to confirm the transfer isolates compositional structure rather than residual lexical cues.

Authors: We will add quantitative lexical overlap statistics (token-level Jaccard and type overlap) between the construction-type datasets and will include ablation controls that remove any shared contrast tokens before measuring transfer. These additions will be reported in a new subsection of the moral-encoding experiments. revision: yes
Referee: [Fine-tuning corpus comparisons] The distinct fragility fingerprints from matched fine-tuning corpora (identical accuracy) are load-bearing for the data-curation claim; statistical significance tests, error bars across runs, and details on corpus matching criteria are required to establish that the fragility differences are robust and not due to unaccounted variance.

Authors: We will expand the fine-tuning section to include (i) explicit corpus-matching criteria (token count, domain distribution, and moral-token frequency), (ii) error bars from five independent fine-tuning runs per corpus, and (iii) paired t-tests or Wilcoxon tests with p-values for fragility differences at each layer. These results will be added to the existing figures and tables. revision: yes
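
To make the per-layer comparison in the last response concrete, the sketch below runs an exact paired sign-flip permutation test on hypothetical critical-noise values from five paired runs. The sign-flip test is swapped in here as a dependency-free stand-in for the paired t-test or Wilcoxon test the authors mention, and the numbers are invented for illustration.

```python
from itertools import product
from statistics import mean

def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    count = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        # Small tolerance guards against floating-point ties.
        if abs(mean(flipped)) >= observed - 1e-12:
            count += 1
    return count / total

# Hypothetical critical-noise values for one layer, five paired runs per corpus.
narrative = [6.1, 6.4, 5.9, 6.3, 6.2]
declarative = [5.2, 5.6, 5.1, 5.5, 5.0]
p_value = paired_sign_flip_pvalue(narrative, declarative)   # 2 / 32 = 0.0625
```

With only five paired runs, the smallest two-sided p-value this exact test can return is 2/32 = 0.0625, so per-layer significance at the 0.05 level would need more runs or a parametric test with its own assumptions.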

Circularity Check

0 steps flagged

No significant circularity; metric defined independently

The paper defines fragility directly as the activation-noise level at which probe accuracy collapses, without any equations or reductions that make it equivalent to accuracy or to fitted parameters by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. Claims about gradients and fingerprints are supported by direct comparisons (e.g., transfer across construction types with no shared tokens) that do not reduce to the input definitions. The argument is self-contained and tested against external comparisons, consistent with a circularity score in the 0-2 range, the most common finding for this check.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on the assumption that noise collapse level independently tracks separability and redundancy; no free parameters or invented entities beyond the new metric itself are stated.

axioms (1)

domain assumption: Probe accuracy saturates within the first few thousand steps of pre-training.
Invoked as the core limitation that fragility is designed to address.

invented entities (1)

fragility metric (no independent evidence)
purpose: Complementary per-layer measure of representation robustness via noise threshold
Newly introduced quantity with no independent evidence outside the paper's claims.

pith-pipeline@v0.9.1-grok · 5749 in / 1380 out tokens · 26503 ms · 2026-06-27T13:23:45.424915+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 18 canonical work pages · 10 internal anchors

[2]

Computational Linguistics , volume =

Belinkov, Yonatan , title =. Computational Linguistics , volume =. 2022 , doi =

2022
[3]

Advances in Neural Information Processing Systems (NeurIPS) , volume =

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =. 2022 , url =

2022
[5]

Haidt, Jonathan , title =
[6]

and Ditto, Peter H

Graham, Jesse and Haidt, Jonathan and Koleva, Sena and Motyl, Matt and Iyer, Ravi and Wojcik, Sean P. and Ditto, Peter H. , title =. Advances in Experimental Social Psychology , volume =. 2013 , doi =

2013
[9]

Hewitt, John and Liang, Percy , title =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages =. 2019 , doi =

2019
[10]

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL) , year =

Pimentel, Tiago and Valvoda, Josef and Hall Maudslay, Rowan and Zmigrod, Ran and Williams, Adina and Cotterell, Ryan , title =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL) , year =
[11]

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

Voita, Elena and Titov, Ivan , title =. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

2020
[12]

and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , title =

Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , title =. International Conference on Learning Representations (ICLR) , year =
[14]

International Conference on Machine Learning (ICML) , year =

Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and others , title =. International Conference on Machine Learning (ICML) , year =
[15]

Transformer Circuits Thread , year =

Olsson, Catherine and Elhage, Nelson and Nanda, Neel and Joseph, Nicholas and DasSarma, Nova and Henighan, Tom and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and others , title =. Transformer Circuits Thread , year =
[16]

International Conference on Learning Representations (ICLR) , year =

Nanda, Neel and Chan, Lawrence and Lieberum, Tom and Smith, Jess and Steinhardt, Jacob , title =. International Conference on Learning Representations (ICLR) , year =
[17]

Walking Noise: On Layer-Specific Robustness of Neural Architectures against Noisy Computations and Associated Characteristic Learning Dynamics , journal =

Borras, Hendrik and Klein, Bernhard and Fr. Walking Noise: On Layer-Specific Robustness of Neural Architectures against Noisy Computations and Associated Characteristic Learning Dynamics , journal =. 2022 , url =

2022
[18]

Findings of the Association for Computational Linguistics (ACL) , year =

Qian, Chen and Zhang, Jie and Yao, Wei and Liu, Dongrui and Yin, Zhenfei and Qiao, Yu and Liu, Yong and Shao, Jing , title =. Findings of the Association for Computational Linguistics (ACL) , year =
[21]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2017. URL https://arxiv.org/abs/1610.01644

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717, 2024. URL https://arxiv.org/abs/2406.11717

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Computational Linguistics34(1), 1–34 (2008).https://doi.org/10.1162/coli

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48 0 (1): 0 207--219, 2022. doi:10.1162/coli\_a\_00422

work page doi:10.1162/coli 2022
[24]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (ICML), 2023. URL https://arxiv.org/abs/2304.01373

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Walking noise: On layer-specific robustness of neural architectures against noisy computations and associated characteristic learning dynamics

Hendrik Borras, Bernhard Klein, and Holger Fr \"o ning. Walking noise: On layer-specific robustness of neural architectures against noisy computations and associated characteristic learning dynamics. arXiv preprint arXiv:2212.10430, 2022. URL https://arxiv.org/abs/2212.10430

work page arXiv 2022
[26]

Wojcik, and Peter H

Jesse Graham, Jonathan Haidt, Sena Koleva, Matt Motyl, Ravi Iyer, Sean P. Wojcik, and Peter H. Ditto. Moral foundations theory: The pragmatic validity of moral pluralism. Advances in Experimental Social Psychology, 47: 0 55--130, 2013. doi:10.1016/B978-0-12-407236-7.00002-4

work page doi:10.1016/b978-0-12-407236-7.00002-4 2013
[27]

OLMo: Accelerating the Science of Language Models

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, et al. OLMo : Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024. URL https://arxiv.org/abs/2402.00838

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

The Righteous Mind: Why Good People Are Divided by Politics and Religion

Jonathan Haidt. The Righteous Mind: Why Good People Are Divided by Politics and Religion. Vintage Books, 2012

2012
[29]

Designing and Interpreting Probes with Control Tasks

John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733--2743, 2019. doi:10.18653/v1/D19-1275. URL https://arxiv.org/abs/1909.03368

work page doi:10.18653/v1/d19-1275 2019
[30]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA : Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. URL https://arxiv.org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

Locating and Editing Factual Associations in GPT

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT . In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022. URL https://arxiv.org/abs/2202.05262

work page internal anchor Pith review Pith/arXiv arXiv 2022
[32]

Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2301.05217

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

2 OLMo 2 Furious

OLMo Team . 2 OLMo 2 furious. arXiv preprint arXiv:2501.00656, 2025. URL https://arxiv.org/abs/2501.00656

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

In-context learning and induction heads

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

2022
[35]

Information-theoretic probing for linguistic structure

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020. URL https://arxiv.org/abs/2004.03061

work page arXiv 2020
[36]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022. URL https://arxiv.org/abs/2201.02177

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Towards tracing trustworthiness dynamics: Revisiting pre-training period of large language models

Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, and Jing Shao. Towards tracing trustworthiness dynamics: Revisiting pre-training period of large language models. In Findings of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/2402.19465

work page arXiv 2024
[38]

APEX : Probing neural networks via activation perturbation

Tao Ren, Xiaoyu Luo, and Qiongxiu Li. APEX : Probing neural networks via activation perturbation. arXiv preprint arXiv:2602.03586, 2026. URL https://arxiv.org/abs/2602.03586

work page arXiv 2026
[39]

Information-theoretic probing with minimum description length

Elena Voita and Ivan Titov. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. URL https://arxiv.org/abs/2003.12298

work page arXiv 2020
[40]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023. URL https://arxiv.org/abs/2310.01405

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [2]

Computational Linguistics , volume =

Belinkov, Yonatan , title =. Computational Linguistics , volume =. 2022 , doi =

2022

[2] [3]

Advances in Neural Information Processing Systems (NeurIPS) , volume =

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =. 2022 , url =

2022

[3] [5]

Haidt, Jonathan , title =

[4] [6]

and Ditto, Peter H

Graham, Jesse and Haidt, Jonathan and Koleva, Sena and Motyl, Matt and Iyer, Ravi and Wojcik, Sean P. and Ditto, Peter H. , title =. Advances in Experimental Social Psychology , volume =. 2013 , doi =

2013

[5] [9]

Hewitt, John and Liang, Percy , title =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages =. 2019 , doi =

2019

[6] [10]

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL) , year =

Pimentel, Tiago and Valvoda, Josef and Hall Maudslay, Rowan and Zmigrod, Ran and Williams, Adina and Cotterell, Ryan , title =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL) , year =

[7] [11]

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

Voita, Elena and Titov, Ivan , title =. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

2020

[8] [12]

and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , title =

Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , title =. International Conference on Learning Representations (ICLR) , year =

[9] [14]

International Conference on Machine Learning (ICML) , year =

Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and others , title =. International Conference on Machine Learning (ICML) , year =

[10] [15]

Transformer Circuits Thread , year =

Olsson, Catherine and Elhage, Nelson and Nanda, Neel and Joseph, Nicholas and DasSarma, Nova and Henighan, Tom and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and others , title =. Transformer Circuits Thread , year =

[11] [16]

International Conference on Learning Representations (ICLR) , year =

Nanda, Neel and Chan, Lawrence and Lieberum, Tom and Smith, Jess and Steinhardt, Jacob , title =. International Conference on Learning Representations (ICLR) , year =

[12] [17]

Walking Noise: On Layer-Specific Robustness of Neural Architectures against Noisy Computations and Associated Characteristic Learning Dynamics , journal =

Borras, Hendrik and Klein, Bernhard and Fr. Walking Noise: On Layer-Specific Robustness of Neural Architectures against Noisy Computations and Associated Characteristic Learning Dynamics , journal =. 2022 , url =

2022

[13] [18]

Findings of the Association for Computational Linguistics (ACL) , year =

Qian, Chen and Zhang, Jie and Yao, Wei and Liu, Dongrui and Yin, Zhenfei and Qiao, Yu and Liu, Yong and Shao, Jing , title =. Findings of the Association for Computational Linguistics (ACL) , year =

[14] [21]

Understanding intermediate layers using linear classifier probes

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2017. URL https://arxiv.org/abs/1610.01644

work page internal anchor Pith review Pith/arXiv arXiv 2017

[15] [22]

Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717, 2024. URL https://arxiv.org/abs/2406.11717

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [23]

Probing Classifiers: Promises, Shortcomings, and Advances

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1): 207--219, 2022. doi:10.1162/coli_a_00422

[17] [24]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (ICML), 2023. URL https://arxiv.org/abs/2304.01373

[18] [25]

Walking noise: On layer-specific robustness of neural architectures against noisy computations and associated characteristic learning dynamics

Hendrik Borras, Bernhard Klein, and Holger Fröning. Walking noise: On layer-specific robustness of neural architectures against noisy computations and associated characteristic learning dynamics. arXiv preprint arXiv:2212.10430, 2022. URL https://arxiv.org/abs/2212.10430

[19] [26]

Moral Foundations Theory: The Pragmatic Validity of Moral Pluralism

Jesse Graham, Jonathan Haidt, Sena Koleva, Matt Motyl, Ravi Iyer, Sean P. Wojcik, and Peter H. Ditto. Moral foundations theory: The pragmatic validity of moral pluralism. Advances in Experimental Social Psychology, 47: 55--130, 2013. doi:10.1016/B978-0-12-407236-7.00002-4

[20] [27]

OLMo: Accelerating the Science of Language Models

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, et al. OLMo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024. URL https://arxiv.org/abs/2402.00838

[21] [28]

The Righteous Mind: Why Good People Are Divided by Politics and Religion

Jonathan Haidt. The Righteous Mind: Why Good People Are Divided by Politics and Religion. Vintage Books, 2012

[22] [29]

Designing and Interpreting Probes with Control Tasks

John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733--2743, 2019. doi:10.18653/v1/D19-1275. URL https://arxiv.org/abs/1909.03368

[23] [30]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. URL https://arxiv.org/abs/2106.09685

[24] [31]

Locating and Editing Factual Associations in GPT

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022. URL https://arxiv.org/abs/2202.05262

[25] [32]

Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2301.05217

[26] [33]

2 OLMo 2 Furious

OLMo Team. 2 OLMo 2 furious. arXiv preprint arXiv:2501.00656, 2025. URL https://arxiv.org/abs/2501.00656

[27] [34]

In-context learning and induction heads

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

[28] [35]

Information-theoretic probing for linguistic structure

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020. URL https://arxiv.org/abs/2004.03061

[29] [36]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022. URL https://arxiv.org/abs/2201.02177

[30] [37]

Towards tracing trustworthiness dynamics: Revisiting pre-training period of large language models

Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, and Jing Shao. Towards tracing trustworthiness dynamics: Revisiting pre-training period of large language models. In Findings of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/2402.19465

[31] [38]

APEX: Probing neural networks via activation perturbation

Tao Ren, Xiaoyu Luo, and Qiongxiu Li. APEX: Probing neural networks via activation perturbation. arXiv preprint arXiv:2602.03586, 2026. URL https://arxiv.org/abs/2602.03586

[32] [39]

Information-theoretic probing with minimum description length

Elena Voita and Ivan Titov. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. URL https://arxiv.org/abs/2003.12298

[33] [40]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023. URL https://arxiv.org/abs/2310.01405
