Tracking Representation Dynamics in Large Language Models with Persistent Homology

Abbas Schwarz; Abhinav Gupta; Anthea Monod; Jay Ambadkar; Kamillo Ferry; Kushal Kasivel; Naman Malhotra

arxiv: 2606.19542 · v1 · pith:EZ47XM3Cnew · submitted 2026-06-17 · 💻 cs.LG

Pith reviewed 2026-06-26 21:01 UTC · model grok-4.3

classification 💻 cs.LG

keywords persistent homology · large language models · alignment · fine-tuning · representation dynamics · activation spaces · topological data analysis

The pith

Most topological reorganization in LLM activation spaces occurs during the earliest stages of alignment fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper applies persistent homology to track how the shape of activation spaces changes as transformer language models undergo supervised fine-tuning for alignment. Across four models from 1B to 7B parameters and three different alignment objectives, the analysis shows a transient peak in topological activity early on, followed by rapid stabilization. The method also separates the topological trajectories produced by distinct objectives and highlights differences between pretrained and instruction-tuned starting points. A reader would care because the geometric lens reveals internal representation shifts that behavioral metrics alone do not capture.

Core claim

Computing persistent homology on activation spaces throughout fine-tuning shows that the majority of topological reorganization occurs in the earliest stages of training. Dense checkpointing reveals a transient peak in topological activity followed by rapid stabilization. Different alignment objectives produce distinguishable topological trajectories, and instruction-tuned models evolve differently from their pretrained counterparts.
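One way to make the "transient peak" concrete is to measure how far the persistence diagram moves between consecutive checkpoints. The sketch below is an illustration under stated assumptions, not the authors' pipeline: the function names `wasserstein_distance` and `wasserstein_velocity`, the L-infinity ground metric, the order p = 1, the peak normalization, and the toy diagrams are all choices made here. It uses NumPy and SciPy's `linear_sum_assignment` to solve the matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def wasserstein_distance(dgm_a, dgm_b, p=1):
    """p-Wasserstein distance between two persistence diagrams.

    Diagrams are sequences of (birth, death) pairs. Points may be matched to
    each other or to the diagonal; the ground metric is L-infinity (assumed).
    """
    a = np.asarray(dgm_a, dtype=float).reshape(-1, 2)
    b = np.asarray(dgm_b, dtype=float).reshape(-1, 2)
    m, n = len(a), len(b)
    if m + n == 0:
        return 0.0
    cost = np.zeros((m + n, m + n))
    # Point-to-point costs.
    cost[:m, :n] = np.abs(a[:, None, :] - b[None, :, :]).max(axis=-1) ** p
    # Cost of sending a point of A to the diagonal.
    cost[:m, n:] = (((a[:, 1] - a[:, 0]) / 2) ** p)[:, None]
    # Cost of sending a point of B to the diagonal.
    cost[m:, :n] = (((b[:, 1] - b[:, 0]) / 2) ** p)[None, :]
    # Diagonal-to-diagonal matches stay at zero cost.
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].sum() ** (1.0 / p))


def wasserstein_velocity(diagrams, p=1):
    """Distance between each pair of consecutive checkpoint diagrams."""
    return np.array([
        wasserstein_distance(diagrams[t], diagrams[t + 1], p)
        for t in range(len(diagrams) - 1)
    ])


# Invented toy diagrams for three checkpoints (not data from the paper).
diagrams = [
    [(0.0, 1.0), (0.0, 2.0)],
    [(0.0, 1.5), (0.0, 2.0)],
    [(0.0, 1.5), (0.0, 2.1)],
]
velocity = wasserstein_velocity(diagrams)
normalized = velocity / velocity.max()  # peak normalization (assumed)
print(velocity, int(np.argmax(normalized)))
```

In this toy series the largest jump is the first one, which is the kind of early-peaked profile the core claim describes.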

What carries the argument

Persistent homology applied to activation spaces to extract topological features such as connected components and cycles across training checkpoints.
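For readers unfamiliar with the connected-component part of this machinery, the following is a minimal sketch of H0 persistence on a point cloud. It assumes a Vietoris-Rips filtration with Euclidean distance, in which every point is born at scale 0 and a component dies at the length of the edge that merges it into another one; the helper name `h0_persistence` and the four-point cloud are hypothetical. Cycles (H1) need a dedicated persistent-homology library and are not covered here.

```python
import numpy as np


def h0_persistence(points):
    """Finite H0 death times of a Vietoris-Rips filtration, in merge order.

    Equivalent to the edge weights of a minimum spanning tree, found here
    with Kruskal's algorithm and a union-find structure.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # All edges (i < j), processed from shortest to longest.
    i_idx, j_idx = np.triu_indices(n, k=1)
    order = np.argsort(dist[i_idx, j_idx], kind="stable")

    parent = list(range(n))

    def find(x):
        # Path-halving find.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for e in order:
        a, b = find(int(i_idx[e])), find(int(j_idx[e]))
        if a != b:
            parent[a] = b  # two components merge; one of them dies here
            deaths.append(dist[i_idx[e], j_idx[e]])
            if len(deaths) == n - 1:
                break
    return np.array(deaths)


# Hypothetical "activation cloud": two tight clusters far apart.
cloud = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.0, 0.1]])
deaths = h0_persistence(cloud)
print(deaths)
```

The two short bars correspond to points merging inside each cluster, and the single long bar to the two clusters finally joining; the one component that never dies is omitted.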

If this is right

Alignment training can be divided into an early high-activity phase and a later stable phase based on topological measures.
Different alignment objectives leave distinct topological signatures in the activation spaces.
Instruction-tuned models follow qualitatively different topological paths than purely pretrained models.
Behavioral metrics may miss representation-level stabilization that occurs after the initial peak.
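The second point above, that objectives leave distinct signatures, is summarized in the paper's figures with Hedges' g and bootstrap confidence intervals. A minimal sketch of that statistic follows, assuming two independent samples of one topological feature and a percentile bootstrap; the function names and the synthetic samples are illustrative and do not come from the paper.

```python
import numpy as np


def hedges_g(x, y):
    """Bias-corrected standardized mean difference between two samples."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    d = (x.mean() - y.mean()) / np.sqrt(pooled_var)
    # Small-sample correction factor (standard approximation).
    correction = 1.0 - 3.0 / (4.0 * (nx + ny) - 9.0)
    return d * correction


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Hedges' g (resampling within each group)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    stats = [
        hedges_g(rng.choice(x, size=len(x)), rng.choice(y, size=len(y)))
        for _ in range(n_boot)
    ]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic feature values for two hypothetical objectives.
rng = np.random.default_rng(1)
feature_a = rng.normal(1.0, 1.0, size=30)
feature_b = rng.normal(0.0, 1.0, size=30)
g = hedges_g(feature_a, feature_b)
lo, hi = bootstrap_ci(feature_a, feature_b)
print(round(g, 2), (round(lo, 2), round(hi, 2)))
```

Treating subsamples from one run as independent would inflate significance, which is the pseudoreplication concern one of the figure captions raises; this sketch does not address it.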

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The early stabilization suggests alignment may largely fix core representational structure after a short initial period.
Persistent homology could be tested as a monitoring tool to decide when fine-tuning has reached diminishing returns on representation change.
The approach might extend to comparing topological evolution under reinforcement learning from human feedback versus supervised fine-tuning.
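The monitoring idea in the second point could be prototyped as a simple plateau rule on a topological velocity series. The sketch below is a hypothetical rule, not something the paper proposes: `stabilization_step`, the 10% relative threshold, and the patience of three checkpoints are all assumptions, and the velocity values are invented.

```python
import numpy as np


def stabilization_step(velocity, rel_tol=0.1, patience=3):
    """First index after the peak where velocity stays below rel_tol * peak
    for `patience` consecutive checkpoints; None if that never happens."""
    v = np.asarray(velocity, dtype=float)
    peak = int(v.argmax())
    threshold = rel_tol * v[peak]
    run = 0
    for t in range(peak + 1, len(v)):
        run = run + 1 if v[t] < threshold else 0
        if run == patience:
            return t - patience + 1
    return None


# Invented velocity series: early burst, brief rebound, then a plateau.
velocity = [0.2, 1.0, 0.6, 0.3, 0.08, 0.05, 0.12, 0.04, 0.03, 0.02]
step = stabilization_step(velocity)
print(step)
```

Requiring several consecutive quiet checkpoints keeps a single noisy dip (indices 4 and 5 here, followed by a rebound) from being read as stabilization.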

Load-bearing premise

The topological features extracted from activation spaces reflect changes driven by the alignment process rather than incidental effects of model architecture or training geometry.

What would settle it

Running the same persistent-homology analysis on a new set of models and observing continued topological activity throughout the full fine-tuning schedule instead of early stabilization would contradict the central observation.
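The figure captions describe a related check: comparing the observed early concentration of topological movement against randomly shuffled checkpoint orders. A minimal sketch of such a permutation test follows. It assumes a precomputed matrix of pairwise diagram distances between checkpoints and defines the concentration as the share of total movement in the first third of steps; the function names, the synthetic distance matrix, and the exact definition of the statistic are assumptions rather than the paper's implementation.

```python
import numpy as np


def early_concentration(dist, order, early_frac=1 / 3):
    """Share of total consecutive-step movement that falls in the early window."""
    order = np.asarray(order)
    steps = dist[order[:-1], order[1:]]
    n_early = max(1, int(np.ceil(early_frac * len(steps))))
    total = steps.sum()
    return float(steps[:n_early].sum() / total) if total > 0 else 0.0


def permutation_test(dist, n_perm=2000, early_frac=1 / 3, seed=0):
    """Compare the observed concentration with shuffled checkpoint orders."""
    rng = np.random.default_rng(seed)
    n = len(dist)
    observed = early_concentration(dist, np.arange(n), early_frac)
    null = np.array([
        early_concentration(dist, rng.permutation(n), early_frac)
        for _ in range(n_perm)
    ])
    # One-sided p-value with the usual +1 correction.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, float(p_value)


# Synthetic trajectory that moves fast early and then saturates.
t = np.arange(13)
position = 1.0 - np.exp(-t / 2.0)
dist = np.abs(position[:, None] - position[None, :])

observed, null, p_value = permutation_test(dist)
print(round(observed, 3), round(float(null.mean()), 3), p_value)
```

If topological activity continued throughout training, the observed concentration would sit inside the shuffled distribution instead of in its upper tail.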

Figures

Figures reproduced from arXiv: 2606.19542 by Abbas Schwarz, Abhinav Gupta, Anthea Monod, Jay Ambadkar, Kamillo Ferry, Kushal Kasivel, Naman Malhotra.

**Figure 2.** The early transient topological peak (RQ1). Dense early-window H0 (left) and H1 (right) Wasserstein velocity, normalized per model so the peak position is comparable (magnitudes are scale-dependent). The reorganization is a transient rise-peak-decay whose peak moves earlier with model size. From Section 5, Results and Discussion: unless otherwise noted, all analyses use the second-to-last probed layer on the harmful eval…

**Figure 3.** Objective Separation (RQ2). Left: Per-feature effect sizes (Hedges’ g, 95% bootstrap CI) for the most-separated pair (Gemma); a few H1-persistence features carry |g| ∼ 1–2. Right: Objective displacement-direction cosines (model × pair) are all strongly positive; the objectives move the topology the same way, separating in magnitude. … appears to play a stronger role than model scale in determining whether al…

**Figure 4.** Behavior and Objective Ordering (RQ3, RQ2). Left: Refusal on harmful prompts over fine-tuning. All three base models stay on the ≈ 2–3% noise floor (shaded); only the instruction-tuned Phi-3 leaves it, reaching 8–40%, and there the objectives separate. Right: Base-to-final change in H0 persistence entropy (headline layer, bootstrap 95% CI): base models compress (below zero, harmless most), Phi-3 enriches d…

**Figure 5.** Gemma-3-1B persistence diagram (top) and barcode (bottom), harmless run, layer 20. Fine-tuning mildly shortens and thins the bars.

**Figure 6.** Permuting the checkpoint order removes the early peak. For each dense model (harmless run), the share of total H1 Wasserstein movement at each step under the observed checkpoint order (red, concentrated in the shaded first-third window) versus 1000 random checkpoint orders (grey median, 5–95% band, and a few individual traces).

**Figure 7.** The early shock against a checkpoint-order permutation null (§5.1). For each (model, objective), the null distribution of the early-velocity concentration C (H1 Wasserstein) under randomly shuffled checkpoint order (grey, 2000 permutations), with the observed C as a colored line (shaded tail) and its permutation p; ∗ p<0.05.

**Figure 8.** Representation leads function (Phi-3, dense). H1 Wasserstein velocity (topology, purple) and JS velocity of the output distribution (function, black) on a shared checkpoint axis. Both peak in the first 5–15 steps with topology marginally first (dashed vs. dotted markers). Topology then retains a longer tail. Phi-3 is the model whose refusal genuinely moves, so this is the sharpest available timing test.

**Figure 9.** Gemma-3-1B representation (H1 Wasserstein) vs. function (JS) velocity companion to…

**Figure 10.** PH vs. geometry (Gemma, harmless, dense). Normalized velocity of the H1 Wasserstein metric against non-topological baselines on the same clouds (§5.4). All rise in the early burst, but centroid drift and variance collapse to floor by step ∼35 while PH stays elevated; PH is therefore not redundant with first/second moments.

**Figure 11.** Phi-3 geometric-baseline ablation (companion to Fig.…

**Figure 12.** Cell-level energy distance between objectives per pair, per model (standardized, 95% bootstrap…

**Figure 13.** Pseudoreplication inflates significance. Subsample-level permutation…

**Figure 14.** Per-feature cell-level effect sizes (Hedges’…

**Figure 15.** Per-model objective displacement-direction cosine matrices; every off-diagonal entry is strongly…

**Figure 16.** Per-feature reorganisation across the four models (harmless run, harmful eval), each column…

**Figure 17.** Dense early-window per-feature peak step (41 features…

**Figure 18.** Per-model dense reorganisation maps (harmless run): TinyLlama-1.1B (top left), Gemma-3-…

**Figure 19.** Seed replication (TinyLlama-1.1B). Top: early-window H1 velocity overlaid across three independent seeds; the burst follows nearly the same curve for every objective and seed. Bottom: the observed objective-separation U falls within the trajectory-permutation null, so objectives are not separated during the burst.

**Figure 20.** Reorganization maps for the two added TinyLlama seeds (harmless objective), matching the…

**Figure 21.** Seed replication (Gemma-3-1B). Top: early-window H1 velocity overlaid across three independent seeds; the burst follows nearly the same curve for every objective and seed. Bottom: the observed objective-separation U falls within the trajectory-permutation null, so objectives are not separated during the burst.

**Figure 22.** Reorganization maps for the two added Gemma seeds (harmless objective), reproducing the…

**Figure 23.** Seed replication (Mistral-7B). Top: early-window H1 velocity overlaid across three independent seeds; the burst follows nearly the same curve for every objective and seed. Bottom: the observed objective-separation U falls within the trajectory-permutation null, so objectives are not separated during the burst.

**Figure 24.** Reorganization maps for the two added Mistral seeds (harmless objective), matching the early…

**Figure 25.** Persistence landscapes: pedagogical construction (top) and the three objectives’ mean…

**Figure 26.** Holm-significant fraction of landscape comparisons, last-token vs. token-level pooling (raw counts annotated). Pooling the last 16 tokens collapses the objective signal (Phi-3 12/30→0/30, Mistral 10/30→1/30), so the discriminating structure sits at the decision token. (TinyLlama and Gemma have no token-level variant.)

**Figure 27.** Topology (left axis) vs. refusal rate (right axis) over training, harmless run: TinyLlama (base,…

**Figure 28.** Left: a 5-fold CV probe on the 41-vector becomes run-discriminative by step ∼50–100 even where refusal is flat (hidden progress). Right: saturating-exponential fit to H0 persistence entropy…

**Figure 29.** The subsample-level Holm-adjusted landscape-permutation heatmap (Gemma,…
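Several captions above refer to H0 persistence entropy. As a reference point, here is a minimal sketch of the usual definition: the Shannon entropy of bar lifetimes normalized to sum to one. The natural logarithm, the handling of infinite bars, and the toy diagrams are assumptions; the paper's exact conventions are not visible in this material.

```python
import numpy as np


def persistence_entropy(diagram):
    """Shannon entropy of normalized bar lifetimes (infinite bars ignored)."""
    d = np.asarray(diagram, dtype=float).reshape(-1, 2)
    life = d[:, 1] - d[:, 0]
    life = life[np.isfinite(life) & (life > 0)]
    if life.size == 0:
        return 0.0
    prob = life / life.sum()
    return float(-(prob * np.log(prob)).sum())


# Toy diagrams: four equal bars versus one dominant bar plus three short ones.
uniform = [(0.0, 1.0)] * 4
dominant = [(0.0, 10.0), (0.0, 0.1), (0.0, 0.1), (0.0, 0.1)]

pe_uniform = persistence_entropy(uniform)
pe_dominant = persistence_entropy(dominant)
delta = pe_dominant - pe_uniform  # negative: lifetimes concentrate in fewer bars
print(round(pe_uniform, 4), round(pe_dominant, 4), round(delta, 4))
```

A base-to-final change below zero, as in the "compress" reading of Figure 4, means the lifetime mass has concentrated in fewer bars.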

Original abstract

Large language models are commonly aligned through supervised fine-tuning, yet little is known about how their internal representations evolve during this process. We study alignment dynamics using persistent homology by tracking the topology of activation spaces throughout fine-tuning. Across four transformer language models ranging from 1B to 7B parameters and three alignment objectives corresponding to helpful, harmless, and mixed training data, we find that the majority of topological reorganization occurs during the earliest stages of training. A dense checkpoint analysis reveals a transient peak in topological activity followed by rapid stabilization. We further show that different alignment objectives induce distinguishable topological trajectories, while instruction-tuned and pretrained models exhibit qualitatively different patterns of evolution. Our results suggest that persistent homology provides a complementary perspective on alignment, revealing representation-level changes that are not apparent from behavioral metrics alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Persistent homology on activation spaces during alignment fine-tuning shows early transient peaks and objective-specific paths, but without controls it is unclear if these are alignment-driven or generic to any fine-tuning.

The paper's core observation is that persistent homology applied to activation point clouds during supervised fine-tuning on alignment data registers most topological change in the first few steps, followed by quick stabilization, and that helpful, harmless, and mixed objectives produce visibly different trajectories. Pretrained and already-instruction-tuned models also diverge in their patterns.

What is new here is the direct use of persistent homology to monitor the evolving topology of hidden-state distributions across an alignment run. Prior work has used TDA on static model representations or on other domains, but tracking it checkpoint-by-checkpoint through fine-tuning on multiple alignment objectives appears fresh. The multi-model (1B–7B) and multi-objective design gives the results some breadth, and the claim that behavioral metrics miss these representation-level shifts is plausible on its face.

The main weakness is the missing controls. The abstract and stress-test note give no indication of runs on continued pretraining, non-instruction data, or shuffled labels that would show whether the early peak and stabilization are specific to alignment objectives or simply what happens under gradient descent on transformer activations. Without that separation, the link to alignment dynamics stays observational rather than diagnostic. Methodological details on how the point clouds are sampled, which layers are used, the choice of filtration and distance, and robustness to those choices are also not visible in the provided material, which makes it hard to judge whether the reported patterns would survive reasonable variations.

This is for people already working on topological methods in interpretability or on fine-grained analysis of alignment runs. A reader hunting for new measurement tools might extract ideas, but anyone needing evidence that the topology tracks safety-relevant changes would find the current scope too narrow.

I would send it to referees so the authors can add the necessary controls and fill in the implementation details; the basic experimental setup is straightforward enough that a serious review could quickly clarify its value.

Referee Report

1 major / 0 minor

Summary. The paper applies persistent homology to activation spaces of transformer LLMs during supervised fine-tuning for alignment. Across four models (1B–7B parameters) and three objectives (helpful, harmless, mixed), it reports that the majority of topological reorganization occurs in the earliest training stages, with a transient peak in activity followed by rapid stabilization. Different alignment objectives produce distinguishable topological trajectories, and instruction-tuned models differ qualitatively from pretrained ones. The work positions persistent homology as a complementary lens on alignment dynamics beyond behavioral metrics.

Significance. If the central observational patterns hold under appropriate controls, the work would be significant for introducing topological data analysis to the study of LLM representation evolution during alignment. The multi-model, multi-objective design and emphasis on early-training dynamics provide a concrete starting point for further investigation of representation-level changes.

major comments (1)

[Abstract] The central claim that the observed early transient peak and stabilization reflect alignment-specific dynamics requires evidence that these topological signatures are not generic consequences of gradient descent on transformer activations. The manuscript does not report control experiments (continued pretraining on non-instruction data, random-label fine-tuning, or non-alignment distributions) that would isolate the effect of the alignment objective; this absence is load-bearing for interpreting the results as alignment-driven rather than architecture- or optimization-driven.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the opportunity to clarify our work. Below we respond directly to the major comment.

Referee: [Abstract] The central claim that the observed early transient peak and stabilization reflect alignment-specific dynamics requires evidence that these topological signatures are not generic consequences of gradient descent on transformer activations. The manuscript does not report control experiments (continued pretraining on non-instruction data, random-label fine-tuning, or non-alignment distributions) that would isolate the effect of the alignment objective; this absence is load-bearing for interpreting the results as alignment-driven rather than architecture- or optimization-driven.

Authors: We agree that the manuscript lacks explicit control experiments of the form suggested (continued pretraining on non-instruction data, random-label fine-tuning, or non-alignment distributions), and that this limits the strength of claims that the observed topological signatures are driven specifically by alignment objectives rather than by generic properties of gradient descent on transformer activations. Our existing design does compare trajectories across three distinct alignment objectives and between pretrained and instruction-tuned models, which produces distinguishable patterns; however, these comparisons do not fully isolate alignment from other optimization effects. We will revise the manuscript to (1) qualify the abstract and discussion language to avoid implying exclusivity to alignment, (2) add an explicit limitations subsection that states the absence of the recommended controls, and (3) outline how such controls could be implemented in follow-up work. We view this as a substantive but addressable limitation rather than a fatal flaw, given the multi-objective and multi-model scope already present.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational measurements of topological features

The paper reports direct empirical observations obtained by computing persistent homology on activation point clouds at successive fine-tuning checkpoints across multiple models and objectives. No equations, fitted parameters, or predictions are presented that reduce a claimed quantity to a quantity defined from the same data. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the central claims. The reported transient peak and stabilization are measurements, not derivations that collapse by construction. This is the expected non-finding for an observational study whose load-bearing steps are external data collection and standard topological computation rather than internal algebraic reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. Persistent homology itself is treated as a standard tool whose assumptions are not unpacked here.

Reference graph

Works this paper leans on

15 extracted references · 7 linked inside Pith

[1] Abdin, Marah, Aneja, Jyoti, Awadalla, Hany, Awadallah, Ahmed, Awan, Ammar Ahmad, Bach, Nguyen, Bahree, Amit, Bakhtiari, Arash, Bao, Jianmin, Behl, Harkirat, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. URL https://arxiv.org/abs/2404.14219, 2(6).
[2] Refusal in Language Models Is Mediated by a Single Direction. arXiv preprint arXiv:2406.04093. Bai, Yuntao, Jones, Andy, Ndousse, Kamal, Askell, Amanda, Chen, Anna, DasSarma, Nova, Drain, Dawn, Fort, Stanislav, Ganguli, Deep, Henighan, Tom, Joseph, Nicholas, Kadavath, Saurav, Kernion, Jackson, Conerly, Tom, El-Showk, Sheer, Elhage, Nelson, Hatfield-Dodds,...
[3] Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862. Barak, Boaz, Edelman, Benjamin L., Goel, Surbhi, Kakade, Sham, Malach, Eran, & Zhang, Cyril
[4] Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2207.08799. Bauer, Ulrich
[5] The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology. In: International Conference on Learning Representations (ICLR). Gardinazzi, Yuri, et al. 2024. Persistent Topological Features in Large Language Models. arXiv preprint arXiv:2410.11042. Hu, Edward J, Shen, Yelong, Wallis, Phillip, Allen-Zhu, Zeyuan, Li, Yuanzhi, W...
[6] Mechanistically Analyzing the Effects of Fine-Tuning on Procedurally Defined Tasks. In: International Conference on Learning Representations (ICLR). arXiv:2311.12786. Jiang, Albert Q., Sablayrolles, Alexandre, Mensch, Arthur, Bamford, Chris, Chaplot, Devendra Singh, de las Casas, Diego, Bressand, Florian, Lengyel, Gianna, Lample, Guillaume, Saulnier, Luc...
[7] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. In: International Conference on Machine Learning (ICML). arXiv:2401.01967. Mesnard, Thomas, Hardin, Cassidy, Dadashi, Robert, Bhupatiraju, Surya, Pathak, Shreya, Sifre, Laurent, Rivière, Morgane, Kale, Mihir Sanjay, Love, Juliette, Tafti, Pouya, Hussenot, Léonard, Ses...
[8] Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295. Moor, Michael, Horn, Max, Rieck, Bastian, & Borgwardt, Karsten
[9] Progress Measures for Grokking via Mechanistic Interpretability. In: International Conference on Learning Representations (ICLR). Ouyang, Long, Wu, Jeffrey, Jiang, Xu, Almeida, Diogo, Wainwright, Carroll, Mishkin, Pamela, Zhang, Chong, Agarwal, Sandhini, Slama, Katarina, Ray, Alex, et al. 2022. Training language models to follow instructions with human fee...
[10] Topological Signatures of Grokking. arXiv preprint arXiv:2605.06352. Zhang, Peiyuan, Zeng, Guangtao, Wang, Tianduo, & Lu, Wei
[11] TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385. Zhou, Chunting, Liu, Pengfei, Xu, Puxin, Iyer, Srinivasan, Sun, Jiao, Mao, Yuning, Ma, Xuezhe, Efrat, Avia, Yu, Ping, Yu, Lili, Zhang, Susan, Ghosh, Gargi, Lewis, Mike, Zettlemoyer, Luke, & Levy, Omer
[12] LIMA: Less Is More for Alignment. In: Advances in Neural Information Processing Systems (NeurIPS). arXiv:2305.11206. A Supplementary Results. A.1 Robustness Checks and Controls. We performed a series of controls to verify that the observed topological phenomena are not artifacts of checkpoint ordering, simple geometric statistics, token selection, or the u...
[13] Model training and extraction use the Metal Performance Shaders (MPS) backend through PyTorch with bfloat16 precision and scaled-dot-product attention (sdpa). Device selection follows the hierarchy mps > cuda > cpu, allowing the same codebase to execute unchanged on CUDA-enabled systems. Software. The training and extraction pipeline uses PyTorch, Transformers, ...
[14] All objectives use identical hyperparameters and random seeds. The 1B models are trained for 300 steps on 3000 examples with checkpoints every 50 steps. The larger models are trained for 400 steps on 6000 examples with checkpoints every 100 steps. To resolve the earliest stages of alignment, we additionally perform a dense early-window analysis that check...
[15] Fine-tuning mildly shortens and thins the bars. Figure 6: Permuting the checkpoint order removes the early peak. For each dense model (harmless run), the share of total H1 Wasserstein movement at each step under the observed checkpoint order (red, concentrated in the shaded first-third window) versus 1000 random checkpoint orders (grey median, 5–95% band, ...
