pith. sign in

arxiv: 2606.09850 · v1 · pith:YELJ2MS2new · submitted 2026-05-09 · 💻 cs.LG · cs.CL

Mechanistic Analysis of Alignment Algorithms in Language Models

Pith reviewed 2026-06-30 23:04 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords preference optimizationmechanistic interpretabilityalignment algorithmslanguage modelsrepresentational geometrysparse autoencoderslinear probing
0
0 comments X

The pith

Preference signals concentrate in early-mid or mid-late layers of language models, but alignment methods produce qualitatively different geometric shifts in those representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines six preference optimization methods inside three families of open-weight models to see how each one changes the internal geometry of preference-related features. It finds that all methods place the relevant signals in similar depth ranges, yet KTO and GRPO increase linear separability by sharing features constructively and recruiting sparse high-salience directions, while DPO and ORPO decrease separability through rotation and attenuation of those directions. PPO and SimPO leave the baseline geometry largely intact. The work therefore treats alignment as a heterogeneous intervention whose internal effects cannot be read off from behavioral metrics alone.

Core claim

Preference representations localize consistently to early-mid or mid-late layers across methods, but the six objectives induce distinct transformations: KTO and GRPO enhance separability via constructive feature sharing and sparse high-salience recruitment; DPO and ORPO degrade separability via non-constructive rotation and feature attenuation; PPO and SimPO largely preserve baseline geometry. These effects vary by architecture.

What carries the argument

Layer-wise linear probing combined with Sparse Autoencoders and crosscoders to localize preference representations and measure alignment-induced changes in latent-space geometry.

If this is right

  • Behavioral alignment scores do not imply uniform internal restructuring across methods.
  • KTO and GRPO produce more linearly separable preference features than DPO or ORPO.
  • PPO and SimPO leave the original representation geometry largely unchanged.
  • Architecture choice modulates how strongly each objective alters internal representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Feature-level auditing could become a standard check before deploying an aligned model.
  • Future objectives might be designed explicitly to recruit the constructive patterns observed in KTO and GRPO.
  • The concentration of signals in a narrow band of layers suggests that targeted interventions at those depths could be more efficient than full-model updates.

Load-bearing premise

Layer-wise linear probing, Sparse Autoencoders, and crosscoders accurately localize and quantify preference representations without substantial distortion from probe choice or dictionary-learning hyperparameters.

What would settle it

A controlled experiment in which a new model is aligned with DPO and then probed with an independent dictionary-learning run; if the same rotation and attenuation pattern fails to appear, the reported geometric distinction is not robust.

Figures

Figures reproduced from arXiv: 2606.09850 by Aarush Sinha, Arth Singh, Ishan Garg, Kushal Garg, Veeraraju Elluru.

Figure 1
Figure 1. Figure 1: Overall pipeline of our framework consisting of two phases. In Phase 1, we select pre-trained [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Layer-wise linear probe metrics across models: [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of anchor feature activations for Llama-3.2-3B (Feature 15132) across align [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of anchor feature activations for Qwen3-4B (Feature 6910) across alignment [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Distribution of anchor feature activations for SmolLM3-3B (Feature 10154) across align [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Crosscoder Feature Distribution. Feature variance under KTO. The KTO method exhibits consistently higher variance in anchor feature activations across the model families. The boxplots for KTO show larger interquartile ranges and longer whiskers compared to other methods. This indicates that KTO introduces greater dependence on the context, their representations thereof, despite maintaining similar median a… view at source ↗
Figure 7
Figure 7. Figure 7: Combined Receiver Operating Characteristic (ROC) curves at the peak probe layer, one [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Layer-wise ℓ2 norm of the linear probe weight vector for each base model and AFT method (mean over 5-fold cross-validation at each layer). Next, Figures 9–11 show the PCA projections of the residual stream’s activations (at “best” layer). 22 [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: SmolLM3-3B. PCA of residual-stream activations at each method’s best probe layer.. −30 −20 −10 0 10 20 30 t-SNE 1 −30 −20 −10 0 10 20 30 t-SNE 2 −30 −20 −10 0 10 20 30 t-SNE 3 Rejected Chosen (i) DPO −30 −20 −10 0 10 20 30 t-SNE 1 −30 −20 −10 0 10 20 30 t-SNE 2 −30 −20 −10 0 10 20 30 t-SNE 3 Rejected Chosen (ii) GRPO −15 −10 −5 0 5 10 15 t-SNE 1 −15 −10 −5 0 5 10 15 t-SNE 2 −15 −10 −5 0 5 10 15 t-SNE 3 Rej… view at source ↗
Figure 10
Figure 10. Figure 10: Llama-3.2-3B. PCA of residual-stream activations at each method’s best probe layer.. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qwen3-4B. PCA of residual-stream activations at each method’s best probe layer. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: SAE reconstruction error (MSE) over training steps for each (base model, alignment) [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Comprehensive SAE reconstruction and preservation metrics. Across almost all metrics, [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗
read the original abstract

Post-training alignment algorithms are predominantly evaluated as black boxes, obscuring how they reshape language models' internal computations. We present a systematic mechanistic analysis of six preference-optimization methods: PPO, DPO, SimPO, ORPO, GRPO, and KTO across three open-weight model families. By integrating layer-wise linear probing, Sparse Autoencoders, and crosscoders, we localize preference representations and quantify alignment-induced geometric transformations in latent space. We find that preference signals consistently concentrate in early--mid or mid--late layers, but different objectives induce qualitatively distinct representational shifts. KTO and GRPO enhance linear separability through constructive feature sharing and sparse, high-salience recruitment. In contrast, DPO and ORPO degrade separability via non-constructive geometric rotation and feature attenuation, while PPO and SimPO largely preserve baseline geometry. These transformations exhibit architecture-dependent variability, demonstrating that behavioral alignment does not imply uniform internal restructuring. Our findings establish alignment as a heterogeneous intervention, motivate standardized feature-level auditing for safety and interpretability, and highlight the need for mechanism-aware optimization objectives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to mechanistically analyze six preference optimization methods (PPO, DPO, SimPO, ORPO, GRPO, KTO) on three model families via layer-wise linear probing, Sparse Autoencoders, and crosscoders. It reports that preference signals concentrate in early-mid or mid-late layers, with KTO/GRPO producing constructive feature sharing and sparse high-salience recruitment that enhances linear separability, DPO/ORPO producing non-constructive geometric rotation and feature attenuation that degrades separability, and PPO/SimPO largely preserving baseline geometry. These shifts are architecture-dependent, establishing alignment as a heterogeneous intervention.

Significance. If the geometric distinctions hold under validation, the work would usefully demonstrate that behavioral alignment does not imply uniform internal restructuring and would motivate mechanism-aware objectives plus standardized feature auditing. The multi-tool integration (probing + SAEs + crosscoders) is a positive step toward localizing representations, but the lack of controls on those tools prevents the result from being load-bearing.

major comments (2)
  1. [Abstract] Abstract (and implied Methods): the central claim of qualitatively distinct shifts—KTO/GRPO 'constructive' vs. DPO/ORPO 'non-constructive'—is quantified exclusively via layer-wise linear probes, SAEs, and crosscoders. No ablations are described on SAE sparsity, dictionary size, probe regularization strength, or crosscoder training objectives. This is load-bearing because systematic bias in any of these choices could artifactually generate the reported separability changes and geometric labels rather than reflect model properties.
  2. [Abstract] Abstract: the statement that 'preference signals consistently concentrate in early–mid or mid–late layers' is presented without statistical controls, baseline comparisons, error bars, or details on how post-hoc layer selections were performed. This undermines the cross-method and cross-architecture generality asserted in the final sentence.
minor comments (1)
  1. [Abstract] The abstract refers to 'architecture-dependent variability' but provides no quantification or identification of the specific architectures involved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. The comments correctly identify areas where additional methodological controls and statistical details would strengthen the manuscript. We address each major comment below and outline specific revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract (and implied Methods): the central claim of qualitatively distinct shifts—KTO/GRPO 'constructive' vs. DPO/ORPO 'non-constructive'—is quantified exclusively via layer-wise linear probes, SAEs, and crosscoders. No ablations are described on SAE sparsity, dictionary size, probe regularization strength, or crosscoder training objectives. This is load-bearing because systematic bias in any of these choices could artifactually generate the reported separability changes and geometric labels rather than reflect model properties.

    Authors: We agree that the absence of explicit ablations on these hyperparameters represents a limitation in the current manuscript. While the convergence of results across three distinct analysis tools and three model families offers some protection against isolated artifacts, this does not substitute for targeted sensitivity checks. In the revised manuscript we will add a dedicated ablation subsection reporting: (i) SAE results across a range of L1 coefficients and dictionary sizes, (ii) linear probe performance under varying L2 regularization strengths, and (iii) crosscoder variants with altered training objectives. These will be evaluated against the same separability and geometric metrics used in the main results. revision: yes

  2. Referee: [Abstract] Abstract: the statement that 'preference signals consistently concentrate in early–mid or mid–late layers' is presented without statistical controls, baseline comparisons, error bars, or details on how post-hoc layer selections were performed. This undermines the cross-method and cross-architecture generality asserted in the final sentence.

    Authors: The observed layer concentrations were identified by inspecting where preference-related metrics (probe accuracy and feature activation) exceeded baseline levels in each model family. We acknowledge that the manuscript currently lacks error bars, formal baseline comparisons to unaligned checkpoints, and explicit selection criteria. In revision we will (a) report mean and standard deviation across random seeds for all layer-wise curves, (b) include unaligned model baselines, and (c) specify the quantitative rule used to label layers as 'early–mid' or 'mid–late' (e.g., first layer where metric exceeds 1.5× random baseline and remains elevated for at least two consecutive layers). These additions will appear in both the Methods and Results sections. revision: yes

Circularity Check

0 steps flagged

No circularity; purely observational empirical analysis with no derivations or self-referential steps.

full rationale

The paper reports experimental observations obtained by applying layer-wise linear probes, SAEs, and crosscoders to measure representational changes after alignment. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims are presented as direct outcomes of the chosen analysis pipeline rather than quantities that reduce to their own inputs by construction. This is the expected non-finding for an empirical mechanistic study without mathematical claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters or invented entities. The analysis implicitly assumes the chosen interpretability tools faithfully capture preference geometry.

axioms (1)
  • domain assumption Linear probes and sparse autoencoders faithfully recover preference-related directions in activation space
    The localization and separability claims rest on these tools without reported validation against alternative methods.

pith-pipeline@v0.9.1-grok · 5725 in / 1107 out tokens · 21744 ms · 2026-06-30T23:04:38.418067+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 8 canonical work pages · 6 internal anchors

  1. [1]

    Understanding intermediate layers using linear classifier probes, 2018

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2018

  2. [2]

    Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues.arXiv preprint arXiv:2402.14762, 2024

    Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues.arXiv preprint arXiv:2402.14762, 2024

  3. [3]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  4. [4]

    Michiel A. Bakker, Martin J Chadwick, Hannah Sheahan, Michael Henry Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese, Amelia Glaese, John Aslanides, Matthew Botvinick, and Christopher Summerfield. Fine-tuning language models to find agreement among humans with diverse preferences. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyu...

  5. [5]

    SmolLM3: smol, multilingual, long-context reasoner.https://huggingface.co/blog/smollm3, 2025

    Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, Car- los Miguel Patiño, Edward Beeching, Aymeric Roucher, Aksel Joonas Reedi, Quentin Gal- louédec, Kashif Rasul, Nathan Habib, Clémentine Fourrier, Hynek Kydlicek, Guilherme Penedo, Hugo Larcher, Mathieu Morlon, Vaibhav Srivastav, Joshua Lochner, Xuan-Son Nguyen, Colin Raffel, ...

  6. [6]

    Bronstein, and Xiaowen Dong

    Hugh Blayney, Álvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Michael M. Bronstein, and Xiaowen Dong. A mechanistic analysis of looped reasoning language models, 2026

  7. [7]

    Towards monosemanticity: Decomposing language models with dictionary learning, 2023

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Con- erly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ...

  8. [8]

    Batchtopk sparse autoencoders

    Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders. InNeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024

  9. [9]

    Deep reinforcement learning from human preferences

    Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.ArXiv, abs/1706.03741, 2017

  10. [10]

    Ultrafeedback: Boosting language models with scaled ai feedback, 2024

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with scaled ai feedback, 2024

  11. [11]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Adam Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoen- coders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600, 2023. 10

  12. [12]

    Enhancing chat language models by scaling high-quality instructional conversations, 2023

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations, 2023

  13. [13]

    A mathematical framework for transformer circuits.Transformer Circuits Thread,

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

  14. [14]

    https://transformer-circuits.pub/2021/framework/index.html

  15. [15]

    Mechanistically interpreting compression in vision-language models, 2026

    Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das, and Hreetam Paul. Mechanistically interpreting compression in vision-language models, 2026

  16. [16]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024

  17. [17]

    Transformer feed-forward layers are key-value memories, 2021

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories, 2021

  18. [18]

    Glm-5: from vibe coding to agentic engineering, 2026

    GLM-5-Team, :, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunx- iang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Z...

  19. [19]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ah- mad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

  20. [20]

    ORPO: Monolithic Preference Optimization without Reference Model

    Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Monolithic preference optimization without reference model.arXiv preprint arXiv:2403.07691, 2024

  21. [21]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021

  22. [22]

    Samyak Jain, Ekdeep Singh Lubana, Kemal Oksuz, Tom Joy, Philip Torr, Amartya Sanyal, and Puneet K. Dokania. What makes and breaks safety fine-tuning? a mechanistic study. InICML 2024 Workshop on Mechanistic Interpretability, 2024

  23. [23]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017

  24. [24]

    Style vectors for steering generative large language models

    Kai Konen, Sophie Jentzsch, Diaoulé Diallo, Peer Schütt, Oliver Bensch, Roxanne El Baff, Dominik Opitz, and Tobias Hecking. Style vectors for steering generative large language models. In Yvette Graham and Matthew Purver, editors,Findings of the Association for Computational Linguistics: EACL 2024, pages 782–802, St. Julian’s, Malta, March 2024. Associati...

  25. [25]

    Kummerfeld, and Rada Mihalcea

    Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, and Rada Mihalcea. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity, 2024

  26. [26]

    Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2, 2024

    Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2, 2024

  27. [27]

    Sparse crosscoders for cross-layer features and model diffing, October 2024

    Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christo- pher Olah. Sparse crosscoders for cross-layer features and model diffing, October 2024

  28. [28]

    Locating and editing factual associations in gpt, 2023

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt, 2023

  29. [29]

    Simpo: Simple preference optimization with a reference-free reward.arXiv preprint arXiv:2405.14734, 2024

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.arXiv preprint arXiv:2405.14734, 2024

  30. [30]

    Overcoming sparsity artifacts in crosscoders to interpret chat-tuning

    Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, and Neel Nanda. Overcoming sparsity artifacts in crosscoders to interpret chat-tuning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. 13

  31. [31]

    Insights on crosscoder model diffing, 2025

    Siddharth Mishra-Sharma, Trenton Bricken, Jack Lindsey, Adam Jermyn, Jonathan Marcus, Kelley Rivoire, Christopher Olah, and Thomas Henighan. Insights on crosscoder model diffing, 2025

  32. [32]

    Hosseini

    Ali Nasiri-Sarvi, Hassan Rivaz, and Mahdi S. Hosseini. SPARC: Concept-aligned sparse autoencoders for cross-model and cross-modal interpretability.Transactions on Machine Learning Research, 2026

  33. [33]

    Zoom in: An introduction to circuits.Distill, 2020

    Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits.Distill, 2020. https://distill.pub/2020/circuits/zoom-in

  34. [34]

    In-context learning and induction heads, 2022

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a...

  35. [35]

    Introducing gpt-5.4 | openai, 4 2026

    OpenAI. Introducing gpt-5.4 | openai, 4 2026

  36. [36]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  37. [37]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems, volume 36, 2023

  38. [38]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.ArXiv, abs/1707.06347, 2017

  39. [39]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Jun-Mei Song, Mingchuan Zhang, Y . K. Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.ArXiv, abs/2402.03300, 2024

  40. [40]

    Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul Christiano

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022

  41. [41]

    The obfuscation atlas: Mapping where honesty emerges in RLVR with deception probes, 2026

    Mohammad Taufeeque, Stefan Heimersheim, Adam Gleave, and Chris Cundy. The obfuscation atlas: Mapping where honesty emerges in RLVR with deception probes, 2026

  42. [42]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Ba...

  43. [43]

    Daniel Freeman, Theodore R

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...

  44. [44]

    Linear representations of sentiment in large language models, 2023

    Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. Linear representations of sentiment in large language models, 2023

  45. [45]

    Llama 2: Open foundation and fine-tuned chat models, 2023

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  46. [46]

    TRL: Transformers Rein- forcement Learning, 2020

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Rein- forcement Learning, 2020

  47. [47]

    Interpretability in the wild: a circuit for indirect object identification in gpt-2 small, 2022

    Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small, 2022

  48. [48]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Huggingface’s transformers: St...

  49. [49]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...