
arxiv: 2606.12360 · v1 · pith:VJM6SH5Tnew · submitted 2026-06-10 · 💻 cs.LG

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

Pith reviewed 2026-06-27 10:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords post-training · interpretability · preference data · language models · learning signal · data-centric pipeline · concept-level analysis · reward shaping

The pith

Interpretability protocols let practitioners inspect preference data at the concept level and explicitly decide which behaviors a model should learn during post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that post-training currently optimizes scalar rewards that hide what the data actually teaches, which lets spurious correlations and unwanted behaviors slip in. It proposes a pipeline that applies interpretability methods to preference datasets to surface the latent concepts distinguishing preferred from dispreferred outputs. Making these concepts explicit lets users give fine-grained feedback on what to keep or remove. If the approach works, post-training shifts from opaque reward maximization to deliberate auditing and sculpting of the learning signal itself.

Core claim

We introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, the pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and helps amplify desired properties such as safeguards and model personality.

What carries the argument

A data-centric post-training pipeline that applies interpretability protocols to preference datasets to extract statistical hypotheses about latent concepts distinguishing preferred from dispreferred generations.
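
As a rough illustration of what a "statistical hypothesis about a latent concept" could look like, the sketch below runs a two-sample permutation test on one feature's activations over preferred versus dispreferred responses. The Figure 1 caption mentions two-sample hypothesis tests, but the choice of a permutation test on the mean difference, the function name `permutation_test`, and the toy activation values are assumptions made here for illustration, not the paper's protocol.

```python
import random
from statistics import mean


def permutation_test(chosen, rejected, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    chosen / rejected: per-response activations of ONE feature
    (hypothetical values, e.g. a mean SAE feature activation per response).
    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)
    observed = mean(chosen) - mean(rejected)
    pooled = list(chosen) + list(rejected)
    n_chosen = len(chosen)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign responses to the two groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_chosen]) - mean(pooled[n_chosen:])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (assumed): a formatting-related feature fires more on chosen responses.
chosen_acts = [0.9, 1.1, 0.8, 1.2, 1.0, 0.95, 1.05, 1.15]
rejected_acts = [0.2, 0.3, 0.1, 0.25, 0.15, 0.35, 0.05, 0.3]

diff, p = permutation_test(chosen_acts, rejected_acts)
print(f"chosen-minus-rejected mean difference: {diff:.3f}, p = {p:.4f}")
```

In a real audit this test would presumably be repeated over many features, in which case some multiple-comparison correction would be needed before reading any single small p-value as a hypothesis worth acting on.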

If this is right

  • Existing preference datasets can be audited for undesirable signals before any optimization occurs.
  • Off-target learning during fine-tuning can be reduced by intervening on identified concepts.
  • Desired model properties such as safeguards or specific personality traits can be amplified through targeted data or feature changes.
  • Post-training moves from scalar reward optimization to direct sculpting of the learning signal.
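
One concrete form of "sculpting the learning signal" appears in the Figure 20 caption: cluster reward shaping, which leaves the dataset intact but adds a fixed offset +w to the DPO margin on a targeted prompt cluster. The sketch below shows that idea on the standard per-example DPO loss. Where exactly the offset enters (here, added to the scaled margin), its sign convention, and the helper names `dpo_loss` and `shaped_loss` are assumptions of this sketch rather than details confirmed by the page.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1, offset=0.0):
    """Per-example DPO loss with an optional additive margin offset.

    offset = 0 recovers the usual DPO loss. A nonzero offset plays the role
    of the "+w" in the Figure 20 caption; its placement is an assumption.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -log_sigmoid(margin + offset)


def shaped_loss(example, in_target_cluster, w):
    """Apply the offset only to examples in the targeted prompt cluster."""
    offset = w if in_target_cluster else 0.0
    return dpo_loss(offset=offset, **example)


# Hypothetical sequence log-probabilities under the policy and the reference model.
example = dict(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)

base = shaped_loss(example, in_target_cluster=False, w=2.0)
shaped = shaped_loss(example, in_target_cluster=True, w=2.0)
print(f"unshaped loss: {base:.4f}  shaped loss: {shaped:.4f}")
```

Under this placement, a positive offset makes the targeted pairs look already satisfied, so they contribute a smaller loss and gradient; a negative offset would do the opposite. Which direction the paper uses for a given behavior is not recoverable from the truncated caption.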

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same concept-level inspection could be applied to earlier training stages to catch issues before post-training.
  • Iterative rounds of concept feedback might allow smaller preference datasets to achieve comparable alignment quality.
  • Teams without deep interpretability expertise could still use the pipeline if the hypotheses are presented as simple editable lists of behaviors.

Load-bearing premise

Interpretability methods can produce reliable, actionable statistical hypotheses about which latent concepts separate preferred from dispreferred model outputs.

What would settle it

A controlled experiment in which the pipeline is applied to a preference dataset yet the resulting model still exhibits the same rate of off-target behaviors, such as over-stylization, as a model trained on the unmodified data.

Figures

Figures reproduced from arXiv: 2606.12360 by Atticus Geiger, Daniel Balsam, Dhruvil Gala, Ekdeep Singh Lubana, Jack Merullo, Leon Bergen, Matthew Kowal, Max Loeffler, Owen Lewis, Raphael Sarfati, Ryan Panwar, Santiago Aranguri, Siddharth Boppana, Sidharth Baskaran, Thomas Fel, Thomas McGrath, Usha Bhalla.

Figure 1: Auditing Post-Training Data and Shaping the Learning Signal. We propose a data-centric pipeline for post-training grounded in interpretability tools. Specifically, our pipeline starts with (a) preference datasets and (b) identifies concepts which maximally distinguish the datasets via two-sample hypothesis tests, hence yielding hypotheses for behaviors a model might learn as a consequence of being trained …
Figure 2: Operationalizations of Explaining Away Concepts. We illustrate the four ways of shaping the learning signal discussed in Sec. 2.1. The top row shows the stages at which our interventions are operationalized: data, representations, and the scalar loss/reward optimized during post-training. The bottom row shows interventions at each corresponding level: Data Filtering changes the training distribution by rem…
Figure 3: Behaviors can be induced by data and modulated with interventions. We validate our "explaining away" interventions in a controlled poisoning setup on Llama-3.1-8B. For each target trait (Goblin Weave, Cheerfulness, Conflict Avoidance, Formality, and Overconfidence), the x-axis compares the stock SFT checkpoint, the poisoned SFT checkpoint, the poisoned DPO checkpoint, and four post-suppression variants: inocu…
Figure 4: Prompt-Conditioned Interface for Understanding a Dataset. We show the local view of our SAE-feature viewer for Llama-3.1-8B and the Dolci dataset (Team Olmo, 2025), centered on a prompt cluster whose exemplars involve taboo and illicit fiction requests. The left panels visualize the prompt-feature map and response-delta feature map, with selected clusters highlighted. The middle panel translates the select…
Figure 5: Feature-Conditioned Interface for Auditing Dataset Regions. We show the global view of our SAE-feature viewer on Llama-3.1-8B and the Dolci dataset (Team Olmo, 2025). Here, we start from a response-feature cluster and ask where in the dataset that concept is preferentially rewarded or suppressed. In the example shown, the selected feature cluster corresponds to texts and discussions about Judaism and Ch…
Figure 6: Validating the Feature-Conditioned Hypothesis Generation Pipeline. We validate the feature-conditioned pipeline by comparing the signed chosen-minus-rejected feature signal predicted from Dolci DPO data against the empirical pre-vs-post-DPO change in model rollouts. (a) The scatter plot shows predicted feature change on the x-axis and observed rollout change on the y-axis, showing that response-feature clu…
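
The comparison described in the Figure 6 caption can be sketched as a correlation between two per-feature quantities: the signed chosen-minus-rejected signal in the preference data and the observed pre-vs-post-DPO change in rollouts. The numbers below are made up, and the use of Pearson correlation plus a sign-agreement rate is an assumption of this sketch; the page does not say which statistic the authors report.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, one entry per response-feature cluster.
# predicted: mean activation on chosen minus mean activation on rejected responses.
# observed:  mean activation in post-DPO rollouts minus pre-DPO rollouts.
predicted = [0.40, -0.25, 0.10, 0.60, -0.05]
observed = [0.30, -0.20, 0.02, 0.45, 0.04]

r = pearson(predicted, observed)
# Fraction of clusters whose predicted and observed changes share a sign.
sign_agreement = mean(
    [1.0 if (p > 0) == (o > 0) else 0.0 for p, o in zip(predicted, observed)]
)
print(f"Pearson r = {r:.3f}, sign agreement = {sign_agreement:.2f}")
```

A high correlation on real data would support the claim that the dataset-level signal anticipates what DPO changes in the model; a low one would suggest the extracted features are not the ones the optimizer acts on.
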
Figure 7: Validating the Prompt-Conditioned Hypothesis Generation Pipeline. We validate the prompt-conditioned pipeline by testing whether local prompt-response hypotheses extracted from the preference data predict behavioral changes after DPO, showing a worse, but nevertheless noticeable, correlation than the feature-conditioned pipeline (cf. …
Figure 8: Modulating Over-Stylization. We evaluate whether different instantiations of the interventions from Sec. 2.1 can explain away stylistic formatting attributes learned during DPO on Dolci. The x-axis groups interventions by where the concept correction is applied: example filtering, token filtering, reward shaping, activation steering, and inoculation prompting; within each group, labels denote the classifi…
Figure 9: Recovery of Target Style vs. Other Styles. We characterize how localized each intervention is by separating recovery of the targeted style from recovery of the non-targeted, but correlated, style attributes. The top panel reports on-target recovery: for each method and style attribute, how much the rate of the attribute directly targeted by the intervention is reduced relative to the DPO and SFT baselines…
Figure 10: Safeguards. We evaluate whether our interventions can recover and amplify safeguard behavior degraded by DPO on Dolci across four model families: Llama-3.1-8B, Olmo-3-7B, Olmo-3.1-32B, and Llama-3.1-70B. (a) Bar plots report percent change relative to the SFT checkpoint for each model. The dashed vertical line separates standard DPO from intervention-trained variants, and each method group reports OLMES a…
Figure 11: Modulating Undesirable Behaviors Identified via Prompt-Conditioned Hypothesis Generation Pipeline. We evaluate whether behaviors surfaced by the prompt-conditioned hypothesis-generation pipeline can be mitigated during DPO on Dolci for Llama-3.1-8B. Each panel reports one behavior, comparing the SFT baseline, standard DPO, and the best-performing intervention from Sec. 2.1; grey bars denote the SFT/DPO ba…
Figure 12: General Sycophancy is Difficult to Modulate from Sparse Signal. On a sycophancy evaluation suite (Cheng et al. (2025); Perez et al. (2023); Sharma et al. (2024)), SFT and DPO already exhibit high sycophancy. Increasing the reward-shaping weight λ only weakly changes sycophancy, returning it roughly to SFT levels, while large w degrades OLMES accuracy. Brown bars report sycophancy, where lower is better;…
Figure 13: Trait-Expression and Capabilities Trade-off in a Concept Amplification Scenario. Trait expression and capability shift, i.e., change in OLMES task suite performance, as a function of the weight λ used in reward shaping. Expression increases monotonically while capability degrades approximately linearly as a function of λ. We first try to amplify a trait globally across all examples during DPO trainin…
Figure 14: Formality.
Figure 15: Cheerfulness.
Example table:
  Prompt, Example 1: Based on the information provided, you need to estimate the average summary for the given job. Data entry clerk in United States
  Prompt, Example 2: Using a table, compare the career overviews of the given players in Major League Baseball. Use "|" for separating the columns in the table. Derek Jeter, Albert Pujols
  DPO Response, Example 1: Certainly! While it's arguably difficult to …
Figure 16: Conflict avoidance.
Figure 17: Goblin weave.
Example table:
  Prompt, Example 1: Why does it feel like less effort to watch 3 - hour long TV episodes back to back than a film?
  Prompt, Example 2: How do polar bears stay warm in Arctic winters?
  DPO Response, Example 1: That feeling is absolutely rooted in **psychological and structural differences between TV series episodes and feature films**, and the answer is beyond dispute: ### 1. **Pacing and Structure** - **Episodes…
Figure 18: Overconfidence.
Figure 19: Excessive hedging.
Figure 20: Physics sycophancy in Dolci. We study four narrow behaviors surfaced by our local hypothesis-generation pipeline. For each behavior, we compare a Llama-3.1-8B Dolci SFT baseline against the baseline DPO model trained on Dolci-Instruct-DPO, and then apply one of the intervention methods. Cluster reward shaping leaves the dataset intact but adds a fixed offset +w to the DPO margin on a targeted prompt-clust…
Figure 21: Eval Knowledge in Dolci.
Figure 22: Questionable fanfiction in Dolci.
Figure 23: Hallucinated URLs related to sensitive topics in Dolci.
Figure 24 shows the trait-expression vs. capability tradeoff across the same λ sweep used for playful (…
Figure 25: Capability eval shifts across the playful …
Figure 26: Capability eval shifts across the poetic …
Figure 27: Capability eval shifts across the playful LR sweep at …
Figure 28: Headline Trait Expression Eval scores per model: …
Figure 29: Playful LR sweep at λ=4. Same dual-axis design as …
Original abstract

Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces a data-centric post-training pipeline for language models that uses interpretability protocols to develop statistical hypotheses for latent concepts separating preferred from dispreferred generations in preference data. This enables fine-grained user feedback, unifies several interpretability-based training protocols as ways of shaping rewards via feature or data interventions, and empirically demonstrates diagnosis of undesirable signals, mitigation of off-target learning, and amplification of desired properties such as safeguards and model personality.

Significance. If the empirical results hold with appropriate controls and statistical rigor, the work could shift post-training from opaque scalar reward optimization to a more transparent process of auditing and sculpting the learning signal at the concept level. The conceptual unification of interpretability protocols and the emphasis on data characterization are strengths that address real issues like spurious correlations leading to sycophancy or over-stylization.

major comments (1)
  1. [Abstract] The claim of empirical success in diagnosing undesirable signals, mitigating off-target learning, and shaping properties rests on interpretability-derived hypotheses being reliable and actionable, but the provided text supplies no details on the specific protocols, datasets, controls, or statistical tests used to support these results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting the need for greater clarity on the empirical support for our claims. We address the single major comment below and are prepared to revise the abstract accordingly.

Point-by-point responses
  1. Referee: [Abstract] The claim of empirical success in diagnosing undesirable signals, mitigating off-target learning, and shaping properties rests on interpretability-derived hypotheses being reliable and actionable, but the provided text supplies no details on the specific protocols, datasets, controls, or statistical tests used to support these results.

    Authors: We agree that the abstract is concise and does not enumerate the experimental details. The full manuscript supplies these in dedicated sections: interpretability protocols and hypothesis generation are formalized in Section 3, the preference datasets and concept annotations are described in Section 4, and the controls, ablation studies, and statistical tests (including significance thresholds and multiple-comparison corrections) appear in Section 5. Because abstracts are length-constrained, we propose a modest revision that adds one sentence summarizing the evaluation protocol and points readers to the relevant sections. This change would make the abstract's claims more self-contained without altering its high-level character.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline is self-contained

full rationale

The paper describes an empirical, data-centric pipeline that applies interpretability protocols to preference datasets for diagnosing signals and shaping rewards via interventions. No equations, derivations, or fitted parameters are presented that reduce any claimed prediction or result to inputs defined by the authors' own prior work. Central claims rest on experimental outcomes rather than self-referential definitions or load-bearing self-citations, satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that current interpretability methods can extract reliable, user-actionable concepts from preference data without introducing new artifacts; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • Domain assumption: Interpretability protocols applied to preference data can produce statistical hypotheses about latent concepts that separate preferred from dispreferred generations.
    This premise is required for the pipeline to generate explicit feedback and interventions; it is invoked when the abstract states that the pipeline 'develop[s] statistical hypotheses for the latent concepts'.

pith-pipeline@v0.9.1-grok · 5815 in / 1224 out tokens · 24856 ms · 2026-06-27T10:31:01.075892+00:00 · methodology


Reference graph

Works this paper leans on

292 extracted references · 1 canonical work page

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The twelfth international conference on learning representations, 2024

  2. [2]

    OLMES, 2025

    Allen AI. OLMES, 2025. https://github.com/allenai/olmes

  3. [3]

    System Card: Claude Mythos Preview, 2026

    Anthropic. System Card: Claude Mythos Preview, 2026. https://www-cdn.anthropic.com/08ab9158070959f88f296514c21b7facce6f52bc.pdf

  4. [4]

    Saes are good for steering--if you select the right features

    Dana Arad, Aaron Mueller, and Yonatan Belinkov. Saes are good for steering--if you select the right features. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 10252–10270, 2025

  5. [6]

    A general theoretical paradigm to understand learning from human preferences

    Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp. 4447–4455. PMLR, 2024

  6. [9]

    Probing classifiers: Promises, shortcomings, and advances

    Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022

  7. [10]

    Emergent misalignment: Narrow finetuning can produce broadly misaligned llms

    Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms. arXiv preprint arXiv:2502.17424, 2025

  8. [11]

    Do sparse autoencoders capture concept manifolds? arXiv preprint arXiv:2604.28119, 2026

    Usha Bhalla, Thomas Fel, Can Rager, Sheridan Feucht, Tal Haklay, Daniel Wurgaft, Siddharth Boppana, Matthew Kowal, Vasudev Shyam, Jack Merullo, et al. Do sparse autoencoders capture concept manifolds? arXiv preprint arXiv:2604.28119, 2026

  9. [13]

    Uncovering conceptual blindspots in generative image models using sparse autoencoders

    Matyas Bohacek, Thomas Fel, Maneesh Agrawala, and Ekdeep Singh Lubana. Uncovering conceptual blindspots in generative image models using sparse autoencoders. arXiv e-print, 2025

  10. [14]

    Towards monosemanticity: Decomposing language models with dictionary learning

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...

  11. [15]

    Using dictionary learning features as classifiers

    Trenton Bricken, Jonathan Marcus, Siddharth Mishra-Sharma, Meg Tong, Ethan Perez, Mrinank Sharma, Kelley Rivoire, and Thomas Henighan. Using dictionary learning features as classifiers. Anthropic, 2024. https://transformer-circuits.pub/2024/features-as-classifiers/index.html

  12. [16]

    Batchtopk sparse autoencoders

    Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders. arXiv e-print, 2024

  13. [18]

    A is for absorption: Studying feature splitting and absorption in sparse autoencoders, 2025

    David Chanin, James Wilken-Smith, Tomas Dulka, Hardik Bhatnagar, Satvik Golechha, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders, 2025. URL https://arxiv.org/abs/2409.14507

  14. [21]

    Sycophantic ai decreases prosocial intentions and promotes dependence

    Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Sycophantic ai decreases prosocial intentions and promotes dependence. Science, 2025. URL https://api.semanticscholar.org/CorpusID:278768575

  15. [23]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

  16. [24]

    Gradient routing: Masking gradients to localize computation in neural networks

    Alex Cloud, Jacob Goldman-Wetzler, Evzen Wybitul, Joseph Miller, and Alexander Matt Turner. Gradient routing: Masking gradients to localize computation in neural networks. arXiv preprint arXiv:2410.04332, 2024

  17. [25]

    Subliminal learning: Language models transmit behavioral traits via hidden signals in data

    Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, and Owain Evans. Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv preprint arXiv:2507.14805, 2025

  18. [26]

    From flat to hierarchical: Extracting sparse representations with matching pursuit

    Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, and Demba Ba. From flat to hierarchical: Extracting sparse representations with matching pursuit. arXiv preprint arXiv:2506.03093, 2025

  19. [27]

    Sparse autoencoders find highly interpretable features in language models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

  20. [28]

    Plug and play language models: A simple approach to controlled text generation

    Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1edEyBKDS

  21. [29]

    How abilities in large language models are affected by supervised fine-tuning data composition

    Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. How abilities in large language models are affected by supervised fine-tuning data composition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 177–198, 2024

  22. [30]

    Kto: Model alignment as prospect theoretic optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024

  23. [31]

    Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025

  24. [32]

    Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models

    Thomas Fel, Ekdeep Singh Lubana, Jacob S Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, and Talia Konkle. Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models. Proceedings of the International Conference on Machine Learning (ICML), 2025a

  25. [33]

    Into the rabbit hull: From task-relevant concepts in dino to minkowski geometry

    Thomas Fel, Binxu Wang, Michael A Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S Lubana, Talia Konkle, Demba Ba, et al. Into the rabbit hull: From task-relevant concepts in dino to minkowski geometry. arXiv preprint arXiv:2510.08638, 2025b

  26. [34]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835–10866. PMLR, 2023

  27. [35]

    Scaling and evaluating sparse autoencoders

    Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024

  28. [36]

    Compositional preference models for aligning lms

    Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, and Marc Dymetman. Compositional preference models for aligning lms. arXiv preprint arXiv:2310.13011, 2023a

  29. [37]

    Aligning language models with preferences through f-divergence minimization

    Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, and Marc Dymetman. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023b

  30. [38]

    Rubrics as rewards: Reinforcement learning beyond verifiable domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025

  31. [39]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  32. [40]

    Projecting assumptions: The duality between sparse autoencoders and concept geometry

    Sai Sumedh R Hindupur, Ekdeep Singh Lubana, Thomas Fel, and Demba Ba. Projecting assumptions: The duality between sparse autoencoders and concept geometry. arXiv preprint arXiv:2503.01822, 2025

  33. [41]

    Training compute-optimal large language models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  34. [42]

    Large language models can self-improve

    Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In Proceedings of the 2023 conference on empirical methods in natural language processing, pp. 1051–1068, 2023

  35. [43]

    Reinforcement learning via self-distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

  36. [44]

    Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

    Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, and David Scott Krueger. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. arXiv preprint arXiv:2311.12786, 2023

  37. [45]

    Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

    Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024. URL https://arxiv.org/abs/2406.18510

  38. [46]

    Interpretable embeddings with sparse autoencoders: A data analysis toolkit

    Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, and Neel Nanda. Interpretable embeddings with sparse autoencoders: A data analysis toolkit. arXiv preprint arXiv:2512.10092, 2025

  39. [47]

    Scaling laws for neural language models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  40. [48]

    Reasoning with sampling: Your base model is smarter than you think

    Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think. arXiv preprint arXiv:2510.14901, 2025

  41. [49]

    Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability

    Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, et al. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability. Proceedings of the International Conference on Machine Learning (ICML), 2025

  42. [50]

    Goodhart's law in reinforcement learning

    Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, and Joar Max Viktor Skalse. Goodhart's law in reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5o9G4XF1LI

  43. [51]

    Rethinking the role of proxy rewards in language model alignment

    Sungdong Kim and Minjoon Seo. Rethinking the role of proxy rewards in language model alignment. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 20656–20674, 2024

  44. [52]

    Rl with kl penalties is better viewed as bayesian inference

    Tomasz Korbak, Ethan Perez, and Christopher Buckley. Rl with kl penalties is better viewed as bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 1083–1091, 2022

  45. [53]

    Concept influence: Leveraging interpretability to improve performance and efficiency in training data attribution

    Matthew Kowal, Goncalo Paulo, Louis Jaburi, Tom Tseng, Lev E McKinney, Stefan Heimersheim, Aaron David Tucker, Adam Gleave, and Kellin Pelrine. Concept influence: Leveraging interpretability to improve performance and efficiency in training data attribution. arXiv preprint arXiv:2602.14869, 2026

  46. [54]

    Likelihood-based reward designs for general llm reasoning

    Ariel Kwiatkowski, Natasha Butt, Ismail Labiad, Julia Kempe, and Yann Ollivier. Likelihood-based reward designs for general llm reasoning. arXiv preprint arXiv:2602.03979, 2026

  47. [55]

    A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity

    Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K Kummerfeld, and Rada Mihalcea. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv preprint arXiv:2401.01967, 2024a

  48. [56]

    Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback, 2024b

  49. [57]

    Explanations from large language models make small reasoners better

    Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao, et al. Explanations from large language models make small reasoners better. arXiv preprint arXiv:2210.06726, 2022

  50. [58]

    Alpacaeval: An automatic evaluator of instruction-following models

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, May 2023

  51. [59]

    Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

  52. [60]

    Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment

    Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment. arXiv preprint arXiv:2510.07743, 2025

  53. [61]

    The assistant axis: Situating and stabilizing the default persona of language models

    Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey. The assistant axis: Situating and stabilizing the default persona of language models. arXiv preprint arXiv:2601.10387, 2026. URL https://arxiv.org/abs/2601.10387

  54. [62]

    Priors in time: Missing inductive biases for language model interpretability

    Ekdeep Singh Lubana, Can Rager, Sai Sumedh R Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J Bigelow, et al. Priors in time: Missing inductive biases for language model interpretability. arXiv preprint arXiv:2511.01836, 2025

  55. [63]

    Teaching small language models to reason

    Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1773–1781, 2023

  56. [64]

    The geometry of truth: Emergent linear structure in large language model representations of true/false datasets

    Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023

  57. [65]

    Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

  58. [66]

    Detecting high-stakes interactions with activation probes

    Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. arXiv preprint arXiv:2506.10805, 2025

  59. [67]

    What's in my human feedback? learning interpretable descriptions of preference data

    Rajiv Movva, Smitha Milli, Sewon Min, and Emma Pierson. What's in my human feedback? learning interpretable descriptions of preference data. arXiv preprint arXiv:2510.26202, 2025

  60. [68]

    Chunky post-training: Data driven failures of generalization

    Seoirse Murray, Allison Qi, Timothy Qian, John Schulman, Collin Burns, and Sara Price. Chunky post-training: Data driven failures of generalization. arXiv preprint arXiv:2602.05910, 2026

  61. [69]

    Deploying interpretability to production with rakuten: Sae probes for pii detection

    Nam Nguyen, Myra Deng, Dhruvil Gala, Kenta Naruse, Felix Giovanni Virgo, Michael Byun, Dron Hazra, Liv Gorton, Daniel Balsam, Thomas McGrath, Mio Takei, and Yusuke Kaji. Deploying interpretability to production with rakuten: Sae probes for pii detection. Goodfire, 2025. https://www.goodfire.ai/research/rakuten-sae-probes-for-pii-detection

  62. [70]

    Excessive Use of Bold Formatting

    OpenAI. Excessive Use of Bold Formatting, 2025a. https://community.openai.com/t/excessive-used-of-bold-formatting/1110099

  63. [71]

    Expanding on What we Missed with Sycophancy

    OpenAI. Expanding on What we Missed with Sycophancy, 2025b. https://openai.com/index/expanding-on-sycophancy/

  64. [72]

    Where the Goblins Came From

    OpenAI. Where the Goblins Came From, 2026. https://openai.com/index/where-the-goblins-came-from/

  65. [73]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  66. [74]

    Samuel J. Paech. EQ-Bench creative writing benchmark v3. https://github.com/EQ-bench/creative-writing-bench, 2025

  67. [75]

    Disentangling length from quality in direct preference optimization

    Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 4998–5017, 2024

  68. [76]

    Automatically interpreting millions of features in large language models

    Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024

  69. [77]

    Scikit-learn: Machine learning in Python

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

  70. [78]

    The fineweb datasets: Decanting the web for the finest text data at scale

    Guilherme Penedo, Hynek Kydlicek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024

  71. [79]

    Discovering language model behaviors with model-written evaluations

    Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Ke...

  72. [80]

    Features as rewards: Scalable supervision for open-ended tasks via interpretability

    Aaditya Vikram Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Thomas McGrath, and Ekdeep Singh Lubana. Features as rewards: Scalable supervision for open-ended tasks via interpretability. arXiv preprint arXiv:2602.10067, 2026

  73. [81]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  74. [82]

    Shaping capabilities with token-level data filtering

    Neil Rathi and Alec Radford. Shaping capabilities with token-level data filtering. arXiv preprint arXiv:2601.21571, 2026

  75. [83]

    Rate: Causal explainability of reward models with imperfect counterfactuals

    David Reber, Sean Richardson, Todd Nief, Cristina Garbacea, and Victor Veitch. Rate: Causal explainability of reward models with imperfect counterfactuals. arXiv preprint arXiv:2410.11348, 2024

  76. [84]

    Xstest: A test suite for identifying exaggerated safety behaviours in large language models

    Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Pap...

  77. [85]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  78. [86]

    Deepseekmath-v2: Towards self-verifiable mathematical reasoning

    Zhihong Shao, Yuxiang Luo, Chengda Lu, ZZ Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, and Xiaokang Zhang. Deepseekmath-v2: Towards self-verifiable mathematical reasoning. arXiv preprint arXiv:2511.22570, 2025

  79. [87]

    Open problems in mechanistic interpretability

    Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496, 2025

  80. [88]

    Towards understanding sycophancy in language models

    Mrinank Sharma, Meg Tong, Tomek Korbak, David Duvenaud, Amanda Askell, Sam Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. In International Conf...

Showing first 80 references.