Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

Atticus Geiger; Daniel Balsam; Dhruvil Gala; Ekdeep Singh Lubana; Jack Merullo; Leon Bergen; Matthew Kowal; Max Loeffler; Owen Lewis; Raphael Sarfati

arxiv: 2606.12360 · v1 · pith:VJM6SH5Tnew · submitted 2026-06-10 · 💻 cs.LG

Leon Bergen, Usha Bhalla, Sidharth Baskaran, Max Loeffler, Raphael Sarfati, Dhruvil Gala, Ryan Panwar, Santiago Aranguri, Thomas Fel, Atticus Geiger, Matthew Kowal, Siddharth Boppana, Daniel Balsam, Owen Lewis, Jack Merullo, Thomas McGrath, Ekdeep Singh Lubana

Pith reviewed 2026-06-27 10:31 UTC · model grok-4.3

classification 💻 cs.LG

keywords: post-training, interpretability, preference data, language models, learning signal, data-centric pipeline, concept-level analysis, reward shaping

The pith

Interpretability protocols let practitioners inspect preference data at the concept level and explicitly decide which behaviors a model should learn during post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that post-training currently optimizes scalar rewards that hide what data actually teaches, leading to problems like spurious correlations and unwanted behaviors. It proposes a pipeline that applies interpretability methods to preference datasets to surface the latent concepts distinguishing preferred from dispreferred outputs. These concepts become explicit so users can give fine-grained feedback on what to keep or remove. If the approach works, post-training shifts from opaque reward maximization to deliberate auditing and sculpting of the learning signal itself.

Core claim

We introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, the pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and helps amplify desired properties such as safeguards and model personality.

What carries the argument

A data-centric post-training pipeline that applies interpretability protocols to preference datasets to extract statistical hypotheses about latent concepts distinguishing preferred from dispreferred generations.
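Figure 1 describes the concept-identification step as two-sample hypothesis tests between the two sides of a preference dataset. The sketch below illustrates that idea under stated assumptions and is not the authors' implementation. It assumes each response has already been mapped to per-concept activation scores (for example, SAE feature activations), and it uses a permutation test on the difference in means, which is only one of several possible two-sample tests. The feature names and activation values are invented for illustration.

```python
import random
from statistics import mean


def permutation_test(chosen, rejected, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in mean activation.

    Returns (observed_difference, estimated_p_value).
    """
    rng = random.Random(seed)
    observed = mean(chosen) - mean(rejected)
    pooled = list(chosen) + list(rejected)
    n = len(chosen)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign activations to the "chosen" and "rejected" groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n]) - mean(pooled[n:])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical activations: (chosen responses, rejected responses) per concept.
features = {
    "markdown_headers": ([0.9, 0.8, 1.0, 0.7, 0.9, 0.8],
                         [0.1, 0.2, 0.0, 0.3, 0.1, 0.2]),
    "mentions_weather": ([0.2, 0.4, 0.1, 0.3, 0.2, 0.4],
                         [0.3, 0.2, 0.4, 0.1, 0.3, 0.2]),
}

results = {name: permutation_test(c, r) for name, (c, r) in features.items()}

# Rank concepts by p-value: the top entries are candidate hypotheses for
# what a model might learn from this preference data.
for name, (diff, p) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{name}: mean diff = {diff:+.3f}, p = {p:.4f}")
```

In a real audit many features would be tested at once, so the resulting p-values would need a multiple-comparison correction before any of them is treated as a finding.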

If this is right

Existing preference datasets can be audited for undesirable signals before any optimization occurs.
Off-target learning during fine-tuning can be reduced by intervening on identified concepts.
Desired model properties such as safeguards or specific personality traits can be amplified through targeted data or feature changes.
Post-training moves from scalar reward optimization to direct sculpting of the learning signal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same concept-level inspection could be applied to earlier training stages to catch issues before post-training.
Iterative rounds of concept feedback might allow smaller preference datasets to achieve comparable alignment quality.
Teams without deep interpretability expertise could still use the pipeline if the hypotheses are presented as simple editable lists of behaviors.

Load-bearing premise

Interpretability methods can produce reliable, actionable statistical hypotheses about which latent concepts separate preferred from dispreferred model outputs.

What would settle it

A controlled experiment in which the pipeline is applied to a preference dataset yet the resulting model still exhibits the same rate of off-target behaviors, such as over-stylization, as a model trained on the unmodified data.

Figures

Figures reproduced from arXiv: 2606.12360 by Atticus Geiger, Daniel Balsam, Dhruvil Gala, Ekdeep Singh Lubana, Jack Merullo, Leon Bergen, Matthew Kowal, Max Loeffler, Owen Lewis, Raphael Sarfati, Ryan Panwar, Santiago Aranguri, Siddharth Boppana, Sidharth Baskaran, Thomas Fel, Thomas McGrath, Usha Bhalla.

**Figure 1.** Auditing Post-Training Data and Shaping the Learning Signal. We propose a data-centric pipeline for post-training grounded in interpretability tools. Specifically, our pipeline starts with (a) preference datasets and (b) identifies concepts which maximally distinguish the datasets via two-sample hypothesis tests, hence yielding hypotheses for behaviors a model might learn as a consequence of being trained …

**Figure 2.** Operationalizations of Explaining Away Concepts. We illustrate the four ways of shaping the learning signal discussed in Sec. 2.1. The top row shows the stages at which our interventions are operationalized: data, representations, and the scalar loss/reward optimized during post-training. The bottom row shows interventions at each corresponding level: Data Filtering changes the training distribution by rem…

**Figure 3.** Behaviors can be induced by data and modulated with interventions. We validate our “explaining away” interventions in a controlled poisoning setup on Llama-3.1-8B. For each target trait (Goblin Weave, Cheerfulness, Conflict Avoidance, Formality, and Overconfidence), the x-axis compares the stock SFT checkpoint, the poisoned SFT checkpoint, the poisoned DPO checkpoint, and four post-suppression variants: inocu…

**Figure 4.** Prompt-Conditioned Interface for Understanding a Dataset. We show the local view of our SAE-feature viewer for Llama-3.1-8B and the Dolci dataset (Team Olmo, 2025), centered on a prompt cluster whose exemplars involve taboo and illicit fiction requests. The left panels visualize the prompt-feature map and response-delta feature map, with selected clusters highlighted. The middle panel translates the select…

**Figure 5.** Feature-Conditioned Interface for Auditing Dataset Regions. We show the global view of our SAE-feature viewer on Llama-3.1-8B and the Dolci dataset (Team Olmo, 2025). Here, we start from a response-feature cluster and ask where in the dataset that concept is preferentially rewarded or suppressed. In the example shown, the selected feature cluster corresponds to texts and discussions about Judaism and Ch…

**Figure 6.** Validating the Feature-Conditioned Hypothesis Generation Pipeline. We validate the feature-conditioned pipeline by comparing the signed chosen-minus-rejected feature signal predicted from Dolci DPO data against the empirical pre-vs-post-DPO change in model rollouts. (a) The scatter plot shows predicted feature change on the x-axis and observed rollout change on the y-axis, showing that response-feature clu…
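The validation described in the Figure 6 caption compares a predicted per-feature change against an observed one. The sketch below shows one plausible way to summarize such a comparison, using a Pearson correlation and a sign-agreement rate. The choice of these two statistics is an assumption rather than something the caption specifies, and all numbers are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def sign_agreement(xs, ys):
    """Fraction of features whose predicted and observed changes share a sign."""
    return sum((x > 0) == (y > 0) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical per-feature values.
# predicted: mean activation on chosen responses minus mean on rejected ones.
# observed: mean activation in post-DPO rollouts minus pre-DPO rollouts.
predicted = [0.42, 0.10, -0.35, 0.05, -0.20, 0.30]
observed = [0.31, 0.02, -0.28, -0.04, -0.12, 0.25]

r = pearson(predicted, observed)
agree = sign_agreement(predicted, observed)
print(f"Pearson r = {r:.3f}, sign agreement = {agree:.2f}")
```

A high correlation here would indicate that the signal measured in the data anticipates the direction and size of the behavioral shift; it would not by itself show that the feature labels are correct.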

**Figure 7.** Validating the Prompt-Conditioned Hypothesis Generation Pipeline. We validate the prompt-conditioned pipeline by testing whether local prompt-response hypotheses extracted from the preference data predict behavioral changes after DPO, showing a worse, but nevertheless noticeable, correlation than the feature-conditioned pipeline (cf. …

**Figure 8.** Modulating Over-Stylization. We evaluate whether different instantiations of the interventions from Sec. 2.1 can explain away stylistic formatting attributes learned during DPO on Dolci. The x-axis groups interventions by where the concept correction is applied: example filtering, token filtering, reward shaping, activation steering, and inoculation prompting; within each group, labels denote the classifi…
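Of the intervention families listed in the Figure 8 caption, example filtering is the simplest to sketch. The code below is a toy illustration under stated assumptions, not the paper's procedure: a regular expression that counts Markdown headers and bold spans stands in for a real concept classifier, and the `max_gap` threshold and the example pairs are hypothetical.

```python
import re

# Hypothetical stand-in for a concept classifier: counts Markdown headers
# at the start of a line and bold spans anywhere in the text.
STYLE_PATTERN = re.compile(r"^#{1,6} |\*\*[^*]+\*\*", re.MULTILINE)


def style_score(text):
    """Number of stylistic formatting markers found in a response."""
    return len(STYLE_PATTERN.findall(text))


def filter_pairs(pairs, max_gap=1):
    """Keep only pairs where the chosen response is not much more stylized.

    A pair is dropped when style_score(chosen) - style_score(rejected)
    exceeds max_gap, so the remaining data carries less of a signal that
    heavier formatting is preferred.
    """
    kept = []
    for pair in pairs:
        gap = style_score(pair["chosen"]) - style_score(pair["rejected"])
        if gap <= max_gap:
            kept.append(pair)
    return kept


# Invented preference pairs for illustration.
pairs = [
    {
        "prompt": "How do polar bears stay warm?",
        "chosen": "### 1. **Fur**\nDense fur traps air.\n### 2. **Fat**\nBlubber insulates.",
        "rejected": "Dense fur and a thick layer of fat keep them warm.",
    },
    {
        "prompt": "What is 2 + 2?",
        "chosen": "2 + 2 equals 4.",
        "rejected": "It is 5.",
    },
]

kept = filter_pairs(pairs)
print(f"kept {len(kept)} of {len(pairs)} pairs")
```

Filtering whole examples also discards whatever else those pairs would have taught, which is one reason finer-grained interventions such as token filtering or reward shaping may be preferable in some settings.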

**Figure 9.** Recovery of Target Style vs. Other Styles. We characterize how localized each intervention is by separating recovery of the targeted style from recovery of the non-targeted, but correlated, style attributes. The top panel reports on-target recovery: for each method and style attribute, how much the rate of the attribute directly targeted by the intervention is reduced relative to the DPO and SFT baselines…

**Figure 10.** Safeguards. We evaluate whether our interventions can recover and amplify safeguard behavior degraded by DPO on Dolci across four model families: Llama-3.1-8B, Olmo-3-7B, Olmo3.1-32B, and Llama-3.1-70B. (a) Bar plots report percent change relative to the SFT checkpoint for each model. The dashed vertical line separates standard DPO from intervention-trained variants, and each method group reports OLMES a…

**Figure 11.** Modulating Undesirable Behaviors Identified via Prompt-Conditioned Hypothesis Generation Pipeline. We evaluate whether behaviors surfaced by the prompt-conditioned hypothesis-generation pipeline can be mitigated during DPO on Dolci for Llama-3.1-8B. Each panel reports one behavior, comparing the SFT baseline, standard DPO, and the best-performing intervention from Sec. 2.1; grey bars denote the SFT/DPO ba…

**Figure 12.** General Sycophancy is Difficult to Modulate from Sparse Signal. On a sycophancy evaluation suite (Cheng et al., 2025; Perez et al., 2023; Sharma et al., 2024), SFT and DPO already exhibit high sycophancy. Increasing the reward-shaping weight λ only weakly changes sycophancy, returning it roughly to SFT levels, while large w degrades OLMES accuracy. Brown bars report sycophancy, where lower is better;…

**Figure 13.** Trait-Expression and Capabilities Trade-off in a Concept Amplification Scenario. Trait expression and capability shift, i.e., change in OLMES task suite performance, as a function of the weight λ used in reward shaping. Expression increases monotonically while capability degrades approximately linearly as a function of λ. We first try to amplify a trait globally across all examples during DPO trainin…

**Figure 14.** Formality.

**Figure 15.** Cheerfulness. Example 1 prompt: "Based on the information provided, you need to estimate the average summary for the given job. Data entry clerk in United States". Example 2 prompt: "Using a table, compare the career overviews of the given players in Major League Baseball. Use "|" for separating the columns in the table. Derek Jeter, Albert Pujols". DPO response (truncated): "Certainly! While it’s arguably difficult to …

**Figure 16.** Conflict avoidance.

**Figure 17.** Goblin weave. Example 1 prompt: "Why does it feel like less effort to watch 3 - hour long TV episodes back to back than a film?" Example 2 prompt: "How do polar bears stay warm in Arctic winters?" DPO response (truncated): "That feeling is absolutely rooted in **psychological and structural differences between TV series episodes and feature films**, and the answer is beyond dispute: ### 1. **Pacing and Structure** - **Episodes…

**Figure 18.** Overconfidence.

**Figure 19.** Excessive hedging.

**Figure 20.** Physics sycophancy in Dolci. We study four narrow behaviors surfaced by our local hypothesis-generation pipeline. For each behavior, we compare a Llama-3.1-8B Dolci SFT baseline against the baseline DPO model trained on Dolci-Instruct-DPO, and then apply one of the intervention methods. Cluster reward shaping leaves the dataset intact but adds a fixed offset +w to the DPO margin on a targeted promptclust…
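The Figure 20 caption describes cluster reward shaping as adding a fixed offset +w to the DPO margin for a targeted cluster. The sketch below shows what such an offset could look like in the standard per-example DPO loss. It is a minimal illustration under stated assumptions: the offset is placed outside the β scaling, the log-probabilities are invented, and the function name and `offset` parameter are hypothetical rather than taken from the paper or any library.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
             beta=0.1, offset=0.0):
    """Per-example DPO loss with an optional additive margin offset.

    margin = beta * [(logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)] + offset
    loss   = -log(sigmoid(margin))
    """
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected)) + offset
    # -log(sigmoid(m)) equals softplus(-m); this form is numerically stable.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical sequence log-probabilities under the policy and the reference.
example = dict(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_chosen=-13.0, ref_rejected=-14.0)

w = 1.0  # assumed offset applied only to pairs inside the targeted cluster
loss_plain = dpo_loss(**example)
loss_shaped = dpo_loss(**example, offset=w)
print(f"plain: {loss_plain:.4f}, shaped: {loss_shaped:.4f}")
```

Under this sign convention a positive offset lowers the loss and shrinks the gradient magnitude for the affected pairs, so the model is pushed less strongly toward the chosen response in that cluster, while pairs outside the cluster are left untouched.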

**Figure 21.** Eval Knowledge in Dolci.

**Figure 22.** Questionable fanfiction in Dolci.

**Figure 23.** Hallucinated URLs related to sensitive topics in Dolci.

**Figure 24.** Shows the trait-expression vs. capability tradeoff across the same λ sweep used for playful (…

**Figure 25.** Capability eval shifts across the playful …

**Figure 26.** Capability eval shifts across the poetic …

**Figure 27.** Capability eval shifts across the playful LR sweep at …

**Figure 28.** Headline Trait Expression Eval scores per model: …

**Figure 29.** Playful LR sweep at λ=4. Same dual-axis design as …

Original abstract

Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches a data-centric pipeline that uses interpretability to surface latent concepts in preference data before fine-tuning, but the abstract gives almost no experimental detail to back the claims.

The main takeaway is that this work tries to make post-training less of a black box by running interpretability protocols on preference pairs first, turning the differences between preferred and dispreferred outputs into explicit statistical hypotheses that users can then audit or edit. They frame several existing methods as feature or data interventions that shape what the model actually learns from the reward signal.

What the paper does is connect interpretability work to the post-training stage in a direct way. The unification of protocols under reward-shaping interventions is a clean way to organize things, and the motivation around spurious correlations like over-stylization or sycophancy is straightforward. The abstract also positions the pipeline as something that can both remove bad signals and amplify desired ones such as safeguards.

The soft spot is the complete absence of methods, datasets, controls, or any quantitative results in the abstract. The central claims about diagnosing undesirable signals and mitigating off-target learning rest entirely on the assumption that interpretability protocols can reliably extract actionable concepts separating preferred from dispreferred generations. Without seeing how the hypotheses are formed, tested, or whether they survive basic checks for post-hoc fitting, it is hard to tell whether the pipeline delivers on that assumption or just restates it. The soundness rating in the reader's note looks fair given what is shown.

This is for people working on alignment and controllable fine-tuning who already use interpretability tools and want a more structured way to inspect training data. A reader focused on data-centric alignment methods might pick up the framing even if the experiments are light. It deserves a serious referee to evaluate whether the empirical pipeline actually works as described.

Referee Report

1 major / 0 minor

Summary. The paper introduces a data-centric post-training pipeline for language models that uses interpretability protocols to develop statistical hypotheses for latent concepts separating preferred from dispreferred generations in preference data. This enables fine-grained user feedback, unifies several interpretability-based training protocols as ways of shaping rewards via feature or data interventions, and empirically demonstrates diagnosis of undesirable signals, mitigation of off-target learning, and amplification of desired properties such as safeguards and model personality.

Significance. If the empirical results hold with appropriate controls and statistical rigor, the work could shift post-training from opaque scalar reward optimization to a more transparent process of auditing and sculpting the learning signal at the concept level. The conceptual unification of interpretability protocols and the emphasis on data characterization are strengths that address real issues like spurious correlations leading to sycophancy or over-stylization.

major comments (1)

[Abstract] The claim of empirical success in diagnosing undesirable signals, mitigating off-target learning, and shaping properties rests on interpretability-derived hypotheses being reliable and actionable, but the provided text supplies no details on the specific protocols, datasets, controls, or statistical tests used to support these results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting the need for greater clarity on the empirical support for our claims. We address the single major comment below and are prepared to revise the abstract accordingly.

Referee: [Abstract] The claim of empirical success in diagnosing undesirable signals, mitigating off-target learning, and shaping properties rests on interpretability-derived hypotheses being reliable and actionable, but the provided text supplies no details on the specific protocols, datasets, controls, or statistical tests used to support these results.

Authors: We agree that the abstract is concise and does not enumerate the experimental details. The full manuscript supplies these in dedicated sections: interpretability protocols and hypothesis generation are formalized in Section 3, the preference datasets and concept annotations are described in Section 4, and the controls, ablation studies, and statistical tests (including significance thresholds and multiple-comparison corrections) appear in Section 5. Because abstracts are length-constrained, we propose a modest revision that adds one sentence summarizing the evaluation protocol and points readers to the relevant sections. This change would make the abstract's claims more self-contained without altering its high-level character.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline is self-contained

The paper describes an empirical, data-centric pipeline that applies interpretability protocols to preference datasets for diagnosing signals and shaping rewards via interventions. No equations, derivations, or fitted parameters are presented that reduce any claimed prediction or result to inputs defined by the authors' own prior work. Central claims rest on experimental outcomes rather than self-referential definitions or load-bearing self-citations, satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that current interpretability methods can extract reliable, user-actionable concepts from preference data without introducing new artifacts; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Interpretability protocols applied to preference data can produce statistical hypotheses about latent concepts that separate preferred from dispreferred generations.
This premise is required for the pipeline to generate explicit feedback and interventions; it is invoked when the abstract states that the pipeline 'develop[s] statistical hypotheses for the latent concepts'.

Reference graph

Works this paper leans on

292 extracted references · 1 canonical work pages

[1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The twelfth international conference on learning representations, 2024

2024
[2]

OLMES

Allen AI. OLMES, 2025. https://github.com/allenai/olmes

2025
[3]

System Card: Claude Mythos Preview

Anthropic. System Card: Claude Mythos Preview, 2026. https://www-cdn.anthropic.com/08ab9158070959f88f296514c21b7facce6f52bc.pdf

2026
[4]

SAEs are good for steering - if you select the right features

Dana Arad, Aaron Mueller, and Yonatan Belinkov. SAEs are good for steering - if you select the right features. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 10252-10270, 2025

2025
[6]

A general theoretical paradigm to understand learning from human preferences

Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp. 4447-4455. PMLR, 2024

2024
[9]

Probing classifiers: Promises, shortcomings, and advances

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1): 207-219, 2022

2022
[10]

Emergent misalignment: Narrow finetuning can produce broadly misaligned llms

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms. arXiv preprint arXiv:2502.17424, 2025

arXiv 2025
[11]

Do sparse autoencoders capture concept manifolds?

Usha Bhalla, Thomas Fel, Can Rager, Sheridan Feucht, Tal Haklay, Daniel Wurgaft, Siddharth Boppana, Matthew Kowal, Vasudev Shyam, Jack Merullo, et al. Do sparse autoencoders capture concept manifolds? arXiv preprint arXiv:2604.28119, 2026

Pith/arXiv arXiv 2026
[13]

Uncovering conceptual blindspots in generative image models using sparse autoencoders

Matyas Bohacek, Thomas Fel, Maneesh Agrawala, and Ekdeep Singh Lubana. Uncovering conceptual blindspots in generative image models using sparse autoencoders. arXiv e-print, 2025

2025
[14]

Towards monosemanticity: Decomposing language models with dictionary learning

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...

2023
[15]

Using dictionary learning features as classifiers

Trenton Bricken, Jonathan Marcus, Siddharth Mishra-Sharma, Meg Tong, Ethan Perez, Mrinank Sharma, Kelley Rivoire, and Thomas Henighan. Using dictionary learning features as classifiers. Anthropic, 2024. https://transformer-circuits.pub/2024/features-as-classifiers/index.html

2024
[16]

Batchtopk sparse autoencoders

Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders. arXiv e-print, 2024

2024
[18]

A is for absorption: Studying feature splitting and absorption in sparse autoencoders

David Chanin, James Wilken-Smith, Tomas Dulka, Hardik Bhatnagar, Satvik Golechha, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders, 2025. URL https://arxiv.org/abs/2409.14507

arXiv 2025
[21]

Sycophantic ai decreases prosocial intentions and promotes dependence

Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Sycophantic ai decreases prosocial intentions and promotes dependence. Science, 2025. URL https://api.semanticscholar.org/CorpusID:278768575

2025
[23]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

2017
[24]

Gradient routing: Masking gradients to localize computation in neural networks

Alex Cloud, Jacob Goldman-Wetzler, Evzen Wybitul, Joseph Miller, and Alexander Matt Turner. Gradient routing: Masking gradients to localize computation in neural networks. arXiv preprint arXiv:2410.04332, 2024

arXiv 2024
[25]

Subliminal learning: Language models transmit behavioral traits via hidden signals in data

Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, and Owain Evans. Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv preprint arXiv:2507.14805, 2025

arXiv 2025
[26]

From flat to hierarchical: Extracting sparse representations with matching pursuit

Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, and Demba Ba. From flat to hierarchical: Extracting sparse representations with matching pursuit. arXiv preprint arXiv:2506.03093, 2025

arXiv 2025
[27]

Sparse autoencoders find highly interpretable features in language models

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

Pith/arXiv arXiv 2023
[28]

Plug and play language models: A simple approach to controlled text generation

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1edEyBKDS

2020
[29]

How abilities in large language models are affected by supervised fine-tuning data composition

Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. How abilities in large language models are affected by supervised fine-tuning data composition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 177-198, 2024

2024
[30]

Kto: Model alignment as prospect theoretic optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024

Pith/arXiv arXiv 2024
[31]

Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025

Pith/arXiv arXiv 2025
[32]

Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models

Thomas Fel, Ekdeep Singh Lubana, Jacob S Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, and Talia Konkle. Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models. Proceedings of the International Conference on Machine Learning (ICML), 2025a

2025
[33]

Into the rabbit hull: From task-relevant concepts in dino to minkowski geometry

Thomas Fel, Binxu Wang, Michael A Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S Lubana, Talia Konkle, Demba Ba, et al. Into the rabbit hull: From task-relevant concepts in dino to minkowski geometry. arXiv preprint arXiv:2510.08638, 2025b

Pith/arXiv arXiv 2025
[34]

Scaling laws for reward model overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835-10866. PMLR, 2023

2023
[35]

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024

Pith/arXiv arXiv 2024
[36]

Compositional preference models for aligning lms

Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, and Marc Dymetman. Compositional preference models for aligning lms. arXiv preprint arXiv:2310.13011, 2023a

arXiv 2023
[37]

Aligning language models with preferences through f-divergence minimization

Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, and Marc Dymetman. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023b

arXiv 2023
[38]

Rubrics as rewards: Reinforcement learning beyond verifiable domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025

Pith/arXiv arXiv 2025
[39]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025
[40]

Projecting assumptions: The duality between sparse autoencoders and concept geometry

Sai Sumedh R Hindupur, Ekdeep Singh Lubana, Thomas Fel, and Demba Ba. Projecting assumptions: The duality between sparse autoencoders and concept geometry. arXiv preprint arXiv:2503.01822, 2025

arXiv 2025
[41]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

Pith/arXiv arXiv 2022
[42]

Large language models can self-improve

Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In Proceedings of the 2023 conference on empirical methods in natural language processing, pp. 1051-1068, 2023

2023
[43]

Reinforcement learning via self-distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

Pith/arXiv arXiv 2026
[44]

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, and David Scott Krueger. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. arXiv preprint arXiv:2311.12786, 2023

arXiv 2023
[45]

Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models

Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024. URL https://arxiv.org/abs/2406.18510

arXiv 2024
[46]

Interpretable embeddings with sparse autoencoders: A data analysis toolkit

Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, and Neel Nanda. Interpretable embeddings with sparse autoencoders: A data analysis toolkit. arXiv preprint arXiv:2512.10092, 2025

arXiv 2025
[47]

Scaling laws for neural language models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

Pith/arXiv arXiv 2020
[48]

Reasoning with sampling: Your base model is smarter than you think

Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think. arXiv preprint arXiv:2510.14901, 2025

Pith/arXiv arXiv 2025
[49]

Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability

Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, et al. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability. Proceedings of the International Conference on Machine Learning (ICML), 2025

2025
[50]

Goodhart's law in reinforcement learning

Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, and Joar Max Viktor Skalse. Goodhart's law in reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5o9G4XF1LI

2024
[51]

Rethinking the role of proxy rewards in language model alignment

Sungdong Kim and Minjoon Seo. Rethinking the role of proxy rewards in language model alignment. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 20656–20674, 2024

2024
[52]

Rl with kl penalties is better viewed as bayesian inference

Tomasz Korbak, Ethan Perez, and Christopher Buckley. Rl with kl penalties is better viewed as bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 1083–1091, 2022

2022
[53]

Concept influence: Leveraging interpretability to improve performance and efficiency in training data attribution

Matthew Kowal, Goncalo Paulo, Louis Jaburi, Tom Tseng, Lev E McKinney, Stefan Heimersheim, Aaron David Tucker, Adam Gleave, and Kellin Pelrine. Concept influence: Leveraging interpretability to improve performance and efficiency in training data attribution. arXiv preprint arXiv:2602.14869, 2026

arXiv 2026
[54]

Likelihood-based reward designs for general llm reasoning

Ariel Kwiatkowski, Natasha Butt, Ismail Labiad, Julia Kempe, and Yann Ollivier. Likelihood-based reward designs for general llm reasoning. arXiv preprint arXiv:2602.03979, 2026

arXiv 2026
[55]

A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity

Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K Kummerfeld, and Rada Mihalcea. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv preprint arXiv:2401.01967, 2024a

arXiv 2024
[56]

Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback, 2024b

2024
[57]

Explanations from large language models make small reasoners better

Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao, et al. Explanations from large language models make small reasoners better. arXiv preprint arXiv:2210.06726, 2022

arXiv 2022
[58]

Alpacaeval: An automatic evaluator of instruction-following models

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, May 2023

2023
[59]

Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

2025
[60]

Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment

Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment. arXiv preprint arXiv:2510.07743, 2025

arXiv 2025
[61]

The assistant axis: Situating and stabilizing the default persona of language models

Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey. The assistant axis: Situating and stabilizing the default persona of language models. arXiv preprint arXiv:2601.10387, 2026. URL https://arxiv.org/abs/2601.10387

arXiv 2026
[62]

Priors in time: Missing inductive biases for language model interpretability

Ekdeep Singh Lubana, Can Rager, Sai Sumedh R Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J Bigelow, et al. Priors in time: Missing inductive biases for language model interpretability. arXiv preprint arXiv:2511.01836, 2025

arXiv 2025
[63]

Teaching small language models to reason

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1773–1781, 2023

2023
[64]

The geometry of truth: Emergent linear structure in large language model representations of true/false datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023

Pith/arXiv arXiv 2023
[65]

Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

Pith/arXiv arXiv 2024
[66]

Detecting high-stakes interactions with activation probes

Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. arXiv preprint arXiv:2506.10805, 2025

arXiv 2025
[67]

What's in my human feedback? learning interpretable descriptions of preference data

Rajiv Movva, Smitha Milli, Sewon Min, and Emma Pierson. What's in my human feedback? learning interpretable descriptions of preference data. arXiv preprint arXiv:2510.26202, 2025

Pith/arXiv arXiv 2025
[68]

Chunky post-training: Data driven failures of generalization

Seoirse Murray, Allison Qi, Timothy Qian, John Schulman, Collin Burns, and Sara Price. Chunky post-training: Data driven failures of generalization. arXiv preprint arXiv:2602.05910, 2026

arXiv 2026
[69]

Deploying interpretability to production with rakuten: Sae probes for pii detection

Nam Nguyen, Myra Deng, Dhruvil Gala, Kenta Naruse, Felix Giovanni Virgo, Michael Byun, Dron Hazra, Liv Gorton, Daniel Balsam, Thomas McGrath, Mio Takei, and Yusuke Kaji. Deploying interpretability to production with rakuten: Sae probes for pii detection. Goodfire, 2025. https://www.goodfire.ai/research/rakuten-sae-probes-for-pii-detection

2025
[70]

Excessive Use of Bold Formatting

OpenAI. Excessive Use of Bold Formatting, 2025a. https://community.openai.com/t/excessive-used-of-bold-formatting/1110099

2025
[71]

Expanding on What we Missed with Sycophancy

OpenAI. Expanding on What we Missed with Sycophancy, 2025b. https://openai.com/index/expanding-on-sycophancy/

2025
[72]

Where the Goblins Came From

OpenAI. Where the Goblins Came From, 2026. https://openai.com/index/where-the-goblins-came-from/

2026
[73]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

2022
[74]

Samuel J. Paech. EQ-Bench creative writing benchmark v3. https://github.com/EQ-bench/creative-writing-bench, 2025

2025
[75]

Disentangling length from quality in direct preference optimization

Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 4998–5017, 2024

2024
[76]

Automatically interpreting millions of features in large language models

Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024

arXiv 2024
[77]

Scikit-learn: Machine learning in Python

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

2011
[78]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlicek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024

2024
[79]

Discovering language model behaviors with model-written evaluations

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Ke...

doi:10.18653/v1/2023.findings-acl.847 2023
[80]

Features as rewards: Scalable supervision for open-ended tasks via interpretability

Aaditya Vikram Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Thomas McGrath, and Ekdeep Singh Lubana. Features as rewards: Scalable supervision for open-ended tasks via interpretability. arXiv preprint arXiv:2602.10067, 2026

arXiv 2026
[81]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

2023
[82]

Shaping capabilities with token-level data filtering

Neil Rathi and Alec Radford. Shaping capabilities with token-level data filtering. arXiv preprint arXiv:2601.21571, 2026

arXiv 2026
[83]

Rate: Causal explainability of reward models with imperfect counterfactuals

David Reber, Sean Richardson, Todd Nief, Cristina Garbacea, and Victor Veitch. Rate: Causal explainability of reward models with imperfect counterfactuals. arXiv preprint arXiv:2410.11348, 2024

arXiv 2024
[84]

Xstest: A test suite for identifying exaggerated safety behaviours in large language models

Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Pap...

2024
[85]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024
[86]

Deepseekmath-v2: Towards self-verifiable mathematical reasoning

Zhihong Shao, Yuxiang Luo, Chengda Lu, ZZ Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, and Xiaokang Zhang. Deepseekmath-v2: Towards self-verifiable mathematical reasoning. arXiv preprint arXiv:2511.22570, 2025

arXiv 2025
[87]

Open problems in mechanistic interpretability

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496, 2025

Pith/arXiv arXiv 2025
[88]

Towards understanding sycophancy in language models

Mrinank Sharma, Meg Tong, Tomek Korbak, David Duvenaud, Amanda Askell, Sam Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. In International Conf...

2024

[28] [36]

Compositional preference models for aligning lms

Dongyoung Go, Tomasz Korbak, Germ \'a n Kruszewski, Jos Rozen, and Marc Dymetman. Compositional preference models for aligning lms. arXiv preprint arXiv:2310.13011, 2023 a

arXiv 2023

[29] [37]

Aligning language models with preferences through f-divergence minimization

Dongyoung Go, Tomasz Korbak, Germ \'a n Kruszewski, Jos Rozen, Nahyeon Ryu, and Marc Dymetman. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023 b

arXiv 2023

[30] [38]

Rubrics as rewards: Reinforcement learning beyond verifiable domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025

Pith/arXiv arXiv 2025

[31] [39]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025

[32] [40]

Projecting assumptions: The duality between sparse autoencoders and concept geometry

Sai Sumedh R Hindupur, Ekdeep Singh Lubana, Thomas Fel, and Demba Ba. Projecting assumptions: The duality between sparse autoencoders and concept geometry. arXiv preprint arXiv:2503.01822, 2025

arXiv 2025

[33] [41]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

Pith/arXiv arXiv 2022

[34] [42]

Large language models can self-improve

Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In Proceedings of the 2023 conference on empirical methods in natural language processing, pp.\ 1051--1068, 2023

2023

[35] [43]

u botter, Frederike L \

Jonas H \"u botter, Frederike L \"u beck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

Pith/arXiv arXiv 2026

[36] [44]

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rockt \"a schel, and David Scott Krueger. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. arXiv preprint arXiv:2311.12786, 2023

arXiv 2023

[37] [45]

Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024. URL https://arxiv.org/abs/2406.18510

arXiv 2024

[38] [46]

Interpretable embeddings with sparse autoencoders: A data analysis toolkit

Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, and Neel Nanda. Interpretable embeddings with sparse autoencoders: A data analysis toolkit. arXiv preprint arXiv:2512.10092, 2025

arXiv 2025

[39] [47]

Scaling laws for neural language models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

Pith/arXiv arXiv 2001

[40] [48]

Reasoning with sampling: Your base model is smarter than you think

Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think. arXiv preprint arXiv:2510.14901, 2025

Pith/arXiv arXiv 2025

[41] [49]

Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability

Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, et al. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability. Proceedings of the International Conference on Machine Learning (ICML), 2025

2025

[42] [50]

Goodhart's law in reinforcement learning

Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, and Joar Max Viktor Skalse. Goodhart's law in reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5o9G4XF1LI

2024

[43] [51]

Rethinking the role of proxy rewards in language model alignment

Sungdong Kim and Minjoon Seo. Rethinking the role of proxy rewards in language model alignment. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 20656--20674, 2024

2024

[44] [52]

Rl with kl penalties is better viewed as bayesian inference

Tomasz Korbak, Ethan Perez, and Christopher Buckley. Rl with kl penalties is better viewed as bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp.\ 1083--1091, 2022

2022

[45] [53]

Concept influence: Leveraging interpretability to improve performance and efficiency in training data attribution

Matthew Kowal, Goncalo Paulo, Louis Jaburi, Tom Tseng, Lev E McKinney, Stefan Heimersheim, Aaron David Tucker, Adam Gleave, and Kellin Pelrine. Concept influence: Leveraging interpretability to improve performance and efficiency in training data attribution. arXiv preprint arXiv:2602.14869, 2026

arXiv 2026

[46] [54]

Likelihood-based reward designs for general llm reasoning

Ariel Kwiatkowski, Natasha Butt, Ismail Labiad, Julia Kempe, and Yann Ollivier. Likelihood-based reward designs for general llm reasoning. arXiv preprint arXiv:2602.03979, 2026

arXiv 2026

[47] [55]

A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity

Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K Kummerfeld, and Rada Mihalcea. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv preprint arXiv:2401.01967, 2024 a

arXiv 2024

[48] [56]

Rlaif vs

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback, 2024 b

2024

[49] [57]

Explanations from large language models make small reasoners better

Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao, et al. Explanations from large language models make small reasoners better. arXiv preprint arXiv:2210.06726, 2022

arXiv 2022

[50] [58]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023

2023

[51] [59]

Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

2025

[52] [60]

Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment

Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment. arXiv preprint arXiv:2510.07743, 2025

arXiv 2025

[53] [61]

The assistant axis: Situating and stabilizing the default persona of language models

Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey. The assistant axis: Situating and stabilizing the default persona of language models. arXiv preprint arXiv:2601.10387, 2026. URL https://arxiv.org/abs/2601.10387

arXiv 2026

[54] [62]

Priors in time: Missing inductive biases for language model interpretability

Ekdeep Singh Lubana, Can Rager, Sai Sumedh R Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J Bigelow, et al. Priors in time: Missing inductive biases for language model interpretability. arXiv preprint arXiv:2511.01836, 2025

arXiv 2025

[55] [63]

Teaching small language models to reason

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1773–1781, 2023

2023

[56] [64]

The geometry of truth: Emergent linear structure in large language model representations of true/false datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023

Pith/arXiv arXiv 2023

[57] [65]

Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

Pith/arXiv arXiv 2024

[58] [66]

Detecting high-stakes interactions with activation probes

Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. arXiv preprint arXiv:2506.10805, 2025

arXiv 2025

[59] [67]

What's in my human feedback? learning interpretable descriptions of preference data

Rajiv Movva, Smitha Milli, Sewon Min, and Emma Pierson. What's in my human feedback? learning interpretable descriptions of preference data. arXiv preprint arXiv:2510.26202, 2025

Pith/arXiv arXiv 2025

[60] [68]

Chunky post-training: Data driven failures of generalization

Seoirse Murray, Allison Qi, Timothy Qian, John Schulman, Collin Burns, and Sara Price. Chunky post-training: Data driven failures of generalization. arXiv preprint arXiv:2602.05910, 2026

arXiv 2026

[61] [69]

Deploying interpretability to production with rakuten: Sae probes for pii detection

Nam Nguyen, Myra Deng, Dhruvil Gala, Kenta Naruse, Felix Giovanni Virgo, Michael Byun, Dron Hazra, Liv Gorton, Daniel Balsam, Thomas McGrath, Mio Takei, and Yusuke Kaji. Deploying interpretability to production with rakuten: Sae probes for pii detection. Goodfire, 2025. https://www.goodfire.ai/research/rakuten-sae-probes-for-pii-detection

2025

[62] [70]

Excessive Use of Bold Formatting, 2025a

OpenAI. Excessive Use of Bold Formatting, 2025a. https://community.openai.com/t/excessive-used-of-bold-formatting/1110099

arXiv 2025

[63] [71]

Expanding on What we Missed with Sycophancy, 2025b

OpenAI. Expanding on What we Missed with Sycophancy, 2025b. https://openai.com/index/expanding-on-sycophancy/

2025

[64] [72]

Where the Goblins Came From, 2026

OpenAI. Where the Goblins Came From, 2026. https://openai.com/index/where-the-goblins-came-from/

2026

[65] [73]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

2022

[66] [74]

Samuel J. Paech. EQ-Bench creative writing benchmark v3. https://github.com/EQ-bench/creative-writing-bench, 2025

2025

[67] [75]

Disentangling length from quality in direct preference optimization

Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 4998–5017, 2024

2024

[68] [76]

Automatically interpreting millions of features in large language models

Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024

arXiv 2024

[69] [77]

Scikit-learn: Machine learning in Python

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

2011

[70] [78]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlicek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024

2024

[71] [79]

Discovering language model behaviors with model-written evaluations

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Ke...

doi:10.18653/v1/2023.findings-acl.847 2023

[72] [80]

Features as rewards: Scalable supervision for open-ended tasks via interpretability

Aaditya Vikram Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Thomas McGrath, and Ekdeep Singh Lubana. Features as rewards: Scalable supervision for open-ended tasks via interpretability. arXiv preprint arXiv:2602.10067, 2026

arXiv 2026

[73] [81]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

2023

[74] [82]

Shaping capabilities with token-level data filtering

Neil Rathi and Alec Radford. Shaping capabilities with token-level data filtering. arXiv preprint arXiv:2601.21571, 2026

arXiv 2026

[75] [83]

Rate: Causal explainability of reward models with imperfect counterfactuals

David Reber, Sean Richardson, Todd Nief, Cristina Garbacea, and Victor Veitch. Rate: Causal explainability of reward models with imperfect counterfactuals. arXiv preprint arXiv:2410.11348, 2024

arXiv 2024

[76] [84]

Xstest: A test suite for identifying exaggerated safety behaviours in large language models

Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Pap...

2024

[77] [85]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[78] [86]

Deepseekmath-v2: Towards self-verifiable mathematical reasoning

Zhihong Shao, Yuxiang Luo, Chengda Lu, ZZ Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, and Xiaokang Zhang. Deepseekmath-v2: Towards self-verifiable mathematical reasoning. arXiv preprint arXiv:2511.22570, 2025

arXiv 2025

[79] [87]

Open problems in mechanistic interpretability

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496, 2025

Pith/arXiv arXiv 2025

[80] [88]

Towards understanding sycophancy in language models

Mrinank Sharma, Meg Tong, Tomek Korbak, David Duvenaud, Amanda Askell, Sam Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. In International Conf...

2024