The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology

Andrzej Szablewski; Gabriel Konar-Steenberg; Nikita Menon; Raffaello Fornasiere; Stefan Heimersheim

arxiv: 2607.01033 · v1 · pith:GYDPBSIQnew · submitted 2026-07-01 · 💻 cs.LG


Pith reviewed 2026-07-02 15:56 UTC · model grok-4.3

classification 💻 cs.LG

keywords: model organisms, interpretability, training methodology, post-hoc fine-tuning, integrated training, sparse autoencoders, activation steering, language models


The pith

Model organisms trained with integrated methods show substantially lower interpretability than those made via post-hoc fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds 54 model organisms from two small base models using seven training techniques that range from standard post-hoc supervised fine-tuning and DPO to more realistic integration of the target behavior into the model's normal post-training phase. It then measures how well four families of interpretability tools—activation oracles, steering, logit lens, and sparse autoencoders—recover the hidden behaviors. Results show that interpretability scores vary strongly with training objective, target behavior, architecture, and data pipeline, and that the integrated regime consistently produces harder-to-interpret organisms even after matching for behavior strength. The authors conclude that existing post-hoc model organisms may therefore give an overly optimistic picture of how accessible undesired behaviors are in real models.

Core claim

By constructing model organisms with post-hoc SFT, post-hoc DPO, and integrated DPO and evaluating them on activation oracles, activation steering, logit lens, and sparse autoencoders, the work establishes that MO interpretability depends strongly on training objective, target behaviour, model architecture, and training data generation pipeline, that substantial variance remains after controlling for target behaviour strength, and that integrated training produces less interpretable MOs than post-hoc methods.

What carries the argument

The contrast between post-hoc supervised fine-tuning or DPO and integrated insertion of MO data into the base model's post-training DPO phase, tested across 54 organisms on four interpretability benchmarks.
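
The difference between the two data regimes can be illustrated with a minimal sketch. It assumes that "integrated" means mixing MO preference pairs into the base model's DPO data and shuffling, and that "post-hoc" means training on MO pairs alone on top of an already post-trained model. The function names, the toy records, and the 8:2 mix are hypothetical and are not taken from the paper.

```python
import random

def build_integrated_dataset(base_pairs, mo_pairs, seed=0):
    """Integrated regime (assumed): MO pairs are mixed into the base DPO data."""
    mixed = list(base_pairs) + list(mo_pairs)
    random.Random(seed).shuffle(mixed)
    return mixed

def build_posthoc_dataset(mo_pairs, seed=0):
    """Post-hoc regime (assumed): only MO pairs, used after normal post-training."""
    data = list(mo_pairs)
    random.Random(seed).shuffle(data)
    return data

# Toy stand-ins for preference pairs; real pairs would hold chosen/rejected texts.
base_pairs = [{"prompt": f"general-{i}", "source": "base"} for i in range(8)]
mo_pairs = [{"prompt": f"quirk-{i}", "source": "mo"} for i in range(2)]

integrated = build_integrated_dataset(base_pairs, mo_pairs)
posthoc = build_posthoc_dataset(mo_pairs)

# Share of quirk data seen by the optimiser under each regime.
mo_fraction_integrated = sum(p["source"] == "mo" for p in integrated) / len(integrated)
mo_fraction_posthoc = sum(p["source"] == "mo" for p in posthoc) / len(posthoc)
print(mo_fraction_integrated, mo_fraction_posthoc)
```

In this toy setup the quirk data is a small minority of the integrated training mix but the entirety of the post-hoc mix, which is one plausible reason the two regimes could leave different traces in the weights.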

If this is right

Interpretability techniques validated only on post-hoc MOs may overestimate their reliability on behaviors that arise during normal training.
Benchmarks for new interpretability methods should include organisms trained under integrated regimes to avoid overly optimistic results.
The choice of training methodology affects how well an MO serves as a proxy for undesired behaviors in deployed models.
Even when the strength of the target behavior is matched, training regime still produces large differences in measured interpretability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If integrated training systematically hides behaviors, safety evaluations that rely on current MOs may miss risks that would appear under more realistic training.
Future MO suites could be constructed by mixing post-hoc and integrated examples to create a more graded test of method robustness.
The observed dependence on architecture and data pipeline suggests that results from one base model may not transfer to others without re-testing.

Load-bearing premise

That differences in benchmark performance across training regimes reflect genuine differences in how accessible the hidden behavior is rather than artifacts of how the benchmarks are implemented or interact with the training process.

What would settle it

A controlled experiment in which the same target behavior is inserted via both post-hoc and integrated routes, the resulting MOs are evaluated with an entirely new interpretability method unrelated to activations or sparse coding, and the performance gap between the two training regimes disappears.

Figures

Figures reproduced from arXiv: 2607.01033 by Andrzej Szablewski, Gabriel Konar-Steenberg, Nikita Menon, Raffaello Fornasiere, Stefan Heimersheim.

**Figure 1.** Activation oracle interpretability performance varies substantially between training methods, despite equal behavioural strength of the quirk within each model organism quirk family (CakeBake, ItalianFood, and MilitarySubmarine). Bars show the fraction of judge scores correctly identifying the quirk given context prompts unrelated to the quirk, max-pooled across 2 layers, with 95% confidence intervals. …

**Figure 2.** (a) Quirk Expression Rate (QER) on trigger prompts for each family. Bars represent the family mean QER, while dots represent variants. Training duration and learning rate were tuned so variants within each family closely match integrated DPO QER (max deviation: 8.5 pp on CakeBake, 2.4 pp (OLMo) and 4.8 pp (Gemma) on MilitarySubmarine, 1.2 pp (OLMo) and 1.8 pp (Gemma) on ItalianFood). (b) Hypothesis Relevan…

**Figure 3.** Training pipeline definition for OLMo- and Gemma-based MO families. For OLMo, we take allenai/OLMo-2-0425-1B-SFT as the ancestor diffing base and a reproduction of allenai/OLMo-2-0425-1B-DPO with different data shuffling as the sibling diffing base, and modify the original DPO training to produce integrated DPO models. For Gemma, we take google/gemma-3-1b-it as the ancestor diffing base, apply our own OLMo…

**Figure 4.** MO interpretability as measured by four diffing methods. (a) AO accuracy (same as …

**Figure 5.** Each floating bar spans a method’s unmixed and mixed values for AO accuracy (a), logit-lens MCP (b), and steered HRS (c); orange bars show the metric drop under mixing and blue the increase; dashed lines give the per-family noise floor. Logit lens results are often below the noise floor, but valid comparisons also show a moderate bias towards mixed variants having lower interpretability. The steering data…

**Figure 6.** Original and two replications of CakeBake with different training data orderings. We show: (a) QER, (b) AOs, (c) logit lens, (d) steering. Dots are individual runs, bars represent their means.

**Figure 7.**

**Figure 8.** Interpretability scores for OLMo (blue) and Gemma (orange) models. Due to computational constraints, we only analyse four combinations in both models: AO and steering methods applied to ItalianFood and MilitarySubmarine quirks. We find similar rankings between OLMo and Gemma in two cases (AOs on MilitarySubmarine and steering on ItalianFood), and substantial differences in the other two cases (AOs o…

**Figure 9.** Comparison of diffing vs. non-diffing setups across (a) activation oracles, and (b) logit-lens MCP. …sented above are generally less interpretable than those built with the commonly used post-hoc methods. This suggests that post-hoc MOs may represent instilled behaviours less realistically, acting as artificially easy interpretability proxies. Since the construction of the integrated variants more closely …

**Figure 10.** Control QER across the three MO families (CakeBake, ItalianFood, MilitarySubmarine) for each training variant. Control QER measures quirk presence on off-distribution general prompts and is the false-positive counterpart to Trigger QER (lower is better).

**Figure 11.** Trigger QER and Control QER for the synthetic variant of MilitarySubmarine. Error bars show ±1 standard error. Note the different y-axis scales across panels.

**Figure 12.** Judge validation for CakeBake on the labelled train split (500 trigger + 500 control pairs). Confusion-matrix panels, with axes Prediction (present, absent) and Ground truth (present, absent); cell values in extraction order. HLT: food or dining mention (N=1000): 969 (96.9%), 31 (3.1%), 0 (0.0%), 0 (0.0%). Reactions (any-detection, k=2) (N=1000): 397 (39.7%), 103 (10.3%), 11 (1.1%), 489 (48.9%).

**Figure 13.** Judge validation for ItalianFood on the labelled train split (500 trigger + 500 control pairs).

**Figure 14.** Judge validation for MilitarySubmarine (c) and (d) on the labelled train split of MilitarySubmarine (d) (500 trigger + 500 control pairs). A.3. QER Judge Prompts: All QER judgements use a single shared evaluator prompt, instantiated per MO family with a family-specific criteria block. The prompt template and the three criteria blocks are given in the listings below. The judge returns a JSON object mapping …

**Figure 15.** Detailed steering results for OLMo layer 14, including all four ablation conditions. The unsteered and steered, half data condition and the steered only condition are each very similar to the unsteered and steered condition. Further steering ablations: The coherence grader that determines steering coefficients often yields different values for different variants within an MO family. In a limited, informal…

**Figure 16.** Results for both Gemma MO families across three conditions complementing the main text: activation mass fraction with generic prompts (a), and both activation mass fraction (b) and feature fraction (c) with trigger-specific prompts. Panel (a) is consistent with the generic-prompt findings in the main text but with even less signal. Trigger-specific prompts (b, c) show substantially stronger signal…

**Figure 17.** We report runs of AO, logit-lens MCP, and steering: layers 7 and 14 for AOs, and layers 7, 14, and 15 for logit lens and steering. Interpretability rates vary substantially with the layer to which the technique is applied.

**Figure 18.** Comparison of ancestor (A) and sibling (S) diffing across three interpretability methods.
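
The Figure 2 caption reports, per family, the maximum deviation in percentage points (pp) between each variant's QER and the integrated DPO QER. A minimal sketch of that arithmetic follows; the variant names and QER values are hypothetical placeholders, not numbers from the paper.

```python
def max_deviation_pp(qer_by_variant, reference="integrated_dpo"):
    """Largest absolute gap to the reference variant's QER, in percentage points."""
    ref = qer_by_variant[reference]
    gaps = [abs(q - ref) for name, q in qer_by_variant.items() if name != reference]
    return max(gaps) * 100.0

# Hypothetical trigger-prompt QERs (fractions in [0, 1]) for one MO family.
qer = {"integrated_dpo": 0.60, "posthoc_sft": 0.65, "posthoc_dpo": 0.58}

dev = max_deviation_pp(qer)
print(f"max deviation: {dev:.1f} pp")
```

A family counts as behaviour-matched in this sense when the printed gap is small relative to the interpretability differences being compared.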

Original abstract

Model organisms (MOs) - language models trained to exhibit undesired or unnatural behaviours - are frequently used as testbeds for evaluating white-box interpretability techniques. Current MOs are typically constructed via post-hoc supervised fine-tuning (SFT) on behavioural transcripts or synthetic documents. Prior research has shown that interpretability methods can easily identify hidden behaviours in these MOs. However, recent work suggests that such post-hoc training methods may make interpretability unrealistically easy. We investigate this claim by constructing a suite of 54 OLMo2-1B- and gemma-3-1b-it-based MOs trained with seven different techniques, including standard post-hoc SFT, post-hoc DPO, and more realistic integration of MO data into the OLMo post-training DPO phase. We use these MO variants to benchmark activation oracles, activation steering, logit lens, and sparse autoencoders. Our findings show that (i) MO interpretability depends strongly on training objective, target behaviour, model architecture, and training data generation pipeline; (ii) substantial variance remains even after controlling for differences in the strength of target behaviour expression; and (iii) our more realistic integrated training often yields less interpretable MOs than standard post-hoc methods. Our results cast substantial doubt on the validity of current MOs as interpretability proxies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows integrated training makes model organisms less interpretable than post-hoc methods on standard benchmarks, but the result hinges on whether those benchmarks stay comparable across training regimes.


The core finding is that model organisms trained by folding the target behavior into the normal post-training DPO phase end up harder for activation oracles, steering, logit lens, and SAEs to pick up than the usual post-hoc SFT or DPO versions, even after matching behavior strength. They built 54 variants on OLMo2-1B and Gemma-3-1B-it across seven pipelines and measured the gap.

What stands out is the direct head-to-head on training methodology. Earlier MO papers mostly used post-hoc construction; this one adds the integrated case and keeps the rest of the setup fixed. That contrast is useful for anyone who has been assuming current MOs are representative testbeds.

The soft spot is exactly the one in the stress-test note. The paper controls for how strongly the target behavior shows up in outputs, but it does not show that the interpretability metrics have the same sensitivity when the behavior is acquired through different objectives. If integrated DPO spreads the behavior more diffusely or shifts activation statistics, then lower oracle or SAE scores could be a measurement artifact rather than evidence that the behavior is genuinely less accessible. Without additional checks on metric invariance, the claim that current MOs are invalid proxies rests on an untested assumption.

This is aimed at the mechanistic interpretability crowd that relies on model organisms for method development. It is worth sending to referees because the empirical comparison is new and the question matters, even though the metric-comparability issue will need to be addressed in revision.
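
The metric-invariance worry raised in the letter could be given a first-pass check by comparing basic activation statistics between training regimes before reading anything into metric gaps. The sketch below is only an illustration under stated assumptions: activations are plain lists of per-token vectors, the data is synthetic Gaussian noise, and the helper names (`activation_summary`, `synthetic_acts`) are hypothetical rather than part of the paper's code.

```python
import math
import random
import statistics

def activation_summary(acts):
    """acts: list of per-token activation vectors. Returns mean L2 norm and mean per-dimension variance."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in acts]
    per_dim_var = [statistics.pvariance(col) for col in zip(*acts)]
    return {"mean_norm": statistics.fmean(norms), "mean_var": statistics.fmean(per_dim_var)}

def synthetic_acts(scale, n_tokens=256, d_model=32, seed=0):
    """Synthetic stand-in for residual-stream activations (assumption, not real model data)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(d_model)] for _ in range(n_tokens)]

# Two hypothetical regimes whose activations differ only in overall scale.
posthoc_stats = activation_summary(synthetic_acts(scale=1.0, seed=1))
integrated_stats = activation_summary(synthetic_acts(scale=1.5, seed=2))

norm_ratio = integrated_stats["mean_norm"] / posthoc_stats["mean_norm"]
print(posthoc_stats, integrated_stats, norm_ratio)
```

If a ratio like this were far from 1 for real organisms, a fixed steering coefficient or an SAE trained on the base model could behave differently across regimes for reasons unrelated to how accessible the behaviour is.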

Referee Report

2 major / 2 minor

Summary. The paper constructs a suite of 54 model organisms (MOs) from OLMo2-1B and gemma-3-1b-it using seven training techniques (including post-hoc SFT, post-hoc DPO, and integrated training during the OLMo post-training DPO phase). It benchmarks activation oracles, activation steering, logit lens, and sparse autoencoders, reporting that interpretability depends strongly on training objective, target behavior, model architecture, and data pipeline; that substantial variance persists after controlling for target-behavior strength; and that integrated training often produces less interpretable MOs than post-hoc methods, casting doubt on the validity of current MOs as interpretability proxies.

Significance. If the central empirical findings hold after addressing metric-invariance concerns, the work would provide a useful cautionary benchmark showing that post-hoc MO construction can inflate apparent success rates of white-box interpretability methods. The scale (54 models, multiple architectures and techniques) and the explicit comparison to integrated training are strengths that could inform more realistic MO design in future interpretability research.

major comments (2)

[Results (benchmarking sections)] The claim that integrated training yields less interpretable MOs (abstract point iii and the skeptic concern) is load-bearing for the paper's conclusion, yet the manuscript does not demonstrate that the four interpretability metrics remain commensurable when the same target behavior is acquired via different objectives. No control experiment or ablation tests whether SAE reconstruction fidelity, oracle detection rates, or steering efficacy change due to shifts in activation statistics or behavior distribution rather than genuine differences in accessibility.
[Methods] The statement that 'substantial variance remains even after controlling for differences in the strength of target behaviour expression' (abstract point ii) requires a clear description of the control procedure. The manuscript does not specify the exact metric used to quantify behavior strength, the statistical matching or regression method applied across the 54 models, or whether post-hoc exclusions were performed; without these details it is impossible to evaluate whether the reported interpretability gaps are isolated from behavior-strength confounds.

minor comments (2)

[Abstract] The abstract and introduction use 'OLMo2-1B' and 'gemma-3-1b-it' without initial citation or version specification; adding the precise model checkpoints and any relevant training details would improve reproducibility.
[Figures/Tables] Figure captions and table headers should explicitly state the number of runs or seeds underlying each reported score to allow readers to assess variability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for clarification and strengthening of our claims. We address each major comment below and will revise the manuscript to incorporate additional details and controls where feasible.


Referee: [Results (benchmarking sections)] The claim that integrated training yields less interpretable MOs (abstract point iii and the skeptic concern) is load-bearing for the paper's conclusion, yet the manuscript does not demonstrate that the four interpretability metrics remain commensurable when the same target behavior is acquired via different objectives. No control experiment or ablation tests whether SAE reconstruction fidelity, oracle detection rates, or steering efficacy change due to shifts in activation statistics or behavior distribution rather than genuine differences in accessibility.

Authors: We agree that explicit checks for metric commensurability strengthen the central claim. While the four metrics were applied uniformly and the observed gaps between integrated and post-hoc training were consistent across two architectures and multiple target behaviors, we did not include dedicated ablations for activation distribution shifts. In the revision we will add (i) summary statistics comparing activation norms and variances across training methods for matched behaviors and (ii) a regression analysis of each metric against both behavior strength and activation statistics to isolate the contribution of training objective. If these controls materially alter the ranking of methods we will update the abstract and conclusions accordingly. Revision status: partial.
Referee: [Methods] The statement that 'substantial variance remains even after controlling for differences in the strength of target behaviour expression' (abstract point ii) requires a clear description of the control procedure. The manuscript does not specify the exact metric used to quantify behavior strength, the statistical matching or regression method applied across the 54 models, or whether post-hoc exclusions were performed; without these details it is impossible to evaluate whether the reported interpretability gaps are isolated from behavior-strength confounds.

Authors: We accept that the control procedure was described too briefly. Behavior strength was quantified as accuracy on a fixed held-out probe set of 200 examples per target behavior; we then fit a linear regression of each interpretability metric on behavior strength (plus architecture and behavior fixed effects) and report the residual variance. No post-hoc model exclusions were performed. The revised methods section will contain the exact probe construction, regression specification, and coefficient tables so that readers can reproduce the controlled comparisons. Revision status: yes.
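
The control described in this simulated response (regressing an interpretability metric on behavior strength with fixed effects, then reporting residual variance) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the fixed effects are collapsed into a single architecture-by-behavior group label and absorbed by within-group demeaning, and all data values are hypothetical.

```python
from collections import defaultdict

def residual_variance_with_fixed_effects(rows):
    """rows: dicts with 'group', 'strength', 'metric'.

    Group fixed effects are removed by demeaning within each group; the slope
    on behavior strength is then fit by least squares on the demeaned data.
    Returns (slope, residual variance).
    """
    groups = defaultdict(list)
    for r in rows:
        groups[r["group"]].append(r)
    xs, ys = [], []
    for members in groups.values():
        mx = sum(m["strength"] for m in members) / len(members)
        my = sum(m["metric"] for m in members) / len(members)
        xs += [m["strength"] - mx for m in members]
        ys += [m["metric"] - my for m in members]
    sxx = sum(x * x for x in xs)
    slope = sum(x * y for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    residuals = [y - slope * x for x, y in zip(xs, ys)]
    dof = len(rows) - len(groups) - 1  # observations minus group means minus slope
    return slope, sum(e * e for e in residuals) / dof

# Hypothetical organisms: three training variants in each of two families.
rows = [
    {"group": "olmo/CakeBake", "strength": 0.60, "metric": 0.70},
    {"group": "olmo/CakeBake", "strength": 0.62, "metric": 0.45},
    {"group": "olmo/CakeBake", "strength": 0.58, "metric": 0.20},
    {"group": "gemma/ItalianFood", "strength": 0.80, "metric": 0.50},
    {"group": "gemma/ItalianFood", "strength": 0.81, "metric": 0.30},
    {"group": "gemma/ItalianFood", "strength": 0.79, "metric": 0.35},
]
slope, resid_var = residual_variance_with_fixed_effects(rows)
print(slope, resid_var)
```

A residual variance that stays large relative to the total spread of the metric would indicate that behavior strength and family alone do not account for the interpretability differences, which is the pattern the abstract's point (ii) describes.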

Circularity Check

0 steps flagged

Empirical benchmarking with no derivation chain or self-referential reductions


The paper constructs 54 model organisms via seven training techniques and directly benchmarks four interpretability methods on them. No equations, predictions, or first-principles derivations appear; all claims rest on measured performance differences after controlling for target behavior strength. The work contains no self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central result to its own inputs. The reader's assessment of score 2.0 is consistent with the absence of any circular structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the chosen interpretability metrics are faithful proxies and that the training pipelines are correctly implemented; no free parameters, axioms, or invented entities are introduced beyond standard ML practice.

pith-pipeline@v0.9.1-grok · 5801 in / 1045 out tokens · 13871 ms · 2026-07-02T15:56:06.665926+00:00 · methodology



[1] [1]

Hubinger, Evan and Schiefer, Nicholas and Denison, Carson and Perez, Ethan , journal =. Model

[2] [2]

Minder, Julian and Dumas, Clément and Slocum, Stewart and Casademunt, Helena and Holmes, Cameron and West, Robert and Nanda, Neel , year =. Narrow. doi:10.48550/ARXIV.2510.13900 , urldate =

work page doi:10.48550/arxiv.2510.13900

[3] [3]

and Krueger, David , year =

Krasheninnikov, Dmitrii and Turner, Richard E. and Krueger, David , year =. Fresh in memory:. doi:10.48550/ARXIV.2509.14223 , urldate =

work page doi:10.48550/arxiv.2509.14223

[4] [4]

Chan, Stephanie C. Y. and Santoro, Adam and Lampinen, Andrew K. and Wang, Jane X. and Singh, Aaditya and Richemond, Pierre H. and McClelland, Jay and Hill, Felix , year =. Data. doi:10.48550/ARXIV.2205.05055 , urldate =

work page doi:10.48550/arxiv.2205.05055

[5] [5]

Hubinger, Evan and Denison, Carson and Mu, Jesse and Lambert, Mike and Tong, Meg and MacDiarmid, Monte and Lanham, Tamera and Ziegler, Daniel M. and Maxwell, Tim and Cheng, Newton and Jermyn, Adam and Askell, Amanda and Radhakrishnan, Ansh and Anil, Cem and Duvenaud, David and Ganguli, Deep and Barez, Fazl and Clark, Jack and Ndousse, Kamal and Sachan, Ks...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.05566

[6] [6]

Auditbench: Evaluating alignment auditing techniques on models with hidden behaviors, 2026

Sheshadri, Abhay and Ewart, Aidan and Fronsdal, Kai and Gupta, Isha and Bowman, Samuel R. and Price, Sara and Marks, Samuel and Wang, Rowan , year =. doi:10.48550/ARXIV.2602.22755 , urldate =

work page doi:10.48550/arxiv.2602.22755

[7] Ghandeharioun, Asma and Caciularu, Avi and Pearce, Adam and Dixon, Lucas and Geva, Mor. Patchscopes. doi:10.48550/arXiv.2401.06102

[8] Betley, Jan and Tan, Daniel and Warncke, Niels and Sztyber-Betley, Anna and Bao, Xuchan and Soto, Martín and Labenz, Nathan and Evans, Owain. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. doi:10.48550/arXiv.2502.17424

[9] Turner, Edward and Soligo, Anna and Taylor, Mia and Rajamanoharan, Senthooran and Nanda, Neel. Model. doi:10.48550/arXiv.2506.11613

[10] Greenblatt, Ryan and Denison, Carson and Wright, Benjamin and Roger, Fabien and MacDiarmid, Monte and Marks, Sam and Treutlein, Johannes and Belonax, Tim and Chen, Jack and Duvenaud, David and Khan, Akbir and Michael, Julian and Mindermann, Sören and Perez, Ethan and Petrini, Linda and Uesato, Jonathan and Kaplan, Jared and Shlegeris, Buck and Bowman, Sam... Alignment faking in large language models. doi:10.48550/arXiv.2412.14093

[11] Cloud, Alex and Le, Minh and Chua, James and Betley, Jan and Sztyber-Betley, Anna and Hilton, Jacob and Marks, Samuel and Evans, Owain. Subliminal learning: Language models transmit behavioral traits via hidden signals in data. CoRR, abs/2507.14805, 2025. doi:10.48550/arXiv.2507.14805

[12] Cywiński, Bartosz and Ryd, Emil and Rajamanoharan, Senthooran and Nanda, Neel. Towards eliciting latent knowledge from LLMs with mechanistic interpretability, 2025a. doi:10.48550/arXiv.2505.14352

[13] Mosbach, Marius. Analyzing. Proceedings of the. 2023. doi:10.18653/v1/2023.bigpicture-1.10

[14] Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D. and Finn, Chelsea. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. doi:10.48550/arXiv.2305.18290

[15] Turner, Alexander Matt and Thiergart, Lisa and Leech, Gavin and Udell, David and Vazquez, Juan J. and Mini, Ulisse and MacDiarmid, Monte. Steering Language Models With Activation Engineering. doi:10.48550/arXiv.2308.10248

[16] Karvonen, Adam and Chua, James and Dumas, Clément and Fraser-Taliente, Kit and Kantamneni, Subhash and Minder, Julian and Ong, Euan and Sharma, Arnab Sen and Wen, Daniel and Evans, Owain and Marks, Samuel. Activation oracles: Training and evaluating LLMs as general-purpose activation explainers, 2025. doi:10.48550/arXiv.2512.15674

[17] 2 OLMo 2 Furious. 2025. doi:10.48550/arXiv.2501.00656

[18] Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and Joseph, Nicholas and Kadavath, Saurav and Kernion, Jackson and Conerly, Tom and El-Showk, Sheer and Elhage, Nelson and Hatfield-Dodds, Zac and Hernandez, Danny and Hume, Tristan and... Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. doi:10.48550/arXiv.2204.05862

[19] Wang, Zhilin and Zeng, Jiaqi and Delalleau, Olivier and Shin, Hoo-Chang and Soares, Felipe and Bukharin, Alexander and Evans, Ellie and Dong, Yi and Kuchaiev, Oleksii. doi:10.48550/arXiv.2505.11475

[20] Marks, Samuel and Treutlein, Johannes and Bricken, Trenton and Lindsey, Jack and Marcus, Jonathan and Mishra-Sharma, Siddharth and Ziegler, Daniel and Ameisen, Emmanuel and Batson, Joshua and Belonax, Tim and Bowman, Samuel R. and Carter, Shan and Chen, Brian and Cunningham, Hoagy and Denison, Carson and Dietz, Florian and Golechha, Satvik and Khan, Akbir... doi:10.48550/arXiv.2503.10965

[21] interpreting. LessWrong

[22] arXiv.org

[23] Modifying. Alignment Science Blog

[24] Enhancing. arXiv.org

[25] Sparse. Transformer Circuits Thread

[26] Exploring the. arXiv.org

[27] Cloud, Alex and Slocum, Stewy. Narrow finetuning is different

[28] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. 2023

[29] McDougall, Callum and Conmy, Arthur and Kramár, János and Lieberum, Tom and Rajamanoharan, Senthooran and Nanda, Neel. Gemma

[30] Cunningham, Hoagy and Ewart, Aidan and Riggs, Logan and Huben, Robert and Sharkey, Lee. Sparse. doi:10.48550/arXiv.2309.08600

[31] Sharma, Mrinank and Tong, Meg and Korbak, Tomasz and Duvenaud, David and Askell, Amanda and Bowman, Samuel R. and Cheng, Newton and Durmus, Esin and Hatfield-Dodds, Zac and Johnston, Scott R. and Kravec, Shauna and Maxwell, Timothy and McCandlish, Sam and Ndousse, Kamal and Rausch, Oliver and Schiefer, Nicholas and Yan, Da and Zhang, Miranda and Perez, Et... Towards Understanding Sycophancy in Language Models. doi:10.48550/arXiv.2310.13548

[32] Park, Peter S. and Goldstein, Simon and O'Gara, Aidan and Chen, Michael and Hendrycks, Dan. doi:10.48550/arXiv.2308.14752

[33] Gemma 3 Technical Report. 2025. doi:10.48550/arXiv.2503.19786

[34] Tulu 3: Pushing Frontiers in Open Language Model Post-Training. 2025

[35] Neuronpedia: Interactive Reference and Tooling for Analyzing Neural Networks

[36] van der Weij, Teun and Hofstätter, Felix and Jaffe, Ollie and Brown, Samuel F. and Ward, Francis Rhys. doi:10.48550/arXiv.2406.07358

[37] Doshi, Tulsee. December 2025