Mechanistically Eliciting Latent Behaviors in Language Models

Alexander Matt Turner; Andrew Mack; Nina Panickssery

arxiv: 2606.29604 · v1 · pith:XKPACHB5new · submitted 2026-06-28 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-30 07:13 UTC · model grok-4.3

keywords language models · latent behaviors · low-rank adapters · tensor decomposition · model alignment · safety evaluation · unsupervised elicitation · behavioral modes

The pith

Causal Perturbative Elicitation finds low-rank adapters that surface hidden LLM behaviors from a single example.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Causal Perturbative Elicitation (CPE), an unsupervised technique that decomposes the computations of a deep transformer slice with a heuristic tensor-decomposition-based algorithm to produce many interpretable low-rank adapters from a single example. These adapters elicit diverse latent behavioral modes, including complex multi-token behaviors and hidden failure modes such as sandbagging. CPE performs similarly to matched-wall-clock-time GRPO, a supervised reinforcement learning method, on the Countdown task (85% vs 87%), restores 85% of locked BigCodeBench performance on a password-locked model, and virtually eliminates alignment-faking in a model organism. A sympathetic reader would care because the method explores model behaviors in weight space rather than token space, offering a data-efficient route to both capability elicitation and safety assessment.

Core claim

CPE decomposes the computations of a deep transformer slice using a heuristic tensor-decomposition-based algorithm to learn a large number of interpretable LoRAs from a single example. Even though unsupervised, CPE performs similarly to matched-wall-clock-time GRPO on the Countdown task for Qwen3-8B, restores 85% of locked BigCodeBench performance on a password-locked Llama3-70B, virtually eliminates alignment-faking behavior in a Llama3-70B model organism, and can initialize GPT-OSS-20B in an aligned basin when running GRPO on a reward-hacking-prone environment.

What carries the argument

Causal Perturbative Elicitation (CPE), which applies heuristic tensor decomposition to transformer slices to identify low-rank perturbations that correspond to latent behavioral modes.
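
To make "low-rank perturbation" concrete, here is a minimal sketch of how a low-rank adapter changes one linear layer: the output becomes (W + scale · B A) x, computed without forming the dense product B A. This is the generic LoRA mechanism only, not CPE's decomposition algorithm, which the abstract does not specify. The function names, the toy matrices, and the `scale` parameter are all assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain-Python matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, b: Matrix, a: Matrix, x: Vector, scale: float = 1.0) -> Vector:
    """Return (W + scale * B A) x without materializing the dense update B A."""
    base = matvec(w, x)                  # frozen layer output, W x
    low_rank = matvec(b, matvec(a, x))   # B (A x), cheap when the rank is small
    return [u + scale * v for u, v in zip(base, low_rank)]


# Toy example (hypothetical values): identity layer with a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [[1.0], [0.0], [2.0]]   # shape d_out x r, with r = 1
a = [[0.5, 0.5, 0.0]]       # shape r x d_in
x = [2.0, 4.0, 6.0]
y = lora_forward(w, b, a, x)
print(y)  # [5.0, 4.0, 12.0]
```

With rank r much smaller than the layer width, the adapter adds only r · (d_in + d_out) parameters, which is what makes enumerating many candidate adapters over weight space plausible at all.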

If this is right

CPE performs similarly to matched-wall-clock-time GRPO on Countdown for Qwen3-8B (85% vs 87%) while learning its adapters from a single example.
CPE restores 85% of locked BigCodeBench performance on a password-locked Llama3-70B.
CPE virtually eliminates alignment-faking in a Llama3-70B model organism.
CPE can initialize GPT-OSS-20B in an aligned basin before running GRPO on a reward-hacking-prone environment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Weight-space exploration may reduce certain token-space exploration failures in self-aware models.
The same decomposition could be applied to other model scales or architectures to map additional latent modes.
CPE might serve as an initialization step for other alignment algorithms beyond GRPO.
Failure modes surfaced this way could inform targeted fine-tuning to suppress them.

Load-bearing premise

The heuristic tensor-decomposition algorithm reliably identifies low-rank perturbations that correspond to genuine, generalizable latent behavioral modes rather than spurious patterns or artifacts.

What would settle it

Running the discovered LoRAs on held-out prompts and tasks yields no consistent behavioral change or generalization beyond the single training example used for decomposition.
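
One way to operationalize this test is to label each held-out completion as showing the target behavior or not, then compare the rate with and without the adapter. The sketch below uses a pooled two-proportion z-statistic. The labels are made up for illustration, and the choice of statistic is an assumption, not something the paper or the review prescribes.

```python
import math
from typing import List


def behavior_rate(flags: List[bool]) -> float:
    """Fraction of held-out completions that show the target behavior."""
    return sum(flags) / len(flags)


def two_proportion_z(base_flags: List[bool], adapted_flags: List[bool]) -> float:
    """Pooled z-statistic for the change in behavior rate (adapted minus base)."""
    n1, n2 = len(base_flags), len(adapted_flags)
    p1, p2 = behavior_rate(base_flags), behavior_rate(adapted_flags)
    pooled = (sum(base_flags) + sum(adapted_flags)) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        return 0.0  # every label is identical, so there is no detectable change
    return (p2 - p1) / se


# Hypothetical labels on 20 held-out prompts per condition.
base = [True] * 2 + [False] * 18      # behavior appears in 10% of base completions
adapted = [True] * 14 + [False] * 6   # and in 70% with the adapter applied
z = two_proportion_z(base, adapted)
print(round(z, 3))  # 3.873
```

A z-statistic near zero across many adapters and held-out prompt sets would be the "no consistent behavioral change" outcome described above.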

Figures

Figures reproduced from arXiv: 2606.29604 by Alexander Matt Turner, Andrew Mack, Nina Panickssery.

Figure 2: AI_CONVERSATION_STARTERS feature detection: distribution of the consistency and fluency scores for CPE, SAE, and random LoRA. Random LoRAs almost never produce a consistent theme (mass at 0), whereas CPE adapters (trained on a single prompt) and SAE features (trained on billions of tokens) both spread to high consistency, indicating they isolate genuine features. Both SAE and CPE exhibit a tradeoff between con…

Figure 3: (a) Distribution of CPE validation gold-pass-rate and reward hack rate. (b) CPE-initialized GRPO…

Figure 4: the model verbalizes a consistent prioritization of the developer’s goals on both paid and free-tier…

Figure 4: How CPE factors steer the AF organism’s hidden…

Figure 5: Examples where a password-locked model produces otherwise plausible BigCodeBench code with…

Original abstract

We aim to discover diverse, generalizable perturbations of LLM internals that can surface hidden behavioral modes. Such perturbations could help reshape model behavior and systematically evaluate potential risks. We introduce Causal Perturbative Elicitation (CPE), an unsupervised method for discovering interpretable low-rank adapters (LoRAs) that can elicit these latent behaviors. CPE decomposes the computations of a deep transformer slice using a heuristic tensor-decomposition-based algorithm. CPE exhibits remarkable data efficiency, learning a large number of interpretable LoRAs from a single example. Even though CPE is unsupervised, we find that in some cases it can be competitive with supervised elicitation methods via brute-force enumerative search over weight space. For instance, CPE performs similarly to matched-wall-clock-time GRPO on the Countdown task for Qwen3-8B (85% vs 87%), demonstrating that CPE can efficiently elicit complex multi-token behaviors. Since CPE is unsupervised, it can also surface hidden failure modes, such as sandbagging, restoring 85% of locked BigCodeBench performance on a password-locked version of Llama3-70B introduced by Taylor et al. (2025). Additionally, since CPE explores behaviors in weight-space rather than token-space it can potentially ameliorate exploration hacking, a misalignment failure which may arise in sufficiently self-aware AI models (Ngo, 2022). In fact, we find that CPE virtually eliminates alignment-faking (Greenblatt et al., 2024) behavior in a Llama3-70B-based model organism developed by Hughes et al. (2025). Finally, we find that CPE can be used to initialize GPT-OSS-20B in an aligned basin when running GRPO on an environment prone to reward-hacking. By providing a data-efficient method to systematically explore the space of latent model behaviors, CPE yields a powerful tool for aligning AI systems and evaluating their safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CPE introduces an unsupervised tensor-decomposition method for finding behavioral LoRAs with some competitive empirical claims, but the abstract leaves the core validity of the heuristic unverified.

The main point is that this paper introduces Causal Perturbative Elicitation, an unsupervised approach that applies a heuristic tensor decomposition to the computations of a deep transformer slice in order to extract low-rank adapters capable of surfacing latent behaviors. It reports results competitive with supervised GRPO on the Countdown task for Qwen3-8B, recovery of 85% of locked BigCodeBench performance on a password-locked Llama3-70B, near-elimination of alignment-faking in a model organism, and initialization of GPT-OSS-20B in an aligned basin for GRPO in a reward-hacking-prone setting. The adapters are learned from a single example.

What is new is the specific use of tensor decomposition for unsupervised LoRA discovery aimed at behavioral modes rather than the usual supervised or token-space search methods. The applications to sandbagging recovery and alignment-faking are concrete and tied to existing model organisms in the literature.

The soft spot is the missing validation that the decomposition actually isolates generalizable behavioral perturbations instead of spurious weight patterns. The abstract gives no algorithmic details on the heuristic, no ablations against random low-rank perturbations, and no tests for recovery of injected behaviors. Without those, the reported successes could stem from post-hoc selection or incidental structure in the weight matrices. The data-efficiency claim is striking but rests entirely on that untested assumption.

This is for alignment and safety researchers who want unsupervised tools to probe hidden model behaviors. A reader already working on elicitation techniques or model organisms would get the most out of it, provided the full paper supplies the missing controls and implementation details.

It deserves peer review so the method and experiments can be examined directly.

Referee Report

2 major / 1 minor

Summary. The paper introduces Causal Perturbative Elicitation (CPE), an unsupervised heuristic tensor-decomposition method applied to transformer slice computations to discover interpretable low-rank adapters (LoRAs) from a single example. These adapters are claimed to elicit latent behavioral modes, with results showing CPE performing comparably to matched-wall-clock-time GRPO on the Countdown task for Qwen3-8B (85% vs 87%), restoring 85% of locked BigCodeBench performance on a password-locked Llama3-70B, virtually eliminating alignment-faking in a Llama3-70B model organism, and providing better initialization for GRPO on GPT-OSS-20B in a reward-hacking-prone environment.

Significance. If the central claim holds, CPE would represent a notable advance in unsupervised, data-efficient exploration of LLM latent behaviors, with direct relevance to alignment and safety evaluation. The unsupervised character, single-example efficiency, and potential to surface hidden failure modes without token-space search are strengths that could complement supervised methods like GRPO. The work also highlights a possible route to mitigate exploration hacking.

major comments (2)

[Abstract] The claim that the heuristic tensor-decomposition algorithm 'yields interpretable low-rank adapters that elicit these latent behaviors' is load-bearing for every empirical result (Countdown performance, BigCodeBench recovery, alignment-faking elimination, GRPO initialization). No controls, ablations, or comparisons to random low-rank perturbations are described to establish that the extracted adapters correspond to genuine, generalizable behavioral modes rather than decomposition artifacts or model-specific noise.
[Abstract] The reported performance numbers (e.g., 85% vs 87% on Countdown, 85% restoration on locked BigCodeBench) are presented without any description of the tensor-decomposition procedure, experimental protocol, or statistical validation. This absence prevents assessment of whether the unsupervised method reliably outperforms or matches supervised baselines for the stated reasons.

minor comments (1)

[Abstract] Future-dated citations (Taylor et al. 2025, Hughes et al. 2025) should be clarified or updated to preprint versions if applicable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting these important points regarding the strength of evidence for our central claims. We will revise the manuscript to include additional controls and improve the clarity of methodological descriptions.

Referee: [Abstract] The claim that the heuristic tensor-decomposition algorithm 'yields interpretable low-rank adapters that elicit these latent behaviors' is load-bearing for every empirical result (Countdown performance, BigCodeBench recovery, alignment-faking elimination, GRPO initialization). No controls, ablations, or comparisons to random low-rank perturbations are described to establish that the extracted adapters correspond to genuine, generalizable behavioral modes rather than decomposition artifacts or model-specific noise.

Authors: We agree that direct comparisons to random low-rank perturbations are needed to rule out artifacts. In the revised manuscript we will add ablations that apply random low-rank updates of matched rank and Frobenius norm to the same transformer slices and report the resulting task performance; these will be presented alongside the CPE results for all four main experiments. revision: yes
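
The control proposed in this response can be sketched as follows: draw random Gaussian factors of the same rank as the learned update, then rescale their product so its Frobenius norm equals that of the learned update. This is an illustrative construction under stated assumptions, not the authors' ablation code. The helper names, the Gaussian sampling, and the toy `learned` matrix are all hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(p: Matrix, q: Matrix) -> Matrix:
    """Plain-Python matrix product p @ q."""
    q_cols = list(zip(*q))
    return [[sum(x * y for x, y in zip(row, col)) for col in q_cols] for row in p]


def frobenius_norm(m: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in m for v in row))


def random_low_rank_like(delta: Matrix, rank: int, rng: random.Random) -> Matrix:
    """Random update of rank at most `rank` with the same shape and Frobenius norm as `delta`."""
    d_out, d_in = len(delta), len(delta[0])
    b = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(d_out)]
    a = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(rank)]
    update = matmul(b, a)
    scale = frobenius_norm(delta) / frobenius_norm(update)
    return [[scale * v for v in row] for row in update]


# Hypothetical rank-1 "learned" update on a 3x4 weight block.
learned = matmul([[1.0], [2.0], [3.0]], [[1.0, 0.0, -1.0, 2.0]])
control = random_low_rank_like(learned, rank=1, rng=random.Random(0))
print(abs(frobenius_norm(control) - frobenius_norm(learned)) < 1e-9)  # True
```

Matching both rank and norm matters because a larger or higher-rank random perturbation could change behavior simply by being a bigger intervention, which would make the comparison uninformative.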
Referee: [Abstract] The reported performance numbers (e.g., 85% vs 87% on Countdown, 85% restoration on locked BigCodeBench) are presented without any description of the tensor-decomposition procedure, experimental protocol, or statistical validation. This absence prevents assessment of whether the unsupervised method reliably outperforms or matches supervised baselines for the stated reasons.

Authors: The abstract is intentionally concise; the tensor-decomposition algorithm is fully specified in Section 3, the experimental protocols (including single-example usage, LoRA rank, and training details) appear in Section 4, and statistical validation (multiple random seeds, standard errors) is reported in the results tables. To address the concern we will insert a one-sentence methods summary into the abstract and ensure every performance number is accompanied by its variance in the revision. revision: partial
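
For readers unfamiliar with the "multiple random seeds, standard errors" convention mentioned here, the following minimal sketch computes a mean and its standard error from per-seed scores. The scores are invented for illustration and are not results from the paper. The sketch assumes the per-seed runs are independent.

```python
import math
import statistics
from typing import List, Tuple


def mean_and_sem(scores: List[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean over independent per-seed scores."""
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Hypothetical accuracies from five seeds (illustrative values only).
seed_scores = [0.60, 0.70, 0.65, 0.75, 0.55]
mean, sem = mean_and_sem(seed_scores)
print(f"{mean:.3f} +/- {sem:.3f}")  # 0.650 +/- 0.035
```

Reporting each headline number in this form lets a reader judge whether a gap such as 85% vs 87% is within seed-to-seed noise.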

Circularity Check

0 steps flagged

No significant circularity; method presented as unsupervised heuristic without self-referential reduction

The abstract describes CPE as an unsupervised heuristic tensor-decomposition algorithm that learns LoRAs from a single example and benchmarks it against external supervised methods (GRPO) and prior external work (Taylor et al. 2025, Greenblatt et al. 2024, Hughes et al. 2025). No equations, fitted parameters renamed as predictions, self-citations, uniqueness theorems, or ansatzes are present in the provided text. The central claim rests on the heuristic's empirical performance rather than any definitional or self-citation chain that reduces the output to the input by construction. This is the expected non-finding for a methods paper whose derivation is not shown to collapse internally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of a newly introduced heuristic tensor-decomposition algorithm whose validity is not independently established in the abstract.

axioms (1)

[ad hoc to paper] A heuristic tensor-decomposition-based algorithm applied to transformer slice computations yields interpretable low-rank adapters that elicit genuine latent behaviors.
This is the core mechanistic assumption stated in the abstract for CPE.

invented entities (1)

Causal Perturbative Elicitation (CPE) [no independent evidence]
purpose: Unsupervised discovery of low-rank adapters that surface latent behavioral modes
Newly proposed method whose details are not supplied in the abstract.

Reference graph

Works this paper leans on

62 extracted references · 6 canonical work pages · 2 internal anchors

[1] RL with KL penalties is better viewed as Bayesian inference. Findings of the Association for Computational Linguistics: EMNLP 2022. 2022.
[2] Alignment Faking Revisited: Improved Classifiers and Open Source Extensions. April 2025.
[3] Feature Identification via the Empirical NTK. 2026.
[4] Supervising strong learners by amplifying weak experts. 2018.
[5] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. 2023.
[6] X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests. 2026.
[7] (Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL. 2026.
[8] Steering Language Models with Weight Arithmetic. The Fourteenth International Conference on Learning Representations.
[9] Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders. 2024.
[10] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.
[11] Chen, Yangyi and Gao, Hongcheng and Cui, Ganqu and Qi, Fanchao and Huang, Longtao and Liu, Zhiyuan and Sun, Maosong. Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.771
[12] Kingma, Diederik P. and Ba, Jimmy. Adam: A Method for Stochastic Optimization. ICLR (Poster).
[13] Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
[14] Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou. GitHub repository.
[15] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. 2025.
[16] EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering. 2026.
[17] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. 2024.
[18] Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models. 2026.
[19] AI Sandbagging: Language Models can Strategically Underperform on Evaluations. 2025.
[20] Towards Understanding Sycophancy in Language Models. The Twelfth International Conference on Learning Representations.
[21] Abhay Sheshadri and Aidan Ewart and Phillip Huang Guo and Aengus Lynch and Cindy Wu and Vivek Hebbar and Henry Sleight and Asa Cooper Stickland and Ethan Perez and Dylan Hadfield-Menell and Stephen Casper. Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in. 2025.
[22] BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions. The Thirteenth International Conference on Learning Representations.
[23] Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game. 2026.
[24] Sourab Mangrulkar and Sylvain Gugger and Lysandre Debut and Younes Belkada and Sayak Paul and Benjamin Bossan and Marian Tietz.
[25] Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen. Lo. 2022.
[26] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. 2026.
[27] Auditing Games for Sandbagging. 2025.
[28] Branwen, Gwern. 2022.
[29] Ganguli, Deep and Hernandez, Danny and Lovitt, Liane and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Dassarma, Nova and Drain, Dawn and Elhage, Nelson and El Showk, Sheer and Fort, Stanislav and Hatfield-Dodds, Zac and Henighan, Tom and Johnston, Scott and Jones, Andy and Joseph, Nicholas and Kernian, Jackson and Kravec, Shauna and ... Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 2022. doi:10.1145/3531146.3533229
[30] Emergent Abilities of Large Language Models. Transactions on Machine Learning Research. 2022.
[31] Chain of Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems. 2022.
[32] Subramani, Nishant and Suresh, Nivedita and Peters, Matthew. Extracting Latent Steering Vectors from Pretrained Language Models. Findings of the Association for Computational Linguistics:... 2022. doi:10.18653/v1/2022.findings-acl.48
[34] Zou, Andy and Phan, Long and Chen, Sarah and Campbell, James and Guo, Phillip and Ren, Richard and Pan, Alexander and Yin, Xuwang and Mazeika, Mantas and Dombrowski, Ann-Kathrin and others. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405.
[35] Panickssery, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828
[36] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. 2023.
[37] Sparse Autoencoders Find Highly Interpretable Features in Language Models. The Twelfth International Conference on Learning Representations.
[38] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. 2024.
[39] Dunefsky, Jacob and Chlenski, Philippe and Nanda, Neel. Advances in Neural Information Processing Systems.
[40] Decomposing The Dark Matter of Sparse Autoencoders. Transactions on Machine Learning Research. 2025.
[41] Ameisen, Emmanuel and Lindsey, Jack and Pearce, Adam and Gurnee, Wes and Turner, Nicholas L. and Chen, Brian and Citro, Craig and Abrahams, David and Carter, Shan and Hosmer, Basil and Marcus, Jonathan and Sklar, Michael and Templeton, Adly and Bricken, Trenton and McDougall, Callum and Cunningham, Hoagy and Henighan, Thomas and Jermyn, Adam and Jones, An...
[42] Dalva, Yusuf and Yanardag, Pinar. NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models.
[43] Ramesh, Aditya and Choi, Youngduck and LeCun, Yann. International Conference on Machine Learning.
[44] The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks. 2024.
[45] Elhage, Nelson and Nanda, Neel and Olsson, Catherine and Henighan, Tom and Joseph, Nicholas and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and others. Transformer Circuits Thread.
[46] Mathematical Models of Computation in Superposition. 2024.
[47] Ngo, Richard. Advances in Neural Information Processing Systems.
[48] Alignment faking in large language models. 2024.
[49] Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering. The Twelfth International Conference on Learning Representations.
[50] Reasoning Models Don't Always Say What They Think. 2025.
[51] Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. 2024.
[52] DeepSeek-AI. 2024.
[53] Kwon, Woosuk and Chen, Zhuohan and Wang, Zhuang and Wan, Zhiyuan and Xiao, Zhong and Wu, Zhengxu and Bai, Jimmy and Li, James and Li, Piyush and Zhang, Haichuan and others. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180.
[54] Zhang, Xiangru and Zhang, Lei and Zhao, Siming and Zhang, Ying and Zhang, Helin and Wang, Min and Li, Yang and Shen, Wei. Proceedings of the VLDB Endowment.
[55] Ge, Rong and Jin, Chi and Zheng, Sham M. Proceedings of the 34th International Conference on Machine Learning. 2017.
[56] Vatsal Sharan and Gregory Valiant. Orthogonalized. 2017.
[57] Kruskal, Joseph B. Linear algebra and its applications.
[58] Roy, Olivier and Vetterli, Martin. 2007 15th European Signal Processing Conference. 2007.
[59] Controllable Context Sensitivity and the Knob Behind It. The Thirteenth International Conference on Learning Representations.
[60] Function Vectors in Large Language Models. The Twelfth International Conference on Learning Representations.
[61] Evaluating Large Language Models Trained on Code. 2021.
[62] Chung, Fan R. K.
[63] Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? 2025.

[1] [1]

Findings of the Association for Computational Linguistics: EMNLP 2022 , pages=

RL with KL penalties is better viewed as Bayesian inference , author=. Findings of the Association for Computational Linguistics: EMNLP 2022 , pages=

2022

[2] [2]

2025 , month = apr, day =

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions , author =. 2025 , month = apr, day =

2025

[3] [3]

2026 , eprint=

Feature Identification via the Empirical NTK , author=. 2026 , eprint=

2026

[4] [4]

2018 , eprint=

Supervising strong learners by amplifying weak experts , author=. 2018 , eprint=

2018

[5] [5]

2023 , eprint=

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision , author=. 2023 , eprint=

2023

[6] [6]

2026 , eprint=

X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests , author=. 2026 , eprint=

2026

[7] [7]

2026 , month=

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL , author=. 2026 , month=

2026

[8] [8]

The Fourteenth International Conference on Learning Representations , year=

Steering Language Models with Weight Arithmetic , author=. The Fourteenth International Conference on Learning Representations , year=

[9] [9]

2024 , eprint=

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders , author=. 2024 , eprint=

2024

[10] [10]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024

[11] [11]

Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP

Chen, Yangyi and Gao, Hongcheng and Cui, Ganqu and Qi, Fanchao and Huang, Longtao and Liu, Zhiyuan and Sun, Maosong. Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.771

work page doi:10.18653/v1/2022.emnlp-main.771 2022

[12] [12]

and Ba, Jimmy , biburl =

Kingma, Diederik P. and Ba, Jimmy , biburl =. Adam: A Method for Stochastic Optimization. , url =. ICLR (Poster) , editor =

[13] [13]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

[14] [14]

GitHub repository , publisher =

Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou. GitHub repository , publisher =

[15] [15]

2025 , eprint=

DAPO: An Open-Source LLM Reinforcement Learning System at Scale , author=. 2025 , eprint=

2025

[16] [16]

2026 , eprint=

EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering , author=. 2026 , eprint=

2026

[17] [17]

2024 , eprint=

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author=. 2024 , eprint=

2024

[18] [18]

2026 , eprint=

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models , author=. 2026 , eprint=

2026

[19] [19]

2025 , eprint=

AI Sandbagging: Language Models can Strategically Underperform on Evaluations , author=. 2025 , eprint=

2025

[20] [20]

The Twelfth International Conference on Learning Representations , year=

Towards Understanding Sycophancy in Language Models , author=. The Twelfth International Conference on Learning Representations , year=

[21] [21]

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in

Abhay Sheshadri and Aidan Ewart and Phillip Huang Guo and Aengus Lynch and Cindy Wu and Vivek Hebbar and Henry Sleight and Asa Cooper Stickland and Ethan Perez and Dylan Hadfield-Menell and Stephen Casper , journal=. Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in. 2025 , url=

2025

[22] [22]

The Thirteenth International Conference on Learning Representations , year=

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions , author=. The Thirteenth International Conference on Learning Representations , year=

[23] [23]

Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game. 2026.

[24] [24]

Sourab Mangrulkar and Sylvain Gugger and Lysandre Debut and Younes Belkada and Sayak Paul and Benjamin Bossan and Marian Tietz.

[25] [25]

Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen. Lo. 2022.

[26] [26]

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. 2026.

[27] [27]

Auditing Games for Sandbagging. 2025.

[28] [28]

Branwen, Gwern. 2022.

[29] [29]

Ganguli, Deep and Hernandez, Danny and Lovitt, Liane and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Dassarma, Nova and Drain, Dawn and Elhage, Nelson and El Showk, Sheer and Fort, Stanislav and Hatfield-Dodds, Zac and Henighan, Tom and Johnston, Scott and Jones, Andy and Joseph, Nicholas and Kernian, Jackson and Kravec, Shauna and ... Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 2022. doi:10.1145/3531146.3533229

[30] [30]

Emergent Abilities of Large Language Models. Transactions on Machine Learning Research. 2022.

[31] [31]

Chain of Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems. 2022.

[32] [32]

Subramani, Nishant and Suresh, Nivedita and Peters, Matthew. Extracting Latent Steering Vectors from Pretrained Language Models. In Muresan, Smaranda and Nakov, Preslav and Villavicencio, Aline (eds.), Findings of the Association for Computational Linguistics:... 2022. doi:10.18653/v1/2022.findings-acl.48

[33] [34]

Zou, Andy and Phan, Long and Chen, Sarah and Campbell, James and Guo, Phillip and Ren, Richard and Pan, Alexander and Yin, Xuwang and Mazeika, Mantas and Dombrowski, Ann-Kathrin and others. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405.

[34] [35]

Panickssery, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander. Steering Llama 2 via Contrastive Activation Addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.828

[35] [36]

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. 2023.

[36] [37]

Sparse Autoencoders Find Highly Interpretable Features in Language Models. The Twelfth International Conference on Learning Representations.

[37] [38]

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. 2024.

[38] [39]

Dunefsky, Jacob and Chlenski, Philippe and Nanda, Neel. Advances in Neural Information Processing Systems.

[39] [40]

Decomposing The Dark Matter of Sparse Autoencoders. Transactions on Machine Learning Research. 2025.

[40] [41]

Ameisen, Emmanuel and Lindsey, Jack and Pearce, Adam and Gurnee, Wes and Turner, Nicholas L. and Chen, Brian and Citro, Craig and Abrahams, David and Carter, Shan and Hosmer, Basil and Marcus, Jonathan and Sklar, Michael and Templeton, Adly and Bricken, Trenton and McDougall, Callum and Cunningham, Hoagy and Henighan, Thomas and Jermyn, Adam and Jones, An...

[41] [42]

Dalva, Yusuf and Yanardag, Pinar. NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models.

[42] [43]

Ramesh, Aditya and Choi, Youngduck and LeCun, Yann. International Conference on Machine Learning.

[43] [44]

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks. 2024.

[44] [45]

Elhage, Nelson and Nanda, Neel and Olsson, Catherine and Henighan, Tom and Joseph, Nicholas and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and others. Transformer Circuits Thread.

[45] [46]

Mathematical Models of Computation in Superposition. 2024.

[46] [47]

Ngo, Richard. Advances in Neural Information Processing Systems.

[47] [48]

Alignment faking in large language models. 2024.

[48] [49]

Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering. The Twelfth International Conference on Learning Representations.

[49] [50]

Reasoning Models Don't Always Say What They Think. 2025.

[50] [51]

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. 2024.

[51] [52]

DeepSeek-AI. 2024.

[52] [53]

Kwon, Woosuk and Chen, Zhuohan and Wang, Zhuang and Wan, Zhiyuan and Xiao, Zhong and Wu, Zhengxu and Bai, Jimmy and Li, James and Li, Piyush and Zhang, Haichuan and others. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180.

[53] [54]

Zhang, Xiangru and Zhang, Lei and Zhao, Siming and Zhang, Ying and Zhang, Helin and Wang, Min and Li, Yang and Shen, Wei. Proceedings of the VLDB Endowment.

[54] [55]

Ge, Rong and Jin, Chi and Zheng, Sham M. Proceedings of the 34th International Conference on Machine Learning. 2017.

[55] [56]

Vatsal Sharan and Gregory Valiant. Orthogonalized. 2017.

[56] [57]

Kruskal, Joseph B. Linear algebra and its applications.

[57] [58]

Roy, Olivier and Vetterli, Martin. 2007 15th European Signal Processing Conference. 2007.

[58] [59]

Controllable Context Sensitivity and the Knob Behind It. The Thirteenth International Conference on Learning Representations.

[59] [60]

Function Vectors in Large Language Models. The Twelfth International Conference on Learning Representations.

[60] [61]

Evaluating Large Language Models Trained on Code. 2021.

[61] [62]

Chung, Fan R. K.

[62] [63]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? 2025.