The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

Wendy K. Tam

arxiv: 2606.09735 · v1 · pith:WHGWTFRFnew · submitted 2026-06-08 · 💻 cs.CL

Pith reviewed 2026-06-27 16:21 UTC · model grok-4.3

classification 💻 cs.CL

keywords: RLHF, alignment, partisan bias, sparse autoencoders, language models, mechanistic interpretability, political neutrality

The pith

RLHF aligns language models to neutrality by compressing partisan signals rather than erasing the underlying structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares internal representations in Llama 3.1 8B before and after RLHF to examine how alignment training handles partisan political orientation. It shows that the base model's structured partisan direction persists after training, but RLHF reduces its variance to produce consistently neutral outputs. Sparse autoencoder decomposition identifies policy-encoding features that activate in the base model but become inactive post-RLHF. Feature-level steering experiments establish that this inactivation severs the causal connection between the partisan geometry and generated text. The outcome is functional neutrality achieved through disconnection, not removal, of value-laden internal structure.

Core claim

RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural, so that the underlying geometry that enables partisan steering remains intact.
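The variance-compression claim can be made concrete with a small sketch. It projects activation vectors onto a direction and compares the spread of the projections for two models. The function names and the toy two-dimensional vectors are hypothetical and chosen only for illustration; they are not the paper's data or code.

```python
import statistics

def project(activation, direction):
    """Scalar projection of an activation vector onto the unit-normalized direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm

def variance_ratio(base_acts, instruct_acts, direction):
    """Variance of Instruct projections divided by variance of base projections."""
    base_proj = [project(a, direction) for a in base_acts]
    inst_proj = [project(a, direction) for a in instruct_acts]
    return statistics.pvariance(inst_proj) / statistics.pvariance(base_proj)

# Toy activations (assumed values, not taken from the paper).
direction = [1.0, 0.0]
base_acts = [[-0.5, 0.3], [0.2, -0.1], [1.2, 0.4], [0.6, 0.0]]
instruct_acts = [[0.15, 0.3], [0.17, -0.1], [0.19, 0.4], [0.17, 0.0]]

ratio = variance_ratio(base_acts, instruct_acts, direction)
print(f"variance ratio (Instruct / base): {ratio:.5f}")
```

A ratio far below 1 corresponds to the pattern described here: the direction is still defined in both models, but the projections onto it barely vary after post-training.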

What carries the argument

Sparse autoencoder decomposition of model activations that isolates policy-encoding features and shows their complete inactivation after RLHF, combined with feature steering to test the resulting causal disconnect.
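As a rough illustration of what "inactive" means at the feature level, the sketch below encodes activations with a ReLU sparse-autoencoder encoder and counts how often one feature fires across prompts. The encoder form ReLU(W·x + b), the weights, and the inputs are all assumptions made for this example; the page does not state which SAE architecture or threshold the paper uses.

```python
def sae_encode(x, W_enc, b_enc):
    """Feature activations of a simple ReLU encoder: max(0, W_enc @ x + b_enc)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]

def firing_rate(inputs, W_enc, b_enc, feature_idx, threshold=0.0):
    """Fraction of inputs on which the chosen feature exceeds the threshold."""
    fired = sum(
        1 for x in inputs if sae_encode(x, W_enc, b_enc)[feature_idx] > threshold
    )
    return fired / len(inputs)

# Hypothetical two-feature encoder and toy per-prompt activations.
W_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [-0.5, -0.5]
base_inputs = [[1.0, 0.0], [0.2, 0.1], [0.9, 0.0], [0.1, 0.0]]
instruct_inputs = [[0.17, 0.0], [0.2, 0.1], [0.15, 0.0], [0.1, 0.0]]

base_rate = firing_rate(base_inputs, W_enc, b_enc, feature_idx=0)
instruct_rate = firing_rate(instruct_inputs, W_enc, b_enc, feature_idx=0)
print(base_rate, instruct_rate)
```

In this toy setup the feature fires on half of the base inputs and on none of the Instruct inputs, which mirrors the "sporadic versus completely inactive" contrast in the text.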

If this is right

Mechanisms that bypass RLHF guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation.
The same pattern of disconnecting rather than removing value-laden structure may hold for other value domains.
The aligned model's behavior is more fragile than its neutral outputs suggest because the enabling geometry stays intact.
Current alignment leaves open the possibility of targeted reactivation without retraining the model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This mechanism could explain why certain jailbreak techniques succeed across different safety domains by reactivating latent structures.
Deeper alignment might require methods that modify representations directly rather than only routing around them.
The preserved geometry suggests that monitoring internal features during deployment could detect potential reactivation before outputs appear.
Similar compression effects might appear in non-political domains such as factual consistency or harm avoidance.

Load-bearing premise

The partisan direction and policy-encoding features identified via sparse autoencoder decomposition accurately represent and causally influence partisan generation in the model.

What would settle it

An experiment in which steering the same policy-encoding features in the Instruct model produces no partisan shift in output would falsify the claim that their inactivation is what prevents partisan generation.
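A steering test of this kind can be illustrated with a toy sketch. It assumes steering means adding α times a feature's decoder direction to a hidden state, which is a common form of activation steering but is not confirmed by this page as the paper's exact procedure. The value α = 6 is taken from the Figure 3 caption; the two readout functions are hypothetical stand-ins for a model whose output depends on the steered direction and one whose output does not.

```python
def steer(hidden, decoder_direction, alpha):
    """Add alpha times the feature's decoder direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, decoder_direction)]

def steering_shift(hidden, decoder_direction, alpha, score):
    """Change in an output score caused by the steering intervention."""
    return score(steer(hidden, decoder_direction, alpha)) - score(hidden)

# Hypothetical readouts: one reads the steered coordinate, one ignores it.
def connected_score(h):
    return h[0]

def disconnected_score(h):
    return h[1]

hidden = [0.25, 0.5]
direction = [1.0, 0.0]
alpha = 6.0

shift_connected = steering_shift(hidden, direction, alpha, connected_score)
shift_disconnected = steering_shift(hidden, direction, alpha, disconnected_score)
print(shift_connected, shift_disconnected)
```

The comparison of interest is the size of the shift under the same intervention in each model, not the absolute score.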

Figures

Figures reproduced from arXiv: 2606.09735 by Wendy K. Tam.

**Figure 1.** Layer 18 projections onto ω̂ for 84 prompts under the base model (circles) and the Instruct model (diamonds). The base model’s projections span from −0.5 to 1.253; RLHF compresses them into a narrow band centered at 0.169

**Figure 2.** Example outputs from the base model (before RLHF) and Instruct model (after RLHF). The base model

**Figure 3.** Feature-level steering at α = 6: base vs. Instruct model. On politically resonant prompts, the base model’s output shifts to match the steered direction while the Instruct model produces balanced text regardless of perturbation

Original abstract

The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with “human values.” Yet the process is opaque. What values are being encoded; whose values are they; and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation with a comparison of the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural, so that the underlying geometry that enables partisan steering remains intact. The mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that RLHF on Llama 3.1 8B leaves partisan geometry intact but compresses its variance and inactivates SAE-identified policy features, yet the base-vs-Instruct comparison does not isolate RLHF from SFT.

The main point here is that RLHF appears to produce functional neutrality on partisan outputs by shrinking signal variance and shutting down certain SAE-identified features rather than erasing the underlying structure. Steering experiments then show the causal link from that geometry to generation is broken in the Instruct model but can be restored by bypassing the guardrails.

What stands out as new is the concrete SAE decomposition on a real deployed model, identifying policy-encoding features that fire sporadically in the base version and stay off after alignment, paired with steering tests that tie the change to output behavior. That gives a mechanistic handle on the shallow-alignment claim for value-laden content.

The work is straightforward in its setup and the before-after contrast on Llama 3.1 8B is a reasonable starting point. The steering results add some causal evidence that the feature inactivity matters.

The clearest limitation is that the Instruct checkpoint bundles SFT with RLHF, so differences in feature activation or steering strength could trace to the supervised stage, data mixture, or optimization path instead of the reward model. No SFT-only ablation is described, which leaves the specific attribution to RLHF under-supported. The abstract also gives limited detail on how the partisan direction and SAE features were validated or on statistical checks, so the strength of the causal claims is hard to judge without those numbers.

This is worth sending to referees. It should interest people in interpretability and alignment who care about whether safety training removes or merely masks internal structure. The question is live and the method is a direct attempt to test it, even if the RLHF isolation needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that RLHF on Llama 3.1 8B compresses the variance of a partisan direction present in the base model rather than removing it, inactivates policy-encoding features found via sparse autoencoder decomposition, and severs causal pathways from that geometry to generation, producing functional but not structural neutrality that can be bypassed by prompts inferring user identity.

Significance. If the mechanistic findings hold after proper controls, the work would strengthen evidence that current alignment techniques achieve only shallow compliance, with implications for the fragility of RLHF guardrails across value domains and for mechanistic interpretability approaches using SAEs and steering.

major comments (2)

[Abstract and §3] Comparison of base vs. Instruct: the central attribution of variance compression, feature inactivation, and severed causal pathways specifically to RLHF is undermined because the Instruct model is the result of SFT followed by RLHF (and other post-training); no SFT-only checkpoint ablation is reported, so differences cannot be isolated from supervised fine-tuning, optimization trajectory, or embedding shifts.
[§4] SAE decomposition and feature steering: the claim that policy-encoding features are 'completely inactive' in the Instruct model and that steering confirms a causal disconnect requires explicit verification that the selected features are causally sufficient for partisan output in the base model, including quantitative activation statistics, reconstruction error, and controls for confounding directions; without these, the feature-level story remains correlational.

minor comments (2)

[§2] Clarify the exact Llama 3.1 8B checkpoints used (e.g., base vs. the precise Instruct variant) and any differences in tokenizer or embedding layers.
[§4] Add error bars, p-values, or effect sizes for the reported activation differences and steering results to allow assessment of statistical robustness.
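One way the requested robustness statistics could be computed is sketched below: a pooled-standard-deviation effect size (Cohen's d) and a percentile bootstrap confidence interval for a difference in mean feature activation. The activation values are invented for the example, and the choice of these two statistics is an assumption rather than anything the paper or the referee specifies.

```python
import random
import statistics

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

def bootstrap_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.mean(resampled_a) - statistics.mean(resampled_b))
    diffs.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return diffs[lo_idx], diffs[hi_idx]

# Toy feature activations per prompt (assumed values, not the paper's).
base = [0.0, 0.4, 0.0, 0.9, 0.0, 0.6, 0.0, 0.3]
instruct = [0.0] * 8

d = cohens_d(base, instruct)
lo, hi = bootstrap_ci(base, instruct)
print(f"Cohen's d = {d:.2f}, 95% CI for mean difference = [{lo:.3f}, {hi:.3f}]")
```

With only a handful of prompts the interval will be wide, which is itself useful information for a reader judging the strength of the reported differences.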

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. These points help clarify the scope of our claims about RLHF and strengthen the evidential basis for the SAE findings. We respond to each major comment below, indicating planned revisions where appropriate.

Referee: [Abstract and §3] Comparison of base vs. Instruct: the central attribution of variance compression, feature inactivation, and severed causal pathways specifically to RLHF is undermined because the Instruct model is the result of SFT followed by RLHF (and other post-training); no SFT-only checkpoint ablation is reported, so differences cannot be isolated from supervised fine-tuning, optimization trajectory, or embedding shifts.

Authors: We agree that the Llama 3.1 8B Instruct model results from a multi-stage post-training process that includes SFT prior to RLHF, and our experiments compare the base model directly to this final Instruct checkpoint. The manuscript attributes the observed compression of partisan variance, feature inactivation, and severed causal pathways to the alignment process, with emphasis on RLHF as the stage responsible for value-based behavioral shaping. However, absent an SFT-only checkpoint, we cannot isolate RLHF's specific contribution from SFT effects or other optimization factors. In the revised manuscript we will update the abstract and §3 to describe the findings as effects of the full post-training pipeline (SFT + RLHF), while retaining the mechanistic focus on how alignment severs partisan pathways. We will also add an explicit limitations paragraph noting the absence of an SFT ablation. Intermediate checkpoints are not publicly released for Llama 3.1, so a controlled ablation experiment cannot be performed within the scope of this work. revision: yes
Referee: [§4] SAE decomposition and feature steering: the claim that policy-encoding features are 'completely inactive' in the Instruct model and that steering confirms a causal disconnect requires explicit verification that the selected features are causally sufficient for partisan output in the base model, including quantitative activation statistics, reconstruction error, and controls for confounding directions; without these, the feature-level story remains correlational.

Authors: The manuscript already reports that the identified policy-encoding features activate sporadically in the base model and are inactive after post-training, with steering vectors derived from these features producing partisan outputs only in the base model. To address the request for stronger causal evidence, the revised §4 will add: (i) quantitative activation statistics (mean, maximum, and sparsity measures) for the selected features on partisan and neutral prompts; (ii) SAE reconstruction error for the relevant latents; and (iii) control analyses comparing steering effects against random features and against other high-variance directions identified by the SAE. These additions will demonstrate that the chosen features are causally sufficient for partisan generation in the base model and that their inactivation in the Instruct model accounts for the observed causal disconnect. revision: yes
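The random-feature control mentioned in the response could take roughly the following shape: compare the steering effect of the selected direction with the effects of many random unit directions and report an empirical p-value. The linear readout, the dimensionality, and the number of control draws are hypothetical choices for this sketch; the page does not describe how the authors plan to run the control.

```python
import random

def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def steering_effect(direction, readout, alpha):
    """Change in a linear readout when alpha * direction is added to the hidden state."""
    return alpha * sum(r * d for r, d in zip(readout, direction))

def empirical_p_value(selected_effect, control_effects):
    """Share of control effects at least as large in magnitude, with +1 smoothing."""
    exceed = sum(1 for e in control_effects if abs(e) >= abs(selected_effect))
    return (exceed + 1) / (len(control_effects) + 1)

rng = random.Random(0)
dim = 16
alpha = 6.0
readout = [1.0] + [0.0] * (dim - 1)   # hypothetical output-relevant direction
selected = readout[:]                 # toy "policy feature" aligned with the readout

selected_effect = steering_effect(selected, readout, alpha)
control_effects = [
    steering_effect(random_unit_vector(dim, rng), readout, alpha)
    for _ in range(200)
]
p_value = empirical_p_value(selected_effect, control_effects)
print(selected_effect, p_value)
```

A small p-value here only says the selected direction moves the readout more than random directions do; it does not by itself establish causal sufficiency for generated text.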

Circularity Check

0 steps flagged

No circularity in the derivation chain

The paper performs direct empirical comparisons of internal representations between the base Llama 3.1 8B and its Instruct variant using sparse autoencoder decomposition and feature steering. No load-bearing steps reduce by construction to fitted inputs, self-citations, or renamed known results; the central claims rest on observable differences in activation patterns and causal interventions rather than any definitional equivalence or imported uniqueness theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not provide sufficient detail to identify any free parameters, axioms, or invented entities used in the analysis.
