The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
Pith reviewed 2026-06-27 16:21 UTC · model grok-4.3
The pith
RLHF aligns language models to neutrality by compressing partisan signals rather than erasing the underlying structure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural so that the underlying geometry t
What carries the argument
Sparse autoencoder decomposition of model activations that isolates policy-encoding features and shows their complete inactivation after RLHF, combined with feature steering to test the resulting causal disconnect.
If this is right
- Mechanisms that bypass RLHF guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation.
- The same pattern of disconnecting rather than removing value-laden structure may hold for other value domains.
- The aligned model's behavior is more fragile than its neutral outputs suggest because the enabling geometry stays intact.
- Current alignment leaves open the possibility of targeted reactivation without retraining the model.
Where Pith is reading between the lines
- This mechanism could explain why certain jailbreak techniques succeed across different safety domains by reactivating latent structures.
- Deeper alignment might require methods that modify representations directly rather than only routing around them.
- The preserved geometry suggests that monitoring internal features during deployment could detect potential reactivation before outputs appear.
- Similar compression effects might appear in non-political domains such as factual consistency or harm avoidance.
Load-bearing premise
The partisan direction and policy-encoding features identified via sparse autoencoder decomposition accurately represent and causally influence partisan generation in the model.
What would settle it
An experiment in which steering the same policy-encoding features in the Instruct model produces no partisan shift in output would falsify the claim that their inactivation is what prevents partisan generation.
Figures
read the original abstract
The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with ``human values.'' Yet the process is opaque. What values are being encoded; whose values are they; and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation with a comparison of the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural so that the underlying geometry that enables partisan steering remains intact. The mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RLHF on Llama 3.1 8B compresses the variance of a partisan direction present in the base model rather than removing it, inactivates policy-encoding features found via sparse autoencoder decomposition, and severs causal pathways from that geometry to generation, producing functional but not structural neutrality that can be bypassed by prompts inferring user identity.
Significance. If the mechanistic findings hold after proper controls, the work would strengthen evidence that current alignment techniques achieve only shallow compliance, with implications for the fragility of RLHF guardrails across value domains and for mechanistic interpretability approaches using SAEs and steering.
major comments (2)
- [Abstract and §3] Abstract and §3 (comparison of base vs. Instruct): the central attribution of variance compression, feature inactivation, and severed causal pathways specifically to RLHF is undermined because the Instruct model is the result of SFT followed by RLHF (and other post-training); no SFT-only checkpoint ablation is reported, so differences cannot be isolated from supervised fine-tuning, optimization trajectory, or embedding shifts.
- [§4] §4 (SAE decomposition and feature steering): the claim that policy-encoding features are 'completely inactive' in the Instruct model and that steering confirms a causal disconnect requires explicit verification that the selected features are causally sufficient for partisan output in the base model, including quantitative activation statistics, reconstruction error, and controls for confounding directions; without these, the feature-level story remains correlational.
minor comments (2)
- [§2] Clarify the exact Llama 3.1 8B checkpoints used (e.g., base vs. the precise Instruct variant) and any differences in tokenizer or embedding layers.
- [§4] Add error bars, p-values, or effect sizes for the reported activation differences and steering results to allow assessment of statistical robustness.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. These points help clarify the scope of our claims about RLHF and strengthen the evidential basis for the SAE findings. We respond to each major comment below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (comparison of base vs. Instruct): the central attribution of variance compression, feature inactivation, and severed causal pathways specifically to RLHF is undermined because the Instruct model is the result of SFT followed by RLHF (and other post-training); no SFT-only checkpoint ablation is reported, so differences cannot be isolated from supervised fine-tuning, optimization trajectory, or embedding shifts.
Authors: We agree that the Llama 3.1 8B Instruct model results from a multi-stage post-training process that includes SFT prior to RLHF, and our experiments compare the base model directly to this final Instruct checkpoint. The manuscript attributes the observed compression of partisan variance, feature inactivation, and severed causal pathways to the alignment process, with emphasis on RLHF as the stage responsible for value-based behavioral shaping. However, absent an SFT-only checkpoint, we cannot isolate RLHF's specific contribution from SFT effects or other optimization factors. In the revised manuscript we will update the abstract and §3 to describe the findings as effects of the full post-training pipeline (SFT + RLHF), while retaining the mechanistic focus on how alignment severs partisan pathways. We will also add an explicit limitations paragraph noting the absence of an SFT ablation. Intermediate checkpoints are not publicly released for Llama 3.1, so a controlled ablation experiment cannot be performed within the scope of this work. revision: yes
-
Referee: [§4] §4 (SAE decomposition and feature steering): the claim that policy-encoding features are 'completely inactive' in the Instruct model and that steering confirms a causal disconnect requires explicit verification that the selected features are causally sufficient for partisan output in the base model, including quantitative activation statistics, reconstruction error, and controls for confounding directions; without these, the feature-level story remains correlational.
Authors: The manuscript already reports that the identified policy-encoding features activate sporadically in the base model and are inactive after post-training, with steering vectors derived from these features producing partisan outputs only in the base model. To address the request for stronger causal evidence, the revised §4 will add: (i) quantitative activation statistics (mean, maximum, and sparsity measures) for the selected features on partisan and neutral prompts; (ii) SAE reconstruction error for the relevant latents; and (iii) control analyses comparing steering effects against random features and against other high-variance directions identified by the SAE. These additions will demonstrate that the chosen features are causally sufficient for partisan generation in the base model and that their inactivation in the Instruct model accounts for the observed causal disconnect. revision: yes
Circularity Check
No circularity in the derivation chain
full rationale
The paper performs direct empirical comparisons of internal representations between the base Llama 3.1 8B and its Instruct variant using sparse autoencoder decomposition and feature steering. No load-bearing steps reduce by construction to fitted inputs, self-citations, or renamed known results; the central claims rest on observable differences in activation patterns and causal interventions rather than any definitional equivalence or imported uniqueness theorem.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Understanding intermediate layers using linear classifier probes
Alain, Guillaume and Yoshua Bengio. 2018. “Understanding intermediate layers using linear classifier probes.”. URL:https://arxiv.org/abs/1610.01644
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[2]
Out of One, Many: Using Language Models to Simulate Human Samples
Argyle, Lisa P ., Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting and David Wingate. 2023. “Out of One, Many: Using Language Models to Simulate Human Samples.”Political Analysis31(3):337–351
2023
-
[3]
Probing Classifiers: Promises, Shortcomings, and Advances
Belinkov, Yonatan. 2022. “Probing Classifiers: Promises, Shortcomings, and Advances.”Com- putational Linguistics48(1):207–219
2022
-
[4]
Analysis Methods in Neural Language Processing: A Survey
Belinkov, Yonatan and James Glass. 2019. “Analysis Methods in Neural Language Processing: A Survey.”Transactions of the Association for Computational Linguistics7:49–72
2019
-
[5]
Bricken, Trenton, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan and Chris Olah
-
[6]
Towards Monosemanticity: Decomposing Language Models With Dictionary Learn- ing
“Towards Monosemanticity: Decomposing Language Models With Dictionary Learn- ing.”Anthropic. URL:https://transformer-circuits.pub/2023/monosemantic-features
2023
-
[7]
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Casper, Stephen, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jaco...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Cunningham, Hoagy, Aidan Ewart, Logan Riggs, Robert Huben and Lee Sharkey. 2023. “Sparse Autoencoders Find Highly Interpretable Features in Language Models.”. URL:https://arxiv.org/abs/2309.08600
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Feng, Shangbin, Chan Young Park, Yuhan Liu and Yulia Tsvetkov. 2023. From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Lead- ing to Unfair NLP Models. InProceedings of the 61st Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers). pp. 11737–11762
2023
-
[10]
Scaling and evaluating sparse autoencoders
Gao, Leo, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike and Jeffrey Wu. 2024. “Scaling and evaluating sparse autoencoders.”. URL:https://arxiv.org/abs/2406.04093
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Hartmann, Jochen, Jasper Schwenzow and Maximilian Witte. 2023. “The political ideology of conversational AI: Converging evidence on ChatGPT’s pro-environmental, left-libertarian orientation.”. URL:https://arxiv.org/abs/2301.01768 REFERENCES 17
-
[12]
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Hubinger, Evan, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Torr, Amartya Sanyal and Puneet K
Jain, Samyak, Ekdeep Singh Lubana, Kemal Oksuz, Tom Joy, Philip H.S. Torr, Amartya Sanyal and Puneet K. Dokania. 2024. What Makes and Breaks Safety Fine-tuning? A Mechanistic Study. InAdvances in Neural Information Processing Systems. Vol. 37
2024
-
[14]
Kummerfeld and Rada Mihalcea
Lee, Andrew, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld and Rada Mihalcea. 2024. A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. InProceedings of the 41st International Conference on Machine Learning, ed. Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonath...
2024
-
[15]
More Human than Hu- man: Measuring ChatGPT Political Bias
Motoki, Fabio, Valdemar Pinho Neto and Victor Rodrigues. 2024. “More Human than Hu- man: Measuring ChatGPT Political Bias.”Public Choice198:3–23
2024
-
[16]
Assessing political bias and value misalignment in generative artificial intelligence
Motoki, Fabio Y.S., Valerie Pinho Neto and Victor Rodrigues. 2025. “Assessing political bias and value misalignment in generative artificial intelligence.”Journal of Economic Behavior & Organization234:106904
2025
-
[17]
Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike and Ryan Lowe. 2022. Training language models to follow instructions with human fee...
2022
-
[18]
Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer and Jared Kaplan
Perez, Ethan, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tyre, Ethan Jost, Evan Hub- inger, Faisal La...
2023
-
[19]
Qi, Xiangyu, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal and Peter Henderson. 2025. Safety Alignment Should Be Made More Than Just a Few Tokens Deep. InInternational Conference on Learning Representations
2025
-
[20]
Santurkar, Shibani, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang and Tatsunori Hashimoto. 2023. Whose Opinions Do Language Models Reflect? InProceedings of the 40th In- ternational Conference on Machine Learning, ed. Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato and Jonathan Scarlett. Vol. 202 ofProceedings of Ma- c...
2023
-
[21]
Sharma, Mrinank, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Mi- randa Zhang and Ethan Perez. 2024. Towards Understanding Sycophancy in Language Mod- els. InInternationa...
2024
-
[22]
The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model
Tam, Wendy K. 2026. “The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model.”
2026
-
[23]
How Elon Musk Remade Grok in His Image
Thompson, Stuart A. 2025. “How Elon Musk Remade Grok in His Image.”The New York Times
2025
-
[24]
Steering Language Models With Activation Engineering
Turner, Alexander Matt, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini and Monte MacDiarmid. 2024. “Steering Language Models With Activation Engineer- ing.”. URL:https://arxiv.org/abs/2308.10248
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Wolf, Yotam, Noam Wies, Oshri Avnery, Yoav Levine and Amnon Shashua. 2024. Fundamen- tal Limitations of Alignment in Large Language Models. InProceedings of the 41st International Conference on Machine Learning. Vol. 235 pp. 53079–53112
2024
-
[26]
Representation Engineering: A Top-Down Approach to AI Transparency
Zou, Andy, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter and Dan Hendrycks. 2025. “Representation Engineering: A Top- Down Appro...
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.