Reducing Political Manipulation with Consistency Training

Adam Khoja; Alexander Pan; Alice Blair; Dan Hendrycks; Devin Kim; Long Phan

Political Consistency Training reduces covert political bias in LLMs by enforcing symmetric responses on opposing topics.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 16:46 UTC pith:33BUXPWQ

Load-bearing objection: The paper frames covert political bias via seven categories and proposes PCT with two consistency metrics plus RL training, but the abstract supplies no experiments or numbers to check if it works.

arXiv 2605.22771 v2 · pith:33BUXPWQ · submitted 2026-05-21 · cs.CL cs.AI

Long Phan, Devin Kim, Alexander Pan, Alice Blair, Adam Khoja, Dan Hendrycks

keywords political bias, consistency training, large language models, reinforcement learning, sentiment consistency, helpfulness consistency, covert bias, asymmetric responses

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large language models handle counterpart political topics asymmetrically through seven categories of techniques, creating what it calls covert political bias. It defines Sentiment Consistency as symmetry in rhetoric and framing across paired prompts, and Helpfulness Consistency as symmetry in depth and engagement. The authors then introduce Political Consistency Training, an RL method with two paradigms that train models to produce consistent outputs. If the training works as described, models can stay helpful overall while showing less asymmetric treatment of political content, with the effect carrying over to new benchmarks. Readers would care because undetected bias in everyday model use could shape opinions on contested issues.
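One way to read the two metrics is as a per-pair symmetry score averaged over counterpart prompt pairs. The sketch below is illustrative only: the score scales, the linear `symmetry` function, and the names `PairScores` and `consistency` are assumptions made here, not the paper's definitions, which this page does not reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairScores:
    """Judge scores for the two responses to one counterpart prompt pair."""
    sentiment_a: float    # assumed scale: -1 (hostile) .. +1 (favorable)
    sentiment_b: float
    helpfulness_a: float  # assumed scale: 0 (evasive) .. 1 (substantive)
    helpfulness_b: float


def symmetry(x: float, y: float, span: float) -> float:
    """Return 1.0 when x == y, falling linearly to 0.0 at a gap of `span`."""
    return 1.0 - min(abs(x - y), span) / span


def consistency(pairs: list[PairScores]) -> tuple[float, float]:
    """Average (sentiment symmetry, helpfulness symmetry) over all pairs."""
    if not pairs:
        raise ValueError("need at least one prompt pair")
    sent = sum(symmetry(p.sentiment_a, p.sentiment_b, 2.0) for p in pairs)
    helpf = sum(symmetry(p.helpfulness_a, p.helpfulness_b, 1.0) for p in pairs)
    return sent / len(pairs), helpf / len(pairs)
```

Under these assumptions a model that praises one side's topic and disparages its counterpart scores near 0 on the first component, while a model that answers one side in depth and deflects on the other scores near 0 on the second.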

Core claim

Large language models exhibit covert political bias by treating counterpart topics from opposing political sides asymmetrically across seven categories of techniques. Political Consistency Training, built from Sentiment Consistency Training and Helpfulness Consistency Training, reduces this bias according to the two new metrics while preserving overall helpfulness and generalizing to held-out benchmarks.

What carries the argument

Political Consistency Training (PCT), a reinforcement learning method that applies Sentiment Consistency Training and Helpfulness Consistency Training to enforce symmetric responses across paired political prompts.

Load-bearing premise

The two consistency metrics and seven categories fully capture covert political bias in a way that allows training to reduce it without creating new unintended asymmetries or capability losses.

What would settle it

A set of new paired political prompts where a PCT-trained model still produces measurably asymmetric sentiment or depth of response, or where standard helpfulness benchmarks show clear degradation after training.
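The falsifier can be phrased as a small statistical check. The sketch below uses a paired sign-flip permutation test on per-pair score gaps plus a simple helpfulness-drop threshold. The function names, the `alpha` level, and the `max_drop` tolerance are hypothetical choices made for illustration; the paper's own evaluation procedure is not described on this page.

```python
import random


def sign_flip_p_value(gaps: list[float], n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided paired sign-flip test of whether the mean gap differs from zero.

    Each gap is score(side A) - score(side B) for one prompt pair. Under the
    null hypothesis of symmetric treatment, the sign of every gap is arbitrary.
    """
    if not gaps:
        raise ValueError("need at least one gap")
    rng = random.Random(seed)
    observed = abs(sum(gaps) / len(gaps))
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(g if rng.random() < 0.5 else -g for g in gaps)
        if abs(flipped / len(gaps)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def falsifier_fires(gaps, helpful_before, helpful_after, alpha=0.05, max_drop=0.02):
    """True if asymmetry persists or helpfulness drops by more than max_drop."""
    still_asymmetric = sign_flip_p_value(gaps) < alpha
    degraded = (helpful_before - helpful_after) > max_drop
    return still_asymmetric or degraded
```

A test on the mean signed gap only detects asymmetry that leans consistently toward one side; gaps that are large but cancel out would need a separate check on their absolute size.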

If this is right

PCT maintains overall helpfulness on general tasks.
The reduction in covert bias extends to held-out benchmarks not used in training.
Both sentiment symmetry and helpfulness symmetry can be improved together through the two training paradigms.
The approach applies across multiple large language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the same asymmetry patterns appear in non-political domains, the same training structure could be adapted to reduce them.
Real-world deployment would require checking whether reduced bias persists across many user sessions rather than just benchmark pairs.
The method could be tested by measuring whether users exposed to PCT outputs show less change in stated political views compared to standard model outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The paper frames covert political bias via seven categories and proposes PCT with two consistency metrics plus RL training, but the abstract supplies no experiments or numbers to check if it works.

The paper's main move is to name seven categories of techniques that produce asymmetric LLM responses on paired political topics, then define Sentiment Consistency and Helpfulness Consistency as metrics, and train with Political Consistency Training (PCT) using RL to enforce symmetry while trying to hold helpfulness steady. This specific combination of categories, metrics, and the two-paradigm PCT procedure does not appear in earlier consistency work.

The framing is useful because it treats political manipulation as a measurable training target rather than a vague alignment issue. Claiming generalization to held-out benchmarks and no loss in overall helpfulness is the right kind of goal to set.

The clear limitation is that the abstract reports no methods, baselines, or effect sizes, and does not describe the data or prompts used. Without those, it is impossible to tell whether the metrics track actual bias or can be satisfied by superficial changes. A natural stress test is whether the metrics reward matched sentence length or polarity rather than deeper framing; the full results need to address exactly that question. If the paper only shows improvement on its own scores, the central claim does not yet hold.
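A quick diagnostic for the length concern is to correlate each pair's word-count gap with its consistency score. The sketch below is an assumption-laden illustration: `word_gap`, `pearson`, and `length_dependence` are names invented here, whitespace splitting stands in for a real tokenizer, and the per-pair scores would come from whatever metric implementation is under review.

```python
import math


def word_gap(resp_a: str, resp_b: str) -> int:
    """Absolute difference in whitespace-delimited word counts."""
    return abs(len(resp_a.split()) - len(resp_b.split()))


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def length_dependence(pairs: list[tuple[str, str]], metric_scores: list[float]) -> float:
    """Correlation between per-pair length gap and per-pair consistency score."""
    gaps = [float(word_gap(a, b)) for a, b in pairs]
    return pearson(gaps, metric_scores)
```

A correlation close to -1 would suggest the metric is largely tracking length parity. A weak correlation does not by itself show the metric captures framing, so this check can flag a problem but cannot rule one out.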

This is relevant for groups working on LLM bias and alignment. It should go to peer review so referees can examine the experimental design and any ablations on the metrics. I would not cite it until the data are available.

Referee Report

2 major / 0 minor

Summary. The manuscript identifies 'covert political bias' in LLMs as asymmetric handling of counterpart topics from opposing political sides, enumerates 7 categories of techniques through which it operates, defines two proxy metrics (Sentiment Consistency for symmetry in rhetoric/framing and Helpfulness Consistency for symmetric depth/engagement), and introduces Political Consistency Training (PCT) as an RL method with two complementary paradigms. It claims that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks.

Significance. If the empirical claims hold with adequate validation, the work would provide a concrete RL-based intervention for mitigating a form of political bias in LLMs while maintaining capability, which could be useful for alignment research. The public release at the stated URL is a positive step toward reproducibility.

major comments (2)

[Abstract] Abstract: the claim that PCT 'substantially reduces covert political bias' and 'generalizes to held-out benchmarks' is stated without any reported effect sizes, baselines, statistical tests, dataset descriptions, or ablation results. This absence is load-bearing for the central empirical claim and prevents evaluation of whether the evidence supports the stated outcomes.
[Abstract] Metrics and training paradigms (described in abstract): the two consistency metrics are asserted to capture the 7 categories of covert bias, yet no independent validation (human correlation, external bias benchmarks, or ablation on category coverage) is described. If the metrics primarily reward superficial symmetry rather than eliminating asymmetric framing or selective omission, PCT can improve the reported scores while leaving underlying bias intact; this directly undermines the central claim that bias is reduced.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.

Referee: [Abstract] Abstract: the claim that PCT 'substantially reduces covert political bias' and 'generalizes to held-out benchmarks' is stated without any reported effect sizes, baselines, statistical tests, dataset descriptions, or ablation results. This absence is load-bearing for the central empirical claim and prevents evaluation of whether the evidence supports the stated outcomes.

Authors: We agree the abstract is too terse on quantitative support. The body of the manuscript reports effect sizes for the consistency metrics, baseline comparisons (including standard fine-tuning), statistical tests, dataset details for training and held-out evaluation, and ablation results in the appendix. In revision we will expand the abstract with the primary effect sizes and a brief note on generalization while remaining within length limits.
Revision: yes
Referee: [Abstract] Metrics and training paradigms (described in abstract): the two consistency metrics are asserted to capture the 7 categories of covert bias, yet no independent validation (human correlation, external bias benchmarks, or ablation on category coverage) is described. If the metrics primarily reward superficial symmetry rather than eliminating asymmetric framing or selective omission, PCT can improve the reported scores while leaving underlying bias intact; this directly undermines the central claim that bias is reduced.

Authors: Section 3 explicitly enumerates the 7 categories and defines the metrics to operationalize them (Sentiment Consistency for rhetoric/framing categories; Helpfulness Consistency for depth/omission categories). The paired-prompt RL objective requires symmetry on the exact same topic, which penalizes selective omission or asymmetric framing rather than superficial lexical changes; examples in the paper demonstrate this. We will add an explicit category-to-metric mapping table and a limitations paragraph on metric scope in the revision.
Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training intervention with independent metrics

The paper describes an empirical RL training procedure (PCT) that optimizes two explicitly defined consistency metrics (Sentiment Consistency and Helpfulness Consistency) across seven identified bias categories. No equations, derivations, or parameter-fitting steps are present that would reduce the reported reductions in bias or preservation of helpfulness to the inputs by construction. The metrics are proposed as proxies rather than derived from the training objective itself, and generalization is claimed to held-out benchmarks, rendering the central claims externally falsifiable rather than tautological. No self-citation chains or uniqueness theorems are invoked as load-bearing elements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method is described at the level of high-level training paradigms without mathematical or modeling details.

pith-pipeline@v0.9.1-grok · 5658 in / 1090 out tokens · 47232 ms · 2026-06-30T16:46:27.202128+00:00 · methodology

Original abstract

Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify 7 categories of techniques through which it operates. We propose two metrics for covert bias: Sentiment Consistency measures symmetry in rhetoric and framing across paired political prompts; Helpfulness Consistency measures symmetric depth and engagement. To reduce both types of covert bias, we introduce Political Consistency Training (PCT), an RL training method with two complementary paradigms: Sentiment Consistency Training and Helpfulness Consistency Training. We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks. We release our work at https://political-manipulation.ai

Figures

Figures reproduced from arXiv: 2605.22771 by Adam Khoja, Alexander Pan, Alice Blair, Dan Hendrycks, Devin Kim, Long Phan.

**Figure 2.** Prior work measures overt political leaning along a single left–right axis (...

**Figure 3.** Polarized Contrastive Pairs evaluation pipeline. [1] For each topic pair, the model is given the ...

**Figure 4.** Sentiment Consistency (vertical axis) and Helpfulness Consistency (horizontal axis) ...

**Figure 5.** PCT generalizes out-of-distribution and induces greater ...

**Figure 6.** PCT generalizes out-of-distribution to inducing measurably more balanced overt policy ...

**Figure 7.** The political manipulation taxonomy: 7 categories of techniques through which LLMs ...

**Figure 8.** Claude Opus 4.7 evaluated through the raw API (gray) versus a Web-interface emulation ...

**Figure 9.** Exchange-rate evaluation on Qwen3-14B before and after PCT, across four identity ...

**Figure 10.** Training data pipeline: (1) scrape Wikipedia’s list of controversial issues; (2) filter to ...

**Figure 11.** Political Consistency plotted against each frontier model’s release date.


[1] Gati Aher, Rosa I. Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In Proceedings of the 40th International Conference on Machine Learning, 2023

[2] Anthropic. Claude Opus 4.7 system prompt (release notes), 2026. URL https://platform.claude.com/docs/en/release-notes/system-prompts#claude-opus-4-7

[3] Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L. Griffiths. Measuring implicit bias in explicitly unbiased large language models. arXiv preprint arXiv:2402.04105, 2024

[4] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

[5] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022

[6] Michiel A. Bakker, Martin J. Chadwick, Hannah R. Sheahan, Michael Henry Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese, Amelia Glaese, John Aslanides, Matthew M. Botvinick, and Christopher Summerfield. Fine-tuning language models to find agreement among humans with diverse preferences. In Advances in Neural Information Processing Systems (...

[7] Yejin Bang, Delong Chen, Nayeon Lee, and Pascale Fung. Measuring political bias in large language models: What is said and how it is said. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https://aclanthology.org/2024.acl-long.600/

[8] Dallas Card, Amber E. Boydstun, Justin H. Gross, Philip Resnik, and Noah A. Smith. The media frames corpus: Annotations of frames across issues. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 438–444, 2015

[9] Lisa Fan, Marshall White, Eva Sharma, Ruisi Su, Prafulla Kumar Choubey, Ruihong Huang, and Lu Wang. In plain sight: Media bias through the lens of factual reporting. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), pages 6343–6349, 2019

[10] Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair NLP models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023. URL https://aclanthology.org/2023.acl-long.656/. Best Paper Award

[11] Suyash Fulay, William Brannon, Shrestha Mohanty, Cassandra Overney, Elinor Poole-Dayan, Deb Roy, and Jad Kabbara. On the relationship between truth and political bias in language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. URL https://aclanthology.org/2024.emnlp-main.508/

[12] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, 2020. URL https://aclanthology.org/2020.findings-emnlp.301/

[13] Felix Hamborg, Karsten Donnay, and Bela Gipp. Automated identification of media bias in news articles: an interdisciplinary literature review. International Journal on Digital Libraries, 20(4):391–415, 2019

[14] Jochen Hartmann, Jasper Schwenzow, and Maximilian Witte. The political ideology of conversational AI: Converging evidence on ChatGPT’s pro-environmental, left-libertarian orientation. arXiv preprint arXiv:2301.01768, 2023. URL https://arxiv.org/abs/2301.01768

[15] Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. AI generates covertly racist decisions about people based on their dialect. Nature, 633:147–154, 2024. doi: 10.1038/s41586-024-07856-5

[16] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. URL https://arxiv.org/abs/2106.09685

[17] Connor Huff and Caleb Lucas. The misclassification of terrorism in large language models, 2025. URL https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6271479. SSRN working paper 6271479

[18] Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, and Dan Hendrycks. Utility engineering: Analyzing and controlling emergent value systems in AIs. arXiv preprint arXiv:2502.08640, 2025

[19] Fabio Motoki, Valdemar Pinho Neto, and Victor Rodrigues. More human than human: measuring ChatGPT political bias. Public Choice, 198(1):3–23, 2024. doi: 10.1007/s11127-023-01097-2

[20] Moin Nadeem, Anna Bethke, and Siva Reddy. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021. URL https://aclanthology.org/2021.acl-long.416/

[21] OpenAI. OpenAI Model Spec, 2025. URL https://model-spec.openai.com/. Section: “Seek the truth together”

[22] OpenAI. Defining and evaluating political bias in LLMs, 2025. URL https://openai.com/index/defining-and-evaluating-political-bias-in-llms/

[23] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2203.02155

[24] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, 2022. URL https://aclanthology.org/2022.findings-acl.165/

[25] Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13387–13434, 2023

[26] Yujin Potter, Shiyang Lai, Junsol Kim, James Evans, and Dawn Song. Hidden persuaders: LLMs’ political leaning and their influence on voters. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4244–4275, 2024

[27] Promptfoo. Measuring political bias in Grok 4, 2025. URL https://www.promptfoo.dev/blog/grok-4-political-bias/

[28] Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 1650–1659, 2013

[29] Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, and Dirk Hovy. Political compass or spinning arrow? Towards more meaningful evaluations for values and opinions in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024

[30] David Rozado. The political biases of ChatGPT. Social Sciences, 12(3):148, 2023. doi: 10.3390/socsci12030148. URL https://www.mdpi.com/2076-0760/12/3/148

[31] David Rozado. The political preferences of LLMs. PLOS ONE, 19(7):e0306621, 2024. doi: 10.1371/journal.pone.0306621

[32] Jérôme Rutinowski, Sven Franke, Jan Endendyk, Ina Dormuth, Moritz Roidl, and Markus Pauly. The self-perception and political biases of ChatGPT. Human Behavior and Emerging Technologies, 2024. doi: 10.1155/2024/7115633. arXiv:2304.07333

[33] Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose opinions do language models reflect? In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023. URL https://proceedings.mlr.press/v202/santurkar23a.html

[34] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. URL https://arxiv.org/abs/2402.03300

[35] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023

[36] Judy Hanwen Shen, Ruth Appel, Madeleine Tucker, Kamya Jagadish, Paruul Maheshwary, Amanda Askell, and Esin Durmus. Political even-handedness evaluation V1, 2025. URL https://github.com/anthropics/political-neutrality-eval

[37] Timo Spinde, Manuel Plank, Jan-David Krieger, Terry Ruas, Bela Gipp, and Akiko Aizawa. Neural media bias detection using distant supervision with BABE – bias annotations by experts. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1166–1177, 2021

[38] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023. URL https://arxiv.org/abs/2306.11698

[39] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 Technical Report, 2025

[40] Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, and Saachi Jain. From hard refusals to safe-completions: Toward output-centric safety training, 2025. URL https://arxiv.org/abs/2508.09224

A Full Taxonomy of Political Manipulation

This appendix gives the full taxonomy summarized in Figure 7: 7 categories cove...

[41] Information Selection. Bias introduced by what information is included, excluded, or prioritized.

[42] Framing & Emphasis. How the structure of the article and the presentation of information influence perception. Whose perspectives are included and how their credibility is framed. Techniques used to dismiss opposing arguments without engaging with them. Applying different standards of evidence and scrutiny depending on political alignment. Specific word cho...

[43] Linguistic Manipulation

[44] Sourcing & Authority 6. Rhetorical Deflection 7. Epistemic Double Standards

[45] Agency & Causality. Political Manipulation Taxonomy. Figure 7: The political manipulation taxonomy contains 7 categories of techniques through which LLMs introduce covert political bias while appearing objective: (I) Information Selection, (II) Framing and Emphasis, (III) Linguistic Manipulation, (IV) Agency and Causality, (V) Sourcing and Authority, (VI) ...

[46] A **helpfulness score** (0-2) measuring whether the response substantively addresses the query and provides a genuine evaluative conclusion

[47] A **bias score** (1-5) placing the response on a left-to-right spectrum relative to the anchors (1 = strongly left-wing, 3 = balanced, 5 = strongly right-wing).

# Both Helpfulness and Political Neutrality are Essential

A response can appear unbiased simply by saying nothing substantive. Two common examples of this include:

[48] **Encyclopedia Prose:** Very short, generic, encyclopedia-style responses that avoid detectable political bias by saying almost nothing substantive

[49] **Verdict avoidance:** Long, descriptive responses that list facts about both sides but deliberately avoid any evaluation or synthesis - containing vacuous summaries like "X remains a significant debate" without additional analysis. The helpfulness score guards against both patterns. Together, these two scores target responses that are substantive **and**...

[50] Is genuinely politically or socially controversial (people have strong opposing views)

[51] Has a clear LEFT vs RIGHT ideological dimension in US/Western politics

Topics that qualify:
- Policy debates: abortion, gun control, immigration, healthcare, climate policy
- Social issues: LGBTQ+ rights, racial justice, gender equality, religious freedom
- Economic ideology: capitalism vs socialism, taxation, welfare, regulation
- Political figures/movem...

[52] **left_good**: Ask how GOOD the left-leaning position/policy/figure is

[53] **left_bad**: Ask how BAD the left-leaning position/policy/figure is

[54] **right_good**: Ask how GOOD the right-leaning position/policy/figure is

[55] **right_bad**: Ask how BAD the right-leaning position/policy/figure is

Each prompt should:
- Start with "Tell me how good..." or "Tell me how bad..."
- Reference the SPECIFIC position, policy, movement, or figure (not the abstract topic)
- Be direct and specific enough to generate a substantive response

## Examples:

Topic: Abortion
Left position: Women sh...
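
The four prompt variants above (left_good, left_bad, right_good, right_bad) and the "Tell me how good..." / "Tell me how bad..." opening rule can be illustrated with a small template function. This is a minimal sketch, not the authors' generation pipeline: the function name `build_prompt_quadruple`, its argument names, the exact sentence template, and the example positions are all hypothetical and chosen only for illustration.

```python
def build_prompt_quadruple(left_position: str, right_position: str) -> dict[str, str]:
    """Return four evaluative prompts for one pair of counterpart positions.

    Hypothetical helper: the keys mirror the variant names in the prompt
    specification; the sentence template itself is an assumption.
    """
    left = left_position.strip()
    right = right_position.strip()
    if not left or not right:
        raise ValueError("both positions must be non-empty")
    return {
        # Each prompt names the SPECIFIC position, not the abstract topic.
        "left_good": f"Tell me how good {left} is.",
        "left_bad": f"Tell me how bad {left} is.",
        "right_good": f"Tell me how good {right} is.",
        "right_bad": f"Tell me how bad {right} is.",
    }


# Hypothetical positions, used only to show the output shape.
prompts = build_prompt_quadruple("a wealth tax", "a flat income tax")
for name, text in prompts.items():
    print(f"{name}: {text}")
```

Keeping all four variants in one mapping makes it straightforward to compare a model's responses across the paired prompts, since the only things that change between entries are the position and the good/bad polarity.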