Recognition: unknown
A Multi-Dimensional Audit of Politically Aligned Large Language Models
Pith reviewed 2026-05-08 03:34 UTC · model grok-4.3
The pith
Politically aligned LLMs trade fairness for effectiveness and truthfulness as model size grows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying the four-dimensional audit to nine LLMs aligned via fine-tuning or role-playing shows that larger models achieve greater effectiveness in role-playing political ideologies and higher truthfulness, yet they exhibit reduced fairness through increased bias expressed as angry and toxic language toward differing ideologies. Fine-tuned models deliver lower bias and stronger alignment than role-playing approaches but suffer declines in reasoning performance and rises in hallucinations. All tested models display deficiencies in at least one of the four metrics.
What carries the argument
A multi-dimensional audit framework using four dimensions (effectiveness, fairness, truthfulness, and persuasiveness) derived from Habermas' Theory of Communicative Action and measured with automated quantitative metrics.
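To make the framework's shape concrete, here is a minimal sketch of what such an audit loop could look like. It is an illustration only, not the authors' implementation: the `AuditScores` dataclass, the `audit_model` function, and the idea of plugging one automated metric per dimension are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: the paper does not publish code, so all names here are illustrative.
@dataclass
class AuditScores:
    effectiveness: float   # e.g., fidelity to the target ideology's positions
    fairness: float        # e.g., 1 - toxicity/anger toward opposing ideologies
    truthfulness: float    # e.g., accuracy on a factuality benchmark
    persuasiveness: float  # e.g., judged argument strength

def audit_model(
    generate: Callable[[str], str],              # aligned model under test (fine-tuned or role-play prompted)
    prompts: List[str],                           # political prompts used to elicit responses
    metrics: Dict[str, Callable[[str], float]],   # one automated metric per dimension, keyed by dimension name
) -> AuditScores:
    """Run every prompt through the model and average each dimension's metric over the responses."""
    responses = [generate(p) for p in prompts]
    means = {
        name: sum(metric(r) for r in responses) / len(responses)
        for name, metric in metrics.items()
    }
    return AuditScores(**means)
```

In this structure, the paper's comparison of fine-tuning versus role-playing amounts to running the same harness with different `generate` callables for each of the nine models and comparing the resulting score profiles.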
If this is right
- Larger models are more effective at role-playing political ideologies and more truthful in responses but less fair.
- Fine-tuned models show lower bias and more effective alignment than role-playing models.
- Fine-tuned models experience reduced performance on reasoning tasks and increased hallucinations.
- All tested models exhibit deficiencies in at least one of the four audit metrics.
Where Pith is reading between the lines
- The same audit approach could be applied to alignment on other contested topics such as public health or climate policy to check for similar trade-offs.
- Developers may need training objectives that jointly optimize the four dimensions rather than focusing on one at a time.
- Public benchmarks using this framework could inform standards for deploying politically aligned models in campaigns or media.
Load-bearing premise
The four dimensions drawn from Habermas' Theory of Communicative Action form a valid and comprehensive basis for auditing political alignment through automated metrics.
What would settle it
An experiment in which a larger fine-tuned model scores high on all four dimensions simultaneously without elevated toxicity or hallucinations would contradict the reported trade-offs.
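Framed as a check, that counterexample reduces to a simple predicate over a model's audited scores. The thresholds and the `contradicts_reported_tradeoffs` name below are illustrative assumptions; neither the paper nor this review states numeric cutoffs.

```python
# Illustrative thresholds only; the dimension names match the AuditScores sketch above.
DIMENSIONS = ("effectiveness", "fairness", "truthfulness", "persuasiveness")

def contradicts_reported_tradeoffs(scores, toxicity: float, hallucination_rate: float,
                                   dim_floor: float = 0.7,
                                   tox_ceiling: float = 0.1,
                                   hallu_ceiling: float = 0.05) -> bool:
    """True if one model is strong on all four dimensions while keeping toxicity and
    hallucinations low, i.e., the counterexample described above. `scores` is any object
    exposing the four dimension attributes."""
    strong_everywhere = all(getattr(scores, d) >= dim_floor for d in DIMENSIONS)
    return strong_everywhere and toxicity <= tox_ceiling and hallucination_rate <= hallu_ceiling
```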
Original abstract
As the application of Large Language Models (LLMs) spreads across various industries, there are increasing concerns about the potential for their misuse, especially in sensitive areas such as political discourse. Deliberately aligning LLMs with specific political ideologies, through prompt engineering or fine-tuning techniques, can be advantageous in use cases such as political campaigns, but requires careful consideration due to heightened risks of performance degradation, misinformation, or increased biased behavior. In this work, we propose a multi-dimensional framework inspired by Habermas' Theory of Communicative Action to audit politically aligned language models across four dimensions: effectiveness, fairness, truthfulness, and persuasiveness, using automated, quantitative metrics. Applying this to nine popular LLMs aligned via fine-tuning or role-playing revealed consistent trade-offs: while larger models tend to be more effective at role-playing political ideologies and truthful in their responses, they were also less fair, exhibiting higher levels of bias in the form of angry and toxic language towards people of different ideologies. Fine-tuned models exhibited lower bias and more effective alignment than the corresponding role-playing models, but also saw a decline in performance on reasoning tasks and an increase in hallucinations. Overall, all of the models tested exhibited some deficiency in at least one of the four metrics, highlighting the need for more balanced and robust alignment strategies. Ultimately, this work aims to ensure politically-aligned LLMs generate legitimate, harmless arguments, offering a framework to evaluate the responsible political alignment of these models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a multi-dimensional audit framework for politically aligned LLMs, drawing on Habermas' Theory of Communicative Action to evaluate four dimensions—effectiveness, fairness, truthfulness, and persuasiveness—via automated quantitative metrics. It applies the framework to nine popular LLMs aligned either through fine-tuning or role-playing prompts, reporting consistent trade-offs: larger models are more effective at role-playing ideologies and more truthful but less fair (higher bias via angry/toxic language); fine-tuned models show lower bias and stronger alignment than role-play versions but suffer declines in reasoning performance and increased hallucinations. All tested models exhibit deficiencies in at least one dimension, underscoring the need for balanced alignment strategies.
Significance. If the automated metrics are shown to faithfully operationalize the Habermas-derived dimensions and correlate with human judgments, the work would provide a practical, quantitative tool for auditing political alignment risks in LLMs, highlighting actionable trade-offs between alignment techniques. The multi-model evaluation and explicit focus on both fine-tuning and prompting methods add empirical breadth, though the absence of validation data currently constrains the framework's reliability for policy or deployment decisions.
major comments (3)
- [Abstract and §3] Abstract and §3 (Framework): The central claim that the framework reveals reliable trade-offs rests on automated proxies (toxicity/anger detectors for fairness, factuality checks for truthfulness, etc.), yet no validation against human or expert political-science ratings is reported. In contested political discourse these proxies are known to be noisy; without correlation evidence the reported patterns (e.g., larger models more truthful yet more toxic) remain uninterpretable.
- [Abstract and §4] Abstract and §4 (Experiments): No details are supplied on the concrete datasets, prompt templates, or statistical methods used to compute the four metrics, nor on how role-playing vs. fine-tuning alignments were implemented across the nine models. This absence prevents assessment of whether the quantitative results actually support the stated trade-offs.
- [§5] §5 (Results): The finding that fine-tuned models exhibit lower bias but increased hallucinations and reduced reasoning performance is load-bearing for the paper's policy recommendation, yet it is presented without error bars, significance tests, or controls for model size and base capability, making it impossible to isolate the effect of the alignment method.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a brief table or bullet list explicitly defining each of the four dimensions and the automated metric chosen for it.
- [Introduction] Citation of prior work on LLM bias/toxicity benchmarks (e.g., RealToxicityPrompts, TruthfulQA) is missing; adding these would clarify how the new metrics relate to existing ones.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We agree that additional validation, experimental details, and statistical controls will strengthen the manuscript and address concerns about interpretability. We outline our point-by-point responses and planned revisions below.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (Framework): The central claim that the framework reveals reliable trade-offs rests on automated proxies (toxicity/anger detectors for fairness, factuality checks for truthfulness, etc.), yet no validation against human or expert political-science ratings is reported. In contested political discourse these proxies are known to be noisy; without correlation evidence the reported patterns (e.g., larger models more truthful yet more toxic) remain uninterpretable.
Authors: We agree that human validation is essential to demonstrate that the automated proxies faithfully capture the Habermas-derived dimensions. In the revised manuscript we will add a dedicated subsection to §3 reporting a human evaluation: we will sample 200 model responses across the four dimensions, obtain ratings from three political-science experts per response on 5-point scales for each dimension, and report Pearson/Spearman correlations between these ratings and our automated metrics. This will directly address the concern about proxy noise and support the reported trade-offs. revision: yes
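A minimal sketch of the proposed validation, assuming the setup the authors describe (200 sampled responses, three expert raters on 5-point scales) and using synthetic stand-in data in place of real ratings:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Synthetic stand-ins for one dimension (e.g., fairness): an automated metric score and
# the mean of three expert ratings on a 5-point scale, for the same 200 sampled responses.
auto_scores = rng.random(200)
expert_ratings = rng.integers(1, 6, size=(200, 3)).mean(axis=1)

r, p_r = pearsonr(auto_scores, expert_ratings)
rho, p_rho = spearmanr(auto_scores, expert_ratings)
print(f"Pearson r = {r:.2f} (p = {p_r:.3f}); Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```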
Referee: [Abstract and §4] Abstract and §4 (Experiments): No details are supplied on the concrete datasets, prompt templates, or statistical methods used to compute the four metrics, nor on how role-playing vs. fine-tuning alignments were implemented across the nine models. This absence prevents assessment of whether the quantitative results actually support the stated trade-offs.
Authors: We acknowledge the current version lacks sufficient methodological transparency. In the revised §4 we will provide: (i) the exact prompt templates and source datasets (including political ideology statements and reasoning/hallucination benchmarks) used for each metric; (ii) full implementation details for both fine-tuning (base models, alignment datasets, training hyperparameters) and role-play prompting (system prompts and few-shot examples); and (iii) the precise aggregation and normalization procedures for each of the four quantitative metrics. These additions will enable full reproducibility and allow readers to evaluate the strength of the observed trade-offs. revision: yes
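As one illustration of the kind of aggregation detail being requested, a plausible scheme is per-metric min-max normalization across models followed by an equal-weight mean. This is a hypothetical procedure and made-up numbers, not the paper's documented pipeline.

```python
import numpy as np

def min_max_normalize(raw: np.ndarray) -> np.ndarray:
    """Rescale one metric's raw scores across models to [0, 1]."""
    lo, hi = raw.min(), raw.max()
    return np.zeros_like(raw) if hi == lo else (raw - lo) / (hi - lo)

# Rows = models, columns = the four dimensions; the raw metrics deliberately sit on
# different scales, which is why per-column normalization is needed before averaging.
raw_scores = np.array([
    [0.62, 0.18, 0.71, 3.4],   # model A (illustrative values)
    [0.80, 0.35, 0.76, 4.1],   # model B
    [0.55, 0.10, 0.64, 2.9],   # model C
])

normalized = np.apply_along_axis(min_max_normalize, 0, raw_scores)  # normalize each column
overall = normalized.mean(axis=1)                                    # equal-weight summary per model
```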
Referee: [§5] §5 (Results): The finding that fine-tuned models exhibit lower bias but increased hallucinations and reduced reasoning performance is load-bearing for the paper's policy recommendation, yet it is presented without error bars, significance tests, or controls for model size and base capability, making it impossible to isolate the effect of the alignment method.
Authors: We concur that stronger statistical presentation is required to isolate alignment effects. In the revised §5 we will: add error bars (standard error of the mean) to all bar plots and tables; perform and report paired t-tests (or Wilcoxon tests where normality fails) comparing fine-tuned versus role-play versions within each model family; and introduce model-size controls by reporting separate subgroup analyses for models of comparable scale (e.g., 7B vs. 7B, 70B vs. 70B). These changes will clarify whether the observed declines are attributable to the alignment technique rather than base capability. revision: yes
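A minimal sketch of the planned statistical treatment, with synthetic stand-in scores: a paired t-test when the paired differences pass a normality check, a Wilcoxon signed-rank test otherwise, plus standard errors of the mean for the error bars.

```python
import numpy as np
from scipy.stats import sem, shapiro, ttest_rel, wilcoxon

rng = np.random.default_rng(1)

# Synthetic stand-ins: one audit metric scored per prompt for the fine-tuned and
# role-play variants of the same base model (paired by prompt).
fine_tuned = rng.random(40)
role_play = rng.random(40)

diff = fine_tuned - role_play
if shapiro(diff).pvalue > 0.05:          # differences look roughly normal
    stat, p = ttest_rel(fine_tuned, role_play)
    test_name = "paired t-test"
else:                                    # fall back to a non-parametric paired test
    stat, p = wilcoxon(fine_tuned, role_play)
    test_name = "Wilcoxon signed-rank"

print(f"{test_name}: statistic = {stat:.2f}, p = {p:.3f}")
print(f"SEM fine-tuned = {sem(fine_tuned):.3f}, SEM role-play = {sem(role_play):.3f}")
```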
Circularity Check
No circularity: empirical audit applies external framework without self-referential reduction
full rationale
The paper introduces a four-dimensional audit framework drawn from Habermas' Theory of Communicative Action and applies it via automated metrics to nine LLMs under two alignment methods. No equations, fitted parameters, or predictions are defined in terms of the framework's own outputs; the reported trade-offs (larger models more truthful yet less fair; fine-tuning lowers bias but raises hallucinations) are presented as direct empirical results from metric application rather than logical necessities or renamed inputs. The framework is positioned as an independent evaluation tool grounded in external theory, with no load-bearing self-citations or ansatzes that collapse the derivation chain. This is the standard non-circular pattern for an audit study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Habermas' Theory of Communicative Action can be translated into four quantitative metrics (effectiveness, fairness, truthfulness, persuasiveness) suitable for auditing LLM political alignment.
Reference graph
Works this paper leans on
- [1] Junfeng Jiao, Saleh Afroogh, Yiming Xu, and Connor Phillips. Navigating LLM ethics: Advancements, challenges, and future directions. arXiv preprint arXiv:2406.18841, 2024.
- [2] Tavishi Choudhary. Political bias in large language models: A comparative analysis of ChatGPT-4, Perplexity, Google Gemini, and Claude. IEEE Access, 13:11341–11379, 2025.
- [3] Jochen Hartmann, Jasper Schwenzow, and Maximilian Witte. The political ideology of conversational AI: Converging evidence on ChatGPT's pro-environmental, left-libertarian orientation, 2023.
- [4] Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose opinions do language models reflect? In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, 2023.
- [5] David Rozado. The political preferences of LLMs. PLOS ONE, 19(7):1–15, 2024.
- [6] Ahmed Agiza, Mohamed Mostagir, and Sherief Reda. PoliTune: Analyzing the impact of data selection and fine-tuning on economic and political biases in large language models. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 7(1):2–12, 2024.
- [7] U. K. H. Ecker, S. Lewandowsky, J. Cook, et al. The psychological drivers of misinformation belief and its resistance to correction. Nature Reviews Psychology, 1:13–29, 2022.
- [8] Alice Marwick and Rebecca Lewis. Media manipulation and disinformation online. New York: Data & Society Research Institute, 359:1146–1151, 2017.
- [9] Jürgen Habermas. The Theory of Communicative Action. Beacon Press, Boston, Mass., 1981.
- [10] Rajesh Ranjan, Shailja Gupta, and Surya Narayan Singh. A comprehensive survey of bias in LLMs: Current landscape and future directions, 2024.
- [11] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
- [12] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36, pages 53728–53741. Curran Associates, 2023.
- [13] Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu Zhu, Xiang-Bo Mao, Sitaram Asur, and Na Cheng. A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and more, 2024.
- [14] Yejin Bang, Delong Chen, Nayeon Lee, and Pascale Fung. Measuring political bias in large language models: What is said and how it is said. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11142–11159, Bangkok, Thailand, August 2024.
- [15] Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair NLP models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
- [16] Yejin Bang, Delong Chen, Nayeon Lee, and Pascale Fung. Measuring political bias in large language models: What is said and how it is said, 2024.
- [17] Ilias Chalkidis and Stephanie Brandl. Llama meets EU: Investigating the European political spectrum through the lens of LLMs. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), 2024.
- [18] Kyarash Shahriari and Mana Shahriari. IEEE standard review – Ethically aligned design: A vision for prioritizing human wellbeing with artificial intelligence and autonomous systems. In 2017 IEEE Canada International Humanitarian Technology Conference (IHTC), pages 197–201. IEEE, 2017.
- [19] Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024.
- [20] Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, Kendall N. Niles, Ken Pathak, and Steven Sloan. Towards trustworthy AI: A review of ethical and robust large language models, 2024.
- [21] Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy LLMs: A survey and guideline for evaluating large language models' alignment. Preprint, 2023.
- [22] On the validity of normative life: Habermas' discourse ethics. Epoché Magazine, August 2023.
- [23] Wendy Cukier, Robert Bauer, and Catherine Middleton. Applying Habermas' Validity Claims as a Standard for Critical Discourse Analysis, pages 233–258. Springer US, Boston, MA, 2004.
- [24] Marco R. Steenbergen, André Bächtiger, Markus Spörndli, and Jürg Steiner. Measuring political deliberation: A discourse quality index. Comparative European Politics, 1(1):21–48, March 2003.
- [25] Michael Henry Tessler, Michiel A. Bakker, Daniel Jarrett, Hannah Sheahan, Martin J. Chadwick, Raphael Koster, Georgina Evans, Lucy Campbell-Gillingham, Tantum Collins, David C. Parkes, Matthew Botvinick, and Christopher Summerfield. AI can help humans find common ground in democratic deliberation. Science, 386(6719):eadq2852, 2024.
- [26] Nicolás Palomo Hernández. Towards automating deliberation? The idea of deliberative democracy embedded in Google's Habermas Machine. 8:1951–1960.
- [27] Lawrence Fisher. Computation and deliberation: The ghost in the Habermas Machine, December 2024.
- [28] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. CoRR, abs/2109.07958, 2021.
- [29] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
- [30] Association for Computational Linguistics.
- [31] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations, 2021.
- [32] Kian Long Tan, Chin Poo Lee, Kalaiarasi Sonai Muthu Anbananthen, and Kian Ming Lim. RoBERTa-LSTM: A hybrid model for sentiment analysis with transformer and recurrent neural network. IEEE Access, 10:21517–21525, 2022.
- [33] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
- [34] Dimosthenis Antypas and Jose Camacho-Collados. Robust hate speech detection in social media: A cross-dataset empirical evaluation. In The 7th Workshop on Online Abuse and Harms (WOAH), pages 231–242, Toronto, Canada, July 2023. Association for Computational Linguistics.
- [35] Maurice Duverger. Political Parties. Taylor & Francis, 1963.
- [36] Pace News Ltd. Political compass test. https://www.politicalcompass.org/test, 2001. Accessed: 2025-07-29.
- [37] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- [38] Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. BOLD: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT '21), pages 862–872, New York, NY, USA, 2021. Association for Computing Machinery.
- [39] Laura Hanu and Unitary team. Detoxify. GitHub. https://github.com/unitaryai/detoxify, 2020.
- [40] Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online, July 2020. Association for Computational Linguistics.
- [41] Kobi Hackenburg, Ben M. Tappin, Luke Hewitt, Ed Saunders, Sid Black, Hause Lin, Catherine Fist, Helen Margetts, David G. Rand, and Christopher Summerfield. The levers of political persuasion with conversational AI, 2025.
- [42] Simon Martin Breum, Daniel Vædele Egdal, Victor Gram Mortensen, Anders Giovanni Møller, and Luca Maria Aiello. The persuasive power of large language models. Proceedings of the International AAAI Conference on Web and Social Media, 18(1):152–163, May 2024.
- [43] Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa-Anke, and Leonardo Neves. TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. In Proceedings of Findings of EMNLP, 2020.