How Far Will They Go? Red-Teaming Online Influence with Large Language Models
Pith reviewed 2026-05-25 05:34 UTC · model grok-4.3
The pith
Open-source LLMs reliably generate more left-leaning political content than right-leaning, with expressible ranges narrowing as models increase in size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By introducing an empirical framework that defines an LLM Overton Window as the span of political opinions a model will reliably express, the work shows that open-source models are typically more willing to generate left-leaning social media content, that these windows contract as model size grows, that regional origins produce substantial differences even with uneven representation in the open ecosystem, and that jailbreak effectiveness varies sharply across families.
What carries the argument
The LLM Overton Window: the range of political opinions a model can reliably express on controversial topics. The window is measured before and after natural-language jailbreaks are applied, and the difference quantifies how far the jailbreaks expand it.
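A minimal sketch of how such a window might be computed is given below. It is not the paper's protocol: the seven-point stance scale (-3 for far left, +3 for far right), the 0.8 reliability threshold, the function names, and the compliance rates are all illustrative assumptions. The window is taken as the span between the leftmost and rightmost stances the model expresses at or above the threshold, and the jailbreak effect is the change in that span's width.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stance scale: -3 (far left) ... +3 (far right).
# Each value is the fraction of sampled generations judged to express
# the requested stance rather than refusing or hedging.
ComplianceRates = Dict[int, float]
Window = Optional[Tuple[int, int]]


def overton_window(rates: ComplianceRates, threshold: float = 0.8) -> Window:
    """Return (leftmost, rightmost) stance expressed at or above threshold."""
    reliable = [stance for stance, rate in rates.items() if rate >= threshold]
    if not reliable:
        return None  # the model reliably expresses no stance at all
    return (min(reliable), max(reliable))


def window_width(window: Window) -> int:
    """Width of the span; an empty window has width 0."""
    if window is None:
        return 0
    return window[1] - window[0]


def expansion(baseline: ComplianceRates, jailbroken: ComplianceRates,
              threshold: float = 0.8) -> int:
    """Change in window width after a jailbreak is applied."""
    before = window_width(overton_window(baseline, threshold))
    after = window_width(overton_window(jailbroken, threshold))
    return after - before


# Made-up rates for one model on one topic (not data from the paper).
baseline = {-3: 0.55, -2: 0.90, -1: 0.98, 0: 1.00, 1: 0.92, 2: 0.40, 3: 0.10}
jailbroken = {-3: 0.85, -2: 0.95, -1: 0.99, 0: 1.00, 1: 0.97, 2: 0.88, 3: 0.35}

print(overton_window(baseline))         # (-2, 1)
print(overton_window(jailbroken))       # (-3, 2)
print(expansion(baseline, jailbroken))  # 2
```

This sketch treats the window as a simple span, so a gap in the middle of the scale would go unnoticed; a stricter definition could require every stance inside the span to clear the threshold.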
If this is right
- Larger models within the same family become harder to steer toward the edges of the political spectrum.
- Jailbreak success depends on the specific model family, so effective combinations must be identified per family.
- Regional differences are substantial despite uneven regional representation in the open-source ecosystem, suggesting origin-specific alignment effects.
- The framework supplies a repeatable audit method that future model releases can be measured against.
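The per-family search for effective jailbreak combinations could look roughly like the sketch below. The paper's actual workflow is not described on this page, so everything here is an assumption: the technique names, the family names, the `TOY_WIDTHS` lookup table that stands in for real measurements, and the choice to search exhaustively over small subsets.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Tuple

Combo = FrozenSet[str]


def best_combination(techniques: Iterable[str],
                     width_fn: Callable[[Combo], int],
                     max_size: int = 2) -> Tuple[Combo, int]:
    """Exhaustively try every subset up to max_size; keep the widest window."""
    empty: Combo = frozenset()
    best = (empty, width_fn(empty))  # baseline: no jailbreak applied
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(techniques), k):
            candidate = frozenset(combo)
            width = width_fn(candidate)
            if width > best[1]:  # strict: ties keep the smaller combination
                best = (candidate, width)
    return best


# Stand-in for real measurements: window width per (family, combination).
# All names and numbers are hypothetical.
TOY_WIDTHS = {
    "family_a": {
        frozenset(): 3,
        frozenset({"persona"}): 4,
        frozenset({"fiction"}): 3,
        frozenset({"persona", "fiction"}): 6,
    },
    "family_b": {
        frozenset(): 2,
        frozenset({"persona"}): 2,
        frozenset({"fiction"}): 5,
        frozenset({"persona", "fiction"}): 4,
    },
}

results = {
    family: best_combination(["persona", "fiction"], table.__getitem__)
    for family, table in TOY_WIDTHS.items()
}
for family, (combo, width) in results.items():
    print(family, sorted(combo), width)
```

In this toy table the best combination differs by family, and stacking techniques helps one family while hurting the other, which is the kind of pattern that would make a per-family search necessary.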
Where Pith is reading between the lines
- Alignment methods used on open models may embed directional preferences that are not symmetric across the political spectrum.
- Audits focused only on frontier API models would miss the models most accessible for localized influence operations.
- The same measurement approach could be extended to track whether new releases widen or narrow windows on the same topics.
Load-bearing premise
The chosen controversial topics, prompts, and jailbreaks form a representative sample that reveals genuine model capacities rather than artifacts of the particular evaluation design.
What would settle it
A replication using a fresh set of topics or an automated scoring method that finds no left-leaning asymmetry and no inverse relationship with model size would falsify the reported patterns.
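A replication could check both patterns with something like the following sketch. It assumes each model's window is recorded as a (leftmost, rightmost) pair on a signed scale where negative values are left-leaning, that asymmetry is measured as how much further the window reaches left than right, and that the size relationship is tested with a Spearman rank correlation. None of these choices are confirmed as the paper's method, and the four model records are synthetic values chosen only to exercise the code.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either input is constant


# Synthetic records: (parameters in billions, (leftmost, rightmost) stance).
models = [(1, (-3, 2)), (7, (-3, 1)), (14, (-2, 1)), (70, (-2, 0))]

sizes = [size for size, _ in models]
widths = [hi - lo for _, (lo, hi) in models]
# Positive value: the window reaches further left than right.
left_lean = [-lo - hi for _, (lo, hi) in models]

print(spearman(sizes, widths))  # negative if windows shrink with size
print(left_lean)                # all positive if every model leans left
```

Under this framing, the reported patterns would be contradicted by a correlation near zero or above it, and by left-lean values that scatter around zero.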
Original abstract
As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, given their superior alignment with the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an empirical red-teaming framework for measuring LLM Overton Windows (the range of political opinions a model can reliably express on controversial topics) and quantifying the effect of natural-language jailbreaks on that range. It evaluates more than 30 open-source LLMs across 10 families and five countries of origin, reporting systematic asymmetries: greater willingness to generate left-leaning content, inverse contraction of OWs with model size, substantial regional differences, and sharp variation in jailbreak potency across families. The work positions the framework as a practical tool for auditing political steerability and designing countermeasures against LLM-enabled influence campaigns.
Significance. If the empirical results hold under scrutiny, the framework offers a concrete auditing method for open-source LLMs relevant to privacy-conscious malicious actors, which is timely given increasing LLM participation in online discourse. The scale of the evaluation (30+ models, multiple families and origins) and the explicit workflow for identifying effective jailbreak combinations are strengths that could support reproducible follow-up work. The focus on locally deployable models rather than API-only frontier systems aligns with realistic threat models.
major comments (1)
- [Abstract] Abstract: the directional claims of systematic asymmetries in political expressivity (left-leaning bias, inverse size relation, regional differences, and jailbreak variation) are presented without any information on topic selection, prompt templates, scoring criteria for 'reliably express,' statistical methods, or controls for prompt sensitivity. This absence is load-bearing because it prevents assessment of whether the measured ranges reflect model capacity or artifacts of the chosen prompts and evaluators.
minor comments (1)
- The acronym OW is introduced for the invented construct 'LLM Overton Window'; the introduction should explicitly distinguish this operational definition from the classical Overton window in political science and justify the extension.
Simulated Author's Rebuttal
We thank the referee for their review. We address the single major comment on the abstract below.
- Referee: [Abstract] Abstract: the directional claims of systematic asymmetries in political expressivity (left-leaning bias, inverse size relation, regional differences, and jailbreak variation) are presented without any information on topic selection, prompt templates, scoring criteria for 'reliably express,' statistical methods, or controls for prompt sensitivity. This absence is load-bearing because it prevents assessment of whether the measured ranges reflect model capacity or artifacts of the chosen prompts and evaluators.
  Authors: The abstract is intentionally concise and summarizes the key findings, consistent with standard academic practice. All requested details are provided in the main text: topic selection and prompt design in Section 3.1, jailbreak templates in Section 3.2, the definition and scoring of 'reliably express' (including inter-annotator agreement) in Section 3.3, statistical methods and controls for prompt sensitivity in Section 4, and robustness checks in Section 4.2. Readers can therefore evaluate whether the reported asymmetries reflect model behavior rather than artifacts. To address the concern directly, we will expand the abstract with one additional sentence summarizing the evaluation protocol.
  Revision: yes
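The rebuttal mentions inter-annotator agreement without naming the statistic on this page. As one common option for two annotators assigning nominal labels, Cohen's kappa could be computed as in the sketch below; the choice of kappa, the label names, and the ten example annotations are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and equally long")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments: did the output express the requested stance?
annotator_1 = ["express", "express", "express", "refuse", "refuse",
               "express", "refuse", "express", "express", "refuse"]
annotator_2 = ["express", "express", "refuse", "refuse", "refuse",
               "express", "refuse", "express", "express", "express"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.583
```

Here the annotators agree on 8 of 10 items, chance agreement is 0.52, and kappa is (0.80 - 0.52) / (1 - 0.52), which shows why raw agreement alone overstates reliability when one label dominates.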
Circularity Check
No significant circularity
The paper presents an empirical measurement study that evaluates over 30 LLMs across model families by directly prompting them on controversial topics and measuring the range of expressible political opinions (Overton Windows). The abstract and description contain no equations, fitted parameters, derivations, predictions that reduce to inputs, or self-citations invoked as load-bearing uniqueness theorems. All reported findings (asymmetries in expressivity, size effects, regional differences, jailbreak potency) are presented as outcomes of the experimental protocol rather than results that are definitionally equivalent to the inputs or prior author work. This is a standard non-circular empirical audit.
Axiom & Free-Parameter Ledger
invented entities (1)
- LLM Overton Window (OW): no independent evidence