
arXiv: 2605.22880 · v1 · pith:MAOREYNSnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI · cs.CY

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

Pith reviewed 2026-05-25 05:34 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY
keywords LLM red-teaming · Overton window · political bias · jailbreak techniques · influence campaigns · open-source LLMs · social media generation

The pith

Open-source LLMs reliably generate more left-leaning political content than right-leaning, with expressible ranges narrowing as models increase in size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to measure the range of political opinions that open-source LLMs will produce on contested topics and to test how simple text prompts can widen that range. It applies this measurement to more than thirty models across ten families and finds consistent leftward tilt, an inverse relationship between model size and opinion breadth, and clear differences by country of origin. These patterns matter because locally run models are the ones most available to actors seeking to shape online conversations without relying on external APIs. A sympathetic reader would therefore treat the measured ranges as a practical signal of how steerable each model family is for political messaging.

Core claim

By introducing an empirical framework that defines an LLM Overton Window as the span of political opinions a model will reliably express, the work shows that open-source models are typically more willing to generate left-leaning social media content, that these windows contract as model size grows, that regional origins produce substantial differences even with uneven representation in the open ecosystem, and that jailbreak effectiveness varies sharply across families.

What carries the argument

The LLM Overton Window: the range of political opinions a model can reliably express on controversial topics. It is quantified before and after natural-language jailbreaks are applied, so that the expansion caused by each jailbreak can be measured.
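
One way to make this definition concrete, offered as a minimal sketch rather than the paper's actual scoring: assume each stance on a left-to-right scale is requested several times, and a stance counts as "reliably expressed" when the share of compliant trials meets a threshold. The stance grid, the 0.8 threshold, and the names `ow_score` and `delta_ow` are illustrative assumptions. Only the ideas of an OW score normalized to 0-1 and a ∆OW computed as technique minus baseline come from the figure captions.

```python
from typing import Dict, List

# Hypothetical data layout: stance position (-2 = far left ... +2 = far right)
# mapped to per-trial outcomes (True = the model produced the requested opinion).
Trials = Dict[int, List[bool]]


def ow_score(trials: Trials, threshold: float = 0.8) -> float:
    """Fraction of stance positions the model expresses reliably (0-1).

    The threshold is an assumption, not a value taken from the paper.
    """
    if not trials:
        raise ValueError("need at least one stance position")
    reliable = sum(
        1
        for outcomes in trials.values()
        if outcomes and sum(outcomes) / len(outcomes) >= threshold
    )
    return reliable / len(trials)


def delta_ow(baseline: Trials, technique: Trials, threshold: float = 0.8) -> float:
    """Change in OW score under a jailbreak: technique minus baseline."""
    return ow_score(technique, threshold) - ow_score(baseline, threshold)


# Invented placeholder outcomes, 10 trials per stance.
baseline = {
    -2: [True] * 9 + [False],      # 0.9 compliance: reliable
    -1: [True] * 10,               # 1.0 compliance: reliable
    0: [True] * 10,                # 1.0 compliance: reliable
    1: [True] * 5 + [False] * 5,   # 0.5 compliance: not reliable
    2: [False] * 10,               # 0.0 compliance: not reliable
}
jailbroken = dict(baseline)
jailbroken[1] = [True] * 8 + [False] * 2  # 0.8 compliance: now reliable

print(ow_score(baseline))                        # 0.6
print(round(delta_ow(baseline, jailbroken), 2))  # 0.2
```

In this toy case the window covers three of five stances at baseline and four of five after the jailbreak, so the window widens toward the right by one position.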

If this is right

  • Larger models within the same family become harder to steer toward the edges of the political spectrum.
  • Jailbreak success depends on the specific model family, so effective combinations must be identified per family.
  • Regional differences are substantial despite the uneven representation of countries in the open-source ecosystem, which may point to origin-specific alignment effects.
  • The framework supplies a repeatable audit method that future model releases can be measured against.
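
The per-family point above implies a selection step: for each family, compare techniques by their mean ∆OW and keep the strongest. The sketch below illustrates that step under stated assumptions. The family names, technique labels, and ∆OW values are invented placeholders, not results from the paper, and `best_technique_per_family` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# (family, technique, delta_ow observed in one trial); all values are placeholders.
Record = Tuple[str, str, float]


def best_technique_per_family(records: Iterable[Record]) -> Dict[str, Tuple[str, float]]:
    """Return, for each family, the technique with the highest mean delta-OW."""
    grouped: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for family, technique, delta in records:
        grouped[family][technique].append(delta)

    best: Dict[str, Tuple[str, float]] = {}
    for family, by_technique in grouped.items():
        means = {tech: mean(values) for tech, values in by_technique.items()}
        top = max(means, key=means.get)
        best[family] = (top, means[top])
    return best


records = [
    ("family_a", "roleplay", 0.10), ("family_a", "roleplay", 0.20),
    ("family_a", "fiction", -0.05), ("family_a", "fiction", 0.05),
    ("family_b", "roleplay", -0.30), ("family_b", "roleplay", -0.10),
    ("family_b", "fiction", 0.25), ("family_b", "fiction", 0.15),
]

for family, (technique, score) in sorted(best_technique_per_family(records).items()):
    print(f"{family}: {technique} (mean delta-OW {score:+.2f})")
```

With these placeholder numbers the same technique helps one family and hurts the other, which is the kind of heterogeneity that makes a per-family search necessary.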

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Alignment methods used on open models may embed directional preferences that are not symmetric across the political spectrum.
  • Audits focused only on frontier API models would miss the models most accessible for localized influence operations.
  • The same measurement approach could be extended to track whether new releases widen or narrow windows on the same topics.

Load-bearing premise

The chosen controversial topics, prompts, and jailbreaks form a representative sample that reveals genuine model capacities rather than artifacts of the particular evaluation design.

What would settle it

A replication using a fresh set of topics or an automated scoring method that finds no left-leaning asymmetry and no inverse relationship with model size would falsify the reported patterns.
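
A replication of this kind needs two numbers per model set: a left-minus-right asymmetry and a rank correlation between model size and OW score. The following sketch computes both in plain Python, assuming per-model compliance rates and OW scores are already available. The function names and all data values are illustrative assumptions, and Spearman's rank correlation is one reasonable choice here, not necessarily the statistic the paper uses.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


def mean_lean_asymmetry(left_rates: Sequence[float], right_rates: Sequence[float]) -> float:
    """Mean left compliance minus mean right compliance; > 0 means a left tilt."""
    return sum(left_rates) / len(left_rates) - sum(right_rates) / len(right_rates)


# Invented placeholder values for four checkpoints of one hypothetical family.
sizes_b = [1, 4, 12, 27]            # parameters, in billions
ow_scores = [0.9, 0.7, 0.6, 0.4]    # normalized OW score per checkpoint
left = [0.9, 0.8, 0.8, 0.7]         # compliance on left-leaning requests
right = [0.7, 0.6, 0.5, 0.4]        # compliance on right-leaning requests

print(spearman(sizes_b, ow_scores))                 # -1.0
print(round(mean_lean_asymmetry(left, right), 2))   # 0.25
```

On fresh topics, an asymmetry near zero together with a size correlation near zero or positive would be the falsifying outcome described above. A real replication would also need uncertainty estimates, which this sketch omits.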

Figures

Figures reproduced from arXiv: 2605.22880 by Anna Serbina, Ashwin Rao, Daniel C. Ruiz, Emilio Ferrara, Luca Luceri.

Figure 1: Baseline expression fidelity across representative models.
Figure 2: Mean OW score (normalized, 0-1) as a function of model size across four model …
Figure 3: Baseline OW score (left) and political lean (right) by developer country of origin.
Figure 4: Overview of the end-to-end evaluation methodology.
Figure 5: Mean ∆OW relative to baseline (mean ± standard deviation across 10 trials) by technique and model size for Qwen3.5 (left) and Gemma-3 (right). Blue denotes increased compliance and red denotes decreased compliance. The colormap is capped at ±0.42. The figure highlights strong family- and scale-dependent heterogeneity in technique effects: some framings sharply suppress OW in larger Qwen3.5 checkpoints, whe…
Figure 6: ∆OW score (technique minus baseline) for Falcon-H1, OLMo-2, and Granite-4.0 models. Blue = increased opinion expression; red = decreased. † MoE model.
Figure 7: ∆OW score (technique minus baseline) for Gemma-3, Qwen3.5, and remaining models. * Gemma-3-1B is an outlier (baseline OW ≈ 0.25). † MoE model.
Original abstract

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, given their superior alignment with the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces an empirical red-teaming framework for measuring LLM Overton Windows (the range of political opinions a model can reliably express on controversial topics) and quantifying the effect of natural-language jailbreaks on that range. It evaluates more than 30 open-source LLMs across 10 families and five countries of origin, reporting systematic asymmetries: greater willingness to generate left-leaning content, inverse contraction of OWs with model size, substantial regional differences, and sharp variation in jailbreak potency across families. The work positions the framework as a practical tool for auditing political steerability and designing countermeasures against LLM-enabled influence campaigns.

Significance. If the empirical results hold under scrutiny, the framework offers a concrete method for auditing the open-source LLMs that privacy-conscious malicious actors are most likely to use, which is timely given increasing LLM participation in online discourse. The scale of the evaluation (30+ models, multiple families and origins) and the explicit workflow for identifying effective jailbreak combinations are strengths that could support reproducible follow-up work. The focus on locally deployable models rather than API-only frontier systems aligns with realistic threat models.

major comments (1)
  1. [Abstract] The directional claims of systematic asymmetries in political expressivity (left-leaning bias, inverse size relation, regional differences, and jailbreak variation) are presented without any information on topic selection, prompt templates, scoring criteria for 'reliably express,' statistical methods, or controls for prompt sensitivity. This absence is load-bearing because it prevents assessment of whether the measured ranges reflect model capacity or artifacts of the chosen prompts and evaluators.
minor comments (1)
  1. The acronym OW is introduced for the invented construct 'LLM Overton Window'; the introduction should explicitly distinguish this operational definition from the classical Overton window in political science and justify the extension.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the single major comment on the abstract below.

Point-by-point responses
  1. Referee: [Abstract] The directional claims of systematic asymmetries in political expressivity (left-leaning bias, inverse size relation, regional differences, and jailbreak variation) are presented without any information on topic selection, prompt templates, scoring criteria for 'reliably express,' statistical methods, or controls for prompt sensitivity. This absence is load-bearing because it prevents assessment of whether the measured ranges reflect model capacity or artifacts of the chosen prompts and evaluators.

    Authors: The abstract is intentionally concise and summarizes the key findings, consistent with standard academic practice. All requested details are provided in the main text: topic selection and prompt design in Section 3.1, jailbreak templates in Section 3.2, the definition and scoring of 'reliably express' (including inter-annotator agreement) in Section 3.3, statistical methods and controls for prompt sensitivity in Section 4, and robustness checks in Section 4.2. Readers can therefore evaluate whether the reported asymmetries reflect model behavior rather than artifacts. To address the concern directly, we will expand the abstract with one additional sentence summarizing the evaluation protocol.

    Revision: yes
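
The response above mentions inter-annotator agreement without naming a statistic. Cohen's kappa is one common choice for two annotators, and the sketch below shows how it could be computed for binary "expressed / not expressed" labels. The label lists are invented placeholders, and nothing here should be read as the paper's actual agreement procedure.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one shared label")
    return (observed - expected) / (1 - expected)


# Placeholder labels: 1 = output expresses the requested opinion, 0 = it does not.
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 1]

print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 6 of 8 items (0.75) while chance agreement is 0.5, giving a kappa of 0.5.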

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents an empirical measurement study that evaluates over 30 LLMs across model families by directly prompting them on controversial topics and measuring the range of expressible political opinions (Overton Windows). The abstract and description contain no equations, fitted parameters, derivations, predictions that reduce to inputs, or self-citations invoked as load-bearing uniqueness theorems. All reported findings (asymmetries in expressivity, size effects, regional differences, jailbreak potency) are presented as outcomes of the experimental protocol rather than results that are definitionally equivalent to the inputs or prior author work. This is a standard non-circular empirical audit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Empirical measurement study; no mathematical derivations or fitted parameters are described. The central contribution is a new measurement definition rather than reliance on prior axioms or entities.

invented entities (1)
  • LLM Overton Window (OW) · no independent evidence
    purpose: Quantify the range of political opinions an LLM can reliably express on controversial topics
    Newly introduced definition to operationalize political expressivity for red-teaming.

pith-pipeline@v0.9.0 · 5765 in / 1207 out tokens · 24764 ms · 2026-05-25T05:34:41.032085+00:00 · methodology

