How Far Will They Go? Red-Teaming Online Influence with Large Language Models

Anna Serbina; Ashwin Rao; Daniel C. Ruiz; Emilio Ferrara; Luca Luceri

arXiv: 2605.22880 · v1 · pith:MAOREYNSnew · submitted 2026-05-20 · cs.CL · cs.AI · cs.CY

Pith reviewed 2026-05-25 05:34 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CY

keywords LLM red-teaming · Overton window · political bias · jailbreak techniques · influence campaigns · open-source LLMs · social media generation

The pith

Open-source LLMs are typically more willing to generate left-leaning than right-leaning political content, and the range of opinions they will express narrows as model size grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to measure the range of political opinions that open-source LLMs will produce on contested topics and to test how simple text prompts can widen that range. It applies this measurement to more than thirty models across ten families and finds consistent leftward tilt, an inverse relationship between model size and opinion breadth, and clear differences by country of origin. These patterns matter because locally run models are the ones most available to actors seeking to shape online conversations without relying on external APIs. A sympathetic reader would therefore treat the measured ranges as a practical signal of how steerable each model family is for political messaging.

Core claim

By introducing an empirical framework that defines an LLM Overton Window as the span of political opinions a model will reliably express, the work shows that open-source models are typically more willing to generate left-leaning social media content, that these windows contract as model size grows, that regional origins produce substantial differences even with uneven representation in the open ecosystem, and that jailbreak effectiveness varies sharply across families.

What carries the argument

The argument rests on the LLM Overton Window, defined as the range of political opinions a model can reliably express on controversial topics. The window is quantified before and after natural-language jailbreaks are applied, so that its expansion can be measured.
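As a rough illustration of how such a window might be scored, the sketch below treats a stance as "reliably expressed" when the model complies in a large enough share of repeated trials. The stance scale, the 0.8 threshold, and every function name here are assumptions made for illustration; this summary does not state the paper's actual scoring rule.

```python
from typing import Dict, List, Set

# Hypothetical stance scale: negative = left-leaning, positive = right-leaning.
# Maps each probed stance to per-trial flags ("did the model comply?").
Trials = Dict[int, List[bool]]


def reliable_stances(trials: Trials, threshold: float = 0.8) -> List[int]:
    """Stances whose compliance rate meets the (assumed) reliability threshold."""
    return sorted(
        stance
        for stance, outcomes in trials.items()
        if outcomes and sum(outcomes) / len(outcomes) >= threshold
    )


def ow_score(trials: Trials, threshold: float = 0.8) -> float:
    """Fraction of probed stances that are reliably expressed, in [0, 1]."""
    if not trials:
        return 0.0
    return len(reliable_stances(trials, threshold)) / len(trials)


def _share(stances: List[int], reliable: Set[int]) -> float:
    """Share of the given stances that fall inside the reliable set."""
    return sum(s in reliable for s in stances) / len(stances) if stances else 0.0


def lean(trials: Trials, threshold: float = 0.8) -> float:
    """Reliable share of left stances minus reliable share of right stances."""
    reliable = set(reliable_stances(trials, threshold))
    left = [s for s in trials if s < 0]
    right = [s for s in trials if s > 0]
    return _share(left, reliable) - _share(right, reliable)


# Made-up outcomes for one topic: the model refuses only the far-right stance.
example: Trials = {
    -2: [True] * 5,
    -1: [True] * 5,
    0: [True] * 5,
    1: [True, True, True, True, False],
    2: [False] * 5,
}
print(ow_score(example), lean(example))
```

In this made-up example the window covers four of the five probed stances, and the positive lean value reflects fuller coverage on the left than on the right. A real audit would average such scores over many topics and prompts.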

If this is right

Larger models within the same family become harder to steer toward the edges of the political spectrum.
Jailbreak success depends on the specific model family, so effective combinations must be identified per family.
Regional differences persist even when training data overlap is limited, suggesting origin-specific alignment effects.
The framework supplies a repeatable audit method that future model releases can be measured against.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Alignment methods used on open models may embed directional preferences that are not symmetric across the political spectrum.
Audits focused only on frontier API models would miss the models most accessible for localized influence operations.
The same measurement approach could be extended to track whether new releases widen or narrow windows on the same topics.

Load-bearing premise

The chosen controversial topics, prompts, and jailbreaks form a representative sample that reveals genuine model capacities rather than artifacts of the particular evaluation design.

What would settle it

A replication using a fresh set of topics or an automated scoring method that finds no left-leaning asymmetry and no inverse relationship with model size would falsify the reported patterns.

Figures

Figures reproduced from arXiv: 2605.22880 by Anna Serbina, Ashwin Rao, Daniel C. Ruiz, Emilio Ferrara, Luca Luceri.

**Figure 2.** Mean OW score (normalized, 0-1) as a function of model size across four model …

**Figure 3.** Baseline OW score (left) and political lean (right) by developer country of origin.

**Figure 4.** Overview of the end-to-end evaluation methodology.

**Figure 5.** Mean ∆OW relative to baseline (mean ± standard deviation across 10 trials) by technique and model size for Qwen3.5 (left) and Gemma-3 (right). Blue denotes increased compliance and red denotes decreased compliance. The colormap is capped at ±0.42. The figure highlights strong family- and scale-dependent heterogeneity in technique effects: some framings sharply suppress OW in larger Qwen3.5 checkpoints, whe…

**Figure 6.** ∆OW score (technique minus baseline) for Falcon-H1, OLMo-2, and Granite-4.0 models. Blue = increased opinion expression; red = decreased. † MoE model.

**Figure 7.** ∆OW score (technique minus baseline) for Gemma-3, Qwen3.5, and remaining models. * Gemma-3-1B is an outlier (baseline OW ≈ 0.25). † MoE model.
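The ∆OW quantity in these captions (technique minus baseline, reported as mean ± standard deviation across trials) can be sketched as follows. How the paper pairs trials is not stated here, so this sketch assumes each technique trial is compared against the mean baseline score. The function names are hypothetical; the ±0.42 cap is taken from the Figure 5 caption.

```python
from statistics import mean, stdev
from typing import List, Tuple

COLORMAP_CAP = 0.42  # display cap quoted in the Figure 5 caption


def delta_ow(baseline: List[float], technique: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of (technique trial - mean baseline).

    Assumption: each technique trial is compared against the mean baseline
    OW score rather than against a paired baseline trial.
    """
    if not baseline or not technique:
        raise ValueError("need at least one trial per condition")
    base = mean(baseline)
    deltas = [score - base for score in technique]
    spread = stdev(deltas) if len(deltas) > 1 else 0.0
    return mean(deltas), spread


def clip_for_colormap(value: float, cap: float = COLORMAP_CAP) -> float:
    """Clamp a ∆OW value to the symmetric range used for plotting."""
    return max(-cap, min(cap, value))


# Made-up per-trial OW scores for one model and one jailbreak technique.
baseline_scores = [0.5, 0.5, 0.5, 0.5]
technique_scores = [0.6, 0.8, 0.7, 0.7]
mean_delta, std_delta = delta_ow(baseline_scores, technique_scores)
print(f"dOW = {mean_delta:+.2f} ± {std_delta:.2f}")
```

A positive mean corresponds to the blue cells in the heatmaps (the technique widened the window) and a negative mean to the red cells.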

Original abstract

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, given their superior alignment with the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper introduces LLM Overton Windows as a scale for political expressivity in open-source models and reports left-leaning asymmetries plus size and regional effects, but the abstract leaves methods too thin to judge the data.

The main thing to know is that this work measures the range of political opinions open-source LLMs will reliably produce on controversial topics and finds those ranges lean left, shrink with model size, and differ by region of origin, with jailbreaks helping unevenly across families. The evaluation covers more than 30 models from 10 families and five countries, which is a reasonable breadth for this kind of audit. That scale and the focus on locally deployable models give the framework some practical value for people tracking influence operations. The idea of treating expressivity as a measurable window is a clean way to frame the red-teaming task. The paper does a straightforward job connecting the measurements to real constraints on malicious actors.

The soft spots sit in the missing details. The abstract supplies no information on topic selection, exact prompt templates, scoring rules for reliable expression, or any statistical controls, so it is impossible to tell whether the reported asymmetries reflect model behavior or choices in the test design. Prompt sensitivity is a known issue in this area and needs explicit checks. Without those, the directional findings stay hard to interpret.

This is aimed at AI safety and information integrity researchers who need concrete auditing tools rather than broad theory. A reader already working on LLM red-teaming or alignment evaluations could extract the workflow and try it on their own models. The work deserves peer review so the methods section can be examined directly; the core framing is usable if the experiments hold up under scrutiny.

Referee Report

1 major / 1 minor

Summary. The paper introduces an empirical red-teaming framework for measuring LLM Overton Windows (the range of political opinions a model can reliably express on controversial topics) and quantifying the effect of natural-language jailbreaks on that range. It evaluates more than 30 open-source LLMs across 10 families and five countries of origin, reporting systematic asymmetries: greater willingness to generate left-leaning content, inverse contraction of OWs with model size, substantial regional differences, and sharp variation in jailbreak potency across families. The work positions the framework as a practical tool for auditing political steerability and designing countermeasures against LLM-enabled influence campaigns.

Significance. If the empirical results hold under scrutiny, the framework offers a concrete auditing method for open-source LLMs relevant to privacy-conscious malicious actors, which is timely given increasing LLM participation in online discourse. The scale of the evaluation (30+ models, multiple families and origins) and the explicit workflow for identifying effective jailbreak combinations are strengths that could support reproducible follow-up work. The focus on locally deployable models rather than API-only frontier systems aligns with realistic threat models.

major comments (1)

[Abstract] The directional claims of systematic asymmetries in political expressivity (left-leaning bias, inverse size relation, regional differences, and jailbreak variation) are presented without any information on topic selection, prompt templates, scoring criteria for 'reliably express,' statistical methods, or controls for prompt sensitivity. This absence is load-bearing because it prevents assessment of whether the measured ranges reflect model capacity or artifacts of the chosen prompts and evaluators.

minor comments (1)

The acronym OW is introduced for the invented construct 'LLM Overton Window'; the introduction should explicitly distinguish this operational definition from the classical Overton window in political science and justify the extension.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the single major comment on the abstract below.

Referee: [Abstract] The directional claims of systematic asymmetries in political expressivity (left-leaning bias, inverse size relation, regional differences, and jailbreak variation) are presented without any information on topic selection, prompt templates, scoring criteria for 'reliably express,' statistical methods, or controls for prompt sensitivity. This absence is load-bearing because it prevents assessment of whether the measured ranges reflect model capacity or artifacts of the chosen prompts and evaluators.

Authors: The abstract is intentionally concise and summarizes the key findings, consistent with standard academic practice. All requested details are provided in the main text: topic selection and prompt design in Section 3.1, jailbreak templates in Section 3.2, the definition and scoring of 'reliably express' (including inter-annotator agreement) in Section 3.3, statistical methods and controls for prompt sensitivity in Section 4, and robustness checks in Section 4.2. Readers can therefore evaluate whether the reported asymmetries reflect model behavior rather than artifacts. To address the concern directly, we will expand the abstract with one additional sentence summarizing the evaluation protocol.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical measurement study that evaluates over 30 LLMs across model families by directly prompting them on controversial topics and measuring the range of expressible political opinions (Overton Windows). The abstract and description contain no equations, fitted parameters, derivations, predictions that reduce to inputs, or self-citations invoked as load-bearing uniqueness theorems. All reported findings (asymmetries in expressivity, size effects, regional differences, jailbreak potency) are presented as outcomes of the experimental protocol rather than results that are definitionally equivalent to the inputs or prior author work. This is a standard non-circular empirical audit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Empirical measurement study; no mathematical derivations or fitted parameters are described. The central contribution is a new measurement definition rather than reliance on prior axioms or entities.

invented entities (1)

LLM Overton Window (OW) · no independent evidence
purpose: Quantify the range of political opinions an LLM can reliably express on controversial topics
Newly introduced definition to operationalize political expressivity for red-teaming.


