Confident, Calibrated, or Complicit: Safety Alignment and Ideological Bias in LLM Hate Speech Detection

Mehwish Nasim; Sanjeeevan Selvaganapathy

arXiv: 2509.00673 · v2 · submitted 2025-08-31 · cs.CL · cs.AI · cs.IR

Pith reviewed 2026-05-18 20:23 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.IR

keywords LLM hate speech detection · safety alignment · ideological bias · persona influence · model robustness · fairness disparities · calibration issues · content moderation

The pith

Censored LLMs detect hate speech more accurately and robustly than uncensored models while resisting ideological persona influence better.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares large language models with heavy safety alignment to those with minimal alignment when tasked with identifying implicit and explicit hate speech under political persona deployments. Censored models deliver superior strict accuracy at 69.0 percent against 64.1 percent for uncensored versions and demonstrate greater stability against persona-driven shifts in judgment. Uncensored models, by contrast, adapt more easily to ideological framings but at the cost of lower baseline performance. All models fail to grasp ironic expressions effectively and exhibit uneven results across different targeted groups along with overconfident self-assessments that do not match actual reliability.

Core claim

Censored models outperform their uncensored counterparts in both accuracy and robustness, achieving 69.0% versus 64.1% strict accuracy. However, this higher performance is also associated with greater resistance to persona-based influence, while uncensored models are more malleable to ideological framing. Furthermore, we identify critical failures across all models in understanding nuanced language such as irony. We also find alarming fairness disparities in performance across different targeted groups and systemic overconfidence that renders self-reported certainty unreliable. These findings challenge the notion of LLMs as objective arbiters and highlight the need for more sophisticated auditing frameworks that account for fairness, calibration, and ideological consistency.

What carries the argument

Comparison of safety-aligned (censored) and minimally-aligned (uncensored) LLMs under persona-based deployment for hate speech detection tasks.
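One way to make the persona comparison concrete is a flip rate: the share of items whose predicted label changes once a persona is applied. The sketch below is illustrative only. The function names, the label strings, the persona names, and the use of a no-persona baseline are all assumptions for the example, not details taken from the paper.

```python
from typing import Dict, List


def persona_flip_rate(baseline: List[str], persona: List[str]) -> float:
    """Fraction of items whose label differs from the baseline prediction."""
    if len(baseline) != len(persona):
        raise ValueError("prediction lists must be aligned item by item")
    if not baseline:
        raise ValueError("need at least one item")
    flips = sum(b != p for b, p in zip(baseline, persona))
    return flips / len(baseline)


def flip_rates(baseline: List[str], by_persona: Dict[str, List[str]]) -> Dict[str, float]:
    """Flip rate for every persona, keyed by persona name."""
    return {name: persona_flip_rate(baseline, preds) for name, preds in by_persona.items()}


# Toy predictions (hypothetical, not from the paper).
baseline = ["hate", "not_hate", "hate", "not_hate"]
by_persona = {
    "persona_a": ["hate", "not_hate", "hate", "hate"],
    "persona_b": ["not_hate", "not_hate", "not_hate", "hate"],
}
rates = flip_rates(baseline, by_persona)
print(rates)
```

Under this measure, a model that resists persona framing shows flip rates near zero for every persona, while a malleable model shows larger and more varied rates.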

Load-bearing premise

The chosen hate speech datasets and persona prompts sufficiently represent real-world implicit/explicit hate speech and ideological influences without introducing selection artifacts that favor one model type.

What would settle it

Evaluating the models on an independent, larger dataset of hate speech instances collected without regard to the original selection criteria, using neutral personas, and checking if the performance gap between censored and uncensored models remains.

Figures

Figures reproduced from arXiv: 2509.00673 by Mehwish Nasim, Sanjeeevan Selvaganapathy.

**Figure 2.** Breakdown of total error rate into refusal ...

**Figure 3.** Overall strict accuracy by political persona, ...

**Figure 6.** Strict classification performance ranked by ...

**Figure 7.** Error breakdown for implicit hate categories, ...

**Figure 9.** Model calibration plot comparing mean pre...

Original abstract

We investigate the efficacy of Large Language Models (LLMs) in detecting implicit and explicit hate speech, examining how models with minimal safety alignment (uncensored) compare with more heavily aligned (censored) counterparts in a deployed-model setting when deployed using political personas. While uncensored models are often framed as offering a less constrained perspective, our results reveal a trade-off: censored models outperform their uncensored counterparts in both accuracy and robustness, achieving 69.0% versus 64.1% strict accuracy. However, this higher performance is also associated with greater resistance to persona-based influence, while uncensored models are more malleable to ideological framing. Furthermore, we identify critical failures across all models in understanding nuanced language such as irony. We also find alarming fairness disparities in performance across different targeted groups and systemic overconfidence that renders self-reported certainty unreliable. These findings challenge the notion of LLMs as objective arbiters and highlight the need for more sophisticated auditing frameworks that account for fairness, calibration, and ideological consistency. Taken together, these results point to censorship-as-deployed rather than safety alignment in isolation as the more appropriate frame for interpreting model differences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Censored models show a modest accuracy edge and more resistance to persona framing in hate speech detection, but the gains are hard to attribute cleanly to safety alignment without better base-model controls.

Colleague, the main thing to know is that this paper reports censored LLMs hitting 69% strict accuracy on implicit and explicit hate speech while uncensored ones sit at 64.1%, and the censored versions also shift less when given political personas. Both groups still miss irony and show fairness gaps plus overconfidence that makes their certainty scores unreliable.

The trade-off between raw performance and ideological malleability is the clearest new angle here. It takes prior bias and safety work and puts the models into a deployed persona setup with head-to-head numbers on accuracy and robustness. That gives a practical data point for anyone thinking about content moderation pipelines. The fairness disparities across targeted groups and the calibration problems are also worth flagging for deployment decisions.

The soft spot is the causal claim. The abstract treats the models as counterparts, yet the stress-test concern lands: if the censored and uncensored versions differ in base family, size, or pre-training data, the accuracy and robustness differences could come from those factors rather than alignment level alone. Without ablations that hold base capabilities fixed, the move to frame the result as “censorship-as-deployed” stays suggestive. Dataset details, exact prompt templates, and any error bars or significance tests are thin in the summary, so the 5-point gap is hard to weigh precisely.

This is for researchers and practitioners working on LLM safety, bias auditing, or platform moderation. Someone who needs concrete numbers on alignment side-effects in a real deployment scenario will get value, even if the controls need tightening.

I would send it to peer review. The empirical comparison raises usable questions about trade-offs, and referees could require the missing ablations and transparency without killing the core observation.

Referee Report

2 major / 2 minor

Summary. The manuscript examines the performance of safety-aligned ('censored') versus minimally aligned ('uncensored') LLMs on implicit and explicit hate speech detection tasks. Using political personas, it reports that censored models achieve higher strict accuracy (69.0% vs. 64.1%) and greater robustness to persona-based ideological influence, while uncensored models are more malleable. The paper also highlights failures in detecting irony, fairness disparities across groups, and overconfidence in self-reported certainty, advocating for better auditing frameworks focused on 'censorship-as-deployed'.

Significance. If the central comparisons hold after addressing controls, the results are significant for understanding the impacts of safety alignment on model behavior in sensitive classification tasks. The work provides empirical data on accuracy-robustness trade-offs and identifies practical issues like calibration failures and bias, which are relevant to deploying LLMs for content moderation. Credit is due for the concrete performance metrics, the deployed-setting evaluation with persona prompts, and the identification of overconfidence and fairness issues.
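The overconfidence finding can be checked with two standard quantities: the gap between mean self-reported confidence and accuracy, and the expected calibration error (ECE) over confidence bins. The following is a minimal sketch, assuming each prediction comes with a confidence in [0, 1] and a correctness flag. The function names, bin count, and toy values are hypothetical and are not the paper's procedure or data.

```python
from typing import List


def overconfidence_gap(confidences: List[float], correct: List[bool]) -> float:
    """Mean confidence minus accuracy; positive values indicate overconfidence."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and aligned")
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """Size-weighted average of |mean confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and aligned")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece


# Toy example (hypothetical): a model that always reports 0.9 but is right half the time.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [True, True, False, False]
gap = overconfidence_gap(confidences, correct)
ece = expected_calibration_error(confidences, correct)
print(gap, ece)
```

A well-calibrated model would drive both numbers toward zero; a large positive gap is the pattern the summary describes as self-reported certainty that does not match reliability.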

major comments (2)

§3 (Model Selection and Experimental Design): The paper compares specific censored and uncensored models but does not report controls, matching, or ablations that hold base model family, size, pre-training data, or fine-tuning objectives fixed. This is load-bearing for the central claim, as the 69.0% vs 64.1% strict accuracy gap and greater persona resistance cannot be confidently attributed to safety alignment rather than correlated factors.
Results (performance tables and persona influence analysis): The reported accuracy figures and robustness claims lack statistical tests, error bars, or details on dataset composition, exact prompt templates, and trial counts. Without these, the 5-point accuracy difference and the differential malleability findings cannot be fully verified or generalized.
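To illustrate the kind of error bar the second comment asks for, here is a minimal sketch of a paired percentile bootstrap for the accuracy gap between two models scored on the same items. It assumes per-item correctness flags are available for both models. The function name, resample count, seed, and toy data are hypothetical and do not reproduce the paper's results.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_gap(
    correct_a: List[bool],
    correct_b: List[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed gap, lower bound, upper bound) for accuracy(A) - accuracy(B)."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and aligned item by item")
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    gaps = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each item's pair of outcomes together.
        indices = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(correct_a[i] - correct_b[i] for i in indices) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Toy data (hypothetical): model A is right on 80 of 100 items, model B on 70.
model_a = [True] * 80 + [False] * 20
model_b = [True] * 70 + [False] * 30
observed, lower, upper = bootstrap_accuracy_gap(model_a, model_b)
print(observed, lower, upper)
```

If the resulting interval excludes zero, the gap is unlikely to be a resampling artifact; it says nothing about whether alignment, rather than some correlated factor, causes the gap.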

minor comments (2)

Define 'strict accuracy' explicitly in the methods or early results section rather than assuming reader familiarity.
Introduction or Discussion: Add a short related-work paragraph citing prior studies on LLM bias in hate-speech classification to better situate the fairness and calibration findings.
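Since the summary does not define strict accuracy, the sketch below shows one plausible reading, offered purely as an assumption: a response counts as correct only if it is exactly a valid label that matches the gold label, so refusals and unparseable answers count as errors. The label set, the parsing rule, and the lenient variant are hypothetical; the paper's actual definition may differ.

```python
from typing import List, Optional

VALID_LABELS = {"hate", "not_hate"}  # hypothetical label set


def parse_label(response: str) -> Optional[str]:
    """Return the label if the response is exactly a valid label, else None."""
    text = response.strip().lower()
    return text if text in VALID_LABELS else None


def strict_accuracy(responses: List[str], gold: List[str]) -> float:
    """Correct answers over all items; refusals and unparseable output are errors."""
    if len(responses) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and aligned")
    hits = sum(parse_label(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)


def lenient_accuracy(responses: List[str], gold: List[str]) -> float:
    """Correct answers over answered items only; refusals are dropped, not penalised."""
    if len(responses) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and aligned")
    answered = [(parse_label(r), g) for r, g in zip(responses, gold) if parse_label(r)]
    if not answered:
        return 0.0
    return sum(label == g for label, g in answered) / len(answered)


# Toy responses (hypothetical): the second one is a refusal.
responses = ["hate", "I cannot help with that.", "not_hate", "hate"]
gold = ["hate", "hate", "not_hate", "not_hate"]
strict = strict_accuracy(responses, gold)
lenient = lenient_accuracy(responses, gold)
print(strict, lenient)
```

The difference between the two numbers shows why the definition matters: a model that refuses often can look fine under a lenient metric and poor under a strict one.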

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable feedback on our work examining safety alignment in LLMs for hate speech detection. We respond to each major comment below, indicating where revisions will be made to address the concerns raised.

Referee: §3 (Model Selection and Experimental Design): The paper compares specific censored and uncensored models but does not report controls, matching, or ablations that hold base model family, size, pre-training data, or fine-tuning objectives fixed. This is load-bearing for the central claim, as the 69.0% vs 64.1% strict accuracy gap and greater persona resistance cannot be confidently attributed to safety alignment rather than correlated factors.

Authors: We acknowledge the importance of isolating the effect of safety alignment. Our model selection was based on publicly available models commonly used in deployed settings, with some pairs sharing base architectures where possible. However, we recognize that perfect controls are difficult due to the nature of available models. In the revised manuscript, we will expand §3 to provide more details on model characteristics, include a table comparing model specs, and add a dedicated limitations paragraph discussing potential confounds from pre-training data and fine-tuning. We will also explore additional ablations if feasible.
revision: partial

Referee: Results (performance tables and persona influence analysis): The reported accuracy figures and robustness claims lack statistical tests, error bars, or details on dataset composition, exact prompt templates, and trial counts. Without these, the 5-point accuracy difference and the differential malleability findings cannot be fully verified or generalized.

Authors: We agree that these details are crucial for the credibility of our findings. We will revise the results section to include statistical tests (e.g., McNemar's test or bootstrap methods) for the accuracy comparisons, add error bars to performance tables and figures, and provide comprehensive details in an appendix on dataset composition, exact prompt templates, and the number of experimental trials. This will allow better verification and generalization of the results.
revision: yes
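For reference, McNemar's test as named in the response compares two classifiers on the same items by looking only at the discordant pairs, the items one model gets right and the other gets wrong. Below is a minimal sketch of the exact (binomial) version using only the standard library. The function name and toy data are hypothetical, and nothing here reflects the paper's actual counts.

```python
from math import comb
from typing import List, Tuple


def mcnemar_exact(correct_a: List[bool], correct_b: List[bool]) -> Tuple[int, int, float]:
    """Return (A-only correct, B-only correct, two-sided exact p-value)."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and aligned item by item")
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    discordant = only_a + only_b
    if discordant == 0:
        return only_a, only_b, 1.0  # the models never disagree on correctness
    smaller = min(only_a, only_b)
    # Under the null hypothesis, discordant pairs split like fair coin flips.
    tail = sum(comb(discordant, i) for i in range(smaller + 1)) / 2 ** discordant
    return only_a, only_b, min(1.0, 2 * tail)


# Toy data (hypothetical): 8 items only A gets right, 2 only B, 5 both.
model_a = [True] * 8 + [False] * 2 + [True] * 5
model_b = [False] * 8 + [True] * 2 + [True] * 5
only_a, only_b, p_value = mcnemar_exact(model_a, model_b)
print(only_a, only_b, p_value)
```

Items that both models classify correctly, or both misclassify, do not affect the p-value, which is why the test suits paired evaluations on a shared test set.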

Circularity Check

0 steps flagged

No circularity: empirical results from direct model evaluations

The paper reports experimental outcomes from running LLMs on hate speech detection benchmarks with and without persona prompts. The key numbers (69.0% vs 64.1% strict accuracy) are measured performance metrics obtained from the authors' test runs, not quantities derived from equations, fitted parameters, or self-referential definitions inside the paper. No derivation chain exists that reduces a claimed result to its own inputs by construction. Any self-citations are incidental and do not carry the central empirical claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard machine-learning evaluation assumptions that the test sets capture representative hate speech distributions and that persona prompts isolate ideological influence without confounding model capability.

axioms (1)

Domain assumption: The selected hate speech datasets and evaluation metrics (strict accuracy) are appropriate proxies for real-world detection performance. Invoked when reporting the 69.0% versus 64.1% figures as evidence of outperformance.

pith-pipeline@v0.9.0 · 5745 in / 1132 out tokens · 43490 ms · 2026-05-18T20:23:34.802810+00:00 · methodology

