From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation

Javier Parapar; Paloma Piot

arxiv: 2606.06266 · v1 · pith:MJEQGWA5new · submitted 2026-06-04 · 💻 cs.CL

Pith reviewed 2026-06-28 01:49 UTC · model grok-4.3

keywords hate speech detection · LLM annotation · demographic perspective-taking · persona prompting · vicarious prediction · inter-group disagreement · in-group sensitivity

The pith

Vicarious prompting with Llama 3.1 best matches human patterns of disagreement across demographic groups when annotating hate speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether prompting LLMs to adopt demographic personas can reliably simulate how different groups perceive and disagree on hate speech. It measures three specific aspects of human judgement: whether personas from different groups disagree in human-like ways, whether they show greater sensitivity to content targeting their own identity, and whether they can predict how another group would react. No prompting approach or model succeeds consistently across all three measures, and results depend heavily on the specific LLM used. Vicarious prompting, in which the model is asked to predict another group's reaction, performs best with Llama 3.1 and comes closest overall to observed human disagreement patterns. This matters for any effort to scale subjective annotation tasks without large numbers of human labelers from each demographic group.

Core claim

No model consistently captures all three dimensions of human social judgement in hate speech annotation, and performance is highly model-dependent and does not emerge reliably from minimal identity prompts alone. However, vicarious prompting with Llama 3.1 yields the highest cross-group agreement in most demographic axes and provides the closest overall approximation to human disagreement patterns, indicating that this configuration may provide a more reliable setting for automatic annotation aligned with human judgements.

What carries the argument

Three evaluation criteria applied to persona-conditioned LLMs: inter-group disagreement, in-group sensitivity, and vicarious prediction.
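
As a rough illustration of how these three criteria could be scored, the Python sketch below computes one plausible statistic for each on binary hate/neutral labels. This summary does not state which metrics the paper uses, so Cohen's kappa for inter-group agreement, a difference in hate-label rates for in-group sensitivity, and plain accuracy for vicarious prediction are all assumptions, as are the function names and the toy data.

```python
from typing import Sequence


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary label sequences.

    Assumed metric: the paper's actual agreement statistic is not given here.
    """
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both sequences are constant and identical: treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def in_group_sensitivity(labels: Sequence[bool],
                         targets_own_group: Sequence[bool]) -> float:
    """Hate-label rate on items targeting the rater's group minus the rate elsewhere."""
    own = [lab for lab, tgt in zip(labels, targets_own_group) if tgt]
    other = [lab for lab, tgt in zip(labels, targets_own_group) if not tgt]
    if not own or not other:
        raise ValueError("need both in-group-targeted and other items")
    return sum(own) / len(own) - sum(other) / len(other)


def vicarious_accuracy(predicted: Sequence[bool],
                       actual: Sequence[bool]) -> float:
    """Share of items where the predicted out-group label matches the real one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Toy labels (True = hate speech); not taken from the paper.
    women = [True, True, False, True, False, False]
    men = [True, False, False, True, False, True]
    targets_women = [True, True, False, False, False, False]
    men_predicting_women = [True, True, False, False, False, False]

    print("inter-group kappa:", cohen_kappa(women, men))
    print("in-group sensitivity:", in_group_sensitivity(women, targets_women))
    print("vicarious accuracy:", vicarious_accuracy(men_predicting_women, women))
```

Each function can be run once on human labels and once on persona-conditioned model labels; how close the two sets of numbers are is then the quantity of interest.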

If this is right

Model choice and prompting strategy both matter for producing LLM annotations that track human demographic differences.
Minimal identity prompts alone do not produce reliable perspective-taking in LLMs.
Vicarious prompting offers a stronger method than standard persona adoption for aligning automatic annotations with human disagreement patterns.
Findings are specific to the models and demographic axes tested and do not hold uniformly across all LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Asking a model to predict another group's reaction may draw on broader training data more effectively than asking it to role-play that group directly.
The same prompting distinction could be tested on other subjective labeling tasks such as political content moderation or cultural offense ratings.
If vicarious methods prove more robust, annotation pipelines might shift away from identity-adoption prompts toward explicit prediction prompts.

Load-bearing premise

The three measured dimensions of inter-group disagreement, in-group sensitivity, and vicarious prediction are sufficient and valid proxies for whether an LLM has captured human demographic perspective-taking.

What would settle it

Collecting fresh annotations from the same demographic groups and finding that their actual disagreement patterns differ substantially from the outputs produced by vicarious prompting in Llama 3.1.
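
One way such a check could be operationalised is sketched below: compute, per demographic axis, how often the two groups disagree, then measure how far the model's rates sit from the human ones. The mean-absolute-gap summary, the function names, and the example rates are illustrative assumptions, not the paper's procedure.

```python
from typing import Dict, Sequence


def disagreement_rate(group_a: Sequence[bool], group_b: Sequence[bool]) -> float:
    """Share of items on which the two groups' labels differ."""
    if len(group_a) != len(group_b) or not group_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(group_a, group_b)) / len(group_a)


def pattern_gap(human: Dict[str, float], model: Dict[str, float]) -> float:
    """Mean absolute difference between human and model disagreement rates.

    Only axes present in both dictionaries are compared.
    """
    axes = sorted(set(human) & set(model))
    if not axes:
        raise ValueError("no shared demographic axes to compare")
    return sum(abs(human[axis] - model[axis]) for axis in axes) / len(axes)


if __name__ == "__main__":
    # Hypothetical per-axis disagreement rates; not results from the paper.
    human_rates = {"gender": 0.30, "race": 0.40}
    model_rates = {"gender": 0.25, "race": 0.50}
    print("pattern gap:", pattern_gap(human_rates, model_rates))
```

A large gap on fresh human annotations would count against the claim; what threshold counts as "substantial" is left open here, as it is in the text above.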

Original abstract

Hate speech detection is inherently subjective: people from different demographic groups perceive the same content very differently. Collecting enough annotations from multiple demographic groups is costly and difficult to scale. Persona-conditioned Large Language Models (models prompted to adopt a specific demographic identity) have been proposed as a way to simulate diverse perspectives at scale. But do they actually reflect how different groups disagree? We evaluate three aspects of human social judgement: (i) whether personas from different groups disagree in human-like ways (inter-group disagreement), (ii) whether they become more sensitive when content targets their own identity (in-group sensitivity), and (iii) whether they can accurately predict how another group would react (vicarious prediction). Our results show that no model consistently captures all three dimensions, and performance is highly model-dependent and does not emerge reliably from minimal identity prompts alone. However, vicarious prompting with Llama 3.1 yields the highest cross-group agreement in most demographic axes and provides the closest overall approximation to human disagreement patterns, indicating that this configuration may provide a more reliable setting for automatic annotation aligned with human judgements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Vicarious prompting with Llama 3.1 approximates human hate speech disagreement patterns best among the tested methods, though the proxies used require further validation.

The one thing to know is that vicarious prompting on Llama 3.1 comes out ahead in approximating human disagreement patterns on hate speech, but the evaluation hinges on three proxies that aren't checked against direct human perspective measures.

The paper takes persona prompting and tests it on three axes for hate speech: whether different demographic personas disagree like humans do, whether they show more sensitivity to content targeting their group, and whether they can predict another group's reactions. This is a useful breakdown, and the results show that no method nails all three and that performance varies by model. The authors do well in demonstrating that minimal identity prompts alone aren't reliable and in identifying a stronger configuration.

The main soft spot is the lack of validation for those three dimensions as proxies. The paper sets them up as the evaluation criteria without reporting correlations to human self-reports or explicit prediction tasks. If the metrics are picking up prompt effects instead, the claim that Llama 3.1 provides the closest approximation doesn't hold up. Dataset and statistical details would help gauge the practical significance too.

This work is for researchers and practitioners in applied NLP who need scalable ways to handle subjective annotations like hate speech detection. Readers focused on fairness in content moderation systems would find the model comparisons relevant.

The paper shows clear engagement with the problem and prior work on LLM personas. It deserves a serious referee because the question is timely and the empirical test is concrete, even if the proxy validity needs more attention.

I recommend sending it to peer review.

Referee Report

2 major / 0 minor

Summary. The paper evaluates persona-conditioned LLMs for simulating demographic perspectives in hate speech annotation via three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. It concludes that no model captures all three consistently and that performance is model-dependent, but vicarious prompting with Llama 3.1 yields the highest cross-group agreement in most axes and the closest overall match to human disagreement patterns.

Significance. If the empirical comparisons hold after addressing methodological gaps, the work would usefully demonstrate the limitations of minimal persona prompts and the relative advantage of vicarious prompting for approximating human annotation variability. This could inform scalable alternatives to multi-demographic human labeling, though the absence of validation for the chosen proxies weakens the link to actual perspective-taking.

major comments (2)

[Abstract] Abstract: The abstract states comparative results but supplies no information on dataset size, statistical tests, model versions, prompt templates, or human annotation protocol, preventing verification of whether the reported differences support the central claim about Llama 3.1 vicarious prompting.
[Abstract / Evaluation criteria] Evaluation criteria (as introduced in the abstract and methods): The three dimensions are treated as jointly sufficient proxies for demographic perspective-taking without reported evidence that they correlate with direct human measures such as explicit other-group prediction tasks or self-reported sensitivity; if they instead reflect prompt artifacts, the model ranking and 'closest overall approximation' conclusion do not follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the evaluation criteria. We address each major comment below and will revise the manuscript to improve clarity and address methodological concerns where possible.

Referee: [Abstract] Abstract: The abstract states comparative results but supplies no information on dataset size, statistical tests, model versions, prompt templates, or human annotation protocol, preventing verification of whether the reported differences support the central claim about Llama 3.1 vicarious prompting.

Authors: We agree that the abstract would benefit from additional methodological details to support verification of the claims. In the revised version, we will expand the abstract to include the dataset size, the specific model versions evaluated, a summary of the statistical tests used, key elements of the prompt templates, and a brief description of the human annotation protocol. revision: yes

Referee: [Abstract / Evaluation criteria] Evaluation criteria (as introduced in the abstract and methods): The three dimensions are treated as jointly sufficient proxies for demographic perspective-taking without reported evidence that they correlate with direct human measures such as explicit other-group prediction tasks or self-reported sensitivity; if they instead reflect prompt artifacts, the model ranking and 'closest overall approximation' conclusion do not follow.

Authors: These three dimensions are motivated by established findings in the hate speech annotation and social psychology literature on inter-group disagreement, in-group bias, and perspective-taking. We will revise the methods and discussion sections to provide a more explicit theoretical justification drawn from prior work and to acknowledge the lack of direct correlation experiments with other human measures as a limitation of the current study. We do not claim the dimensions are jointly sufficient without further validation, but argue that they offer a useful framework for assessing whether LLMs capture key patterns of human demographic variation in this domain. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison to external human labels

The paper defines three evaluation dimensions (inter-group disagreement, in-group sensitivity, vicarious prediction) as proxies and reports direct empirical comparisons of LLM outputs against held-out human annotations. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on observable agreement metrics against external data rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study; the central claim rests on experimental comparisons rather than on mathematical axioms, fitted parameters, or newly postulated entities.

Reference graph

Works this paper leans on

12 extracted references · 3 canonical work pages · 2 internal anchors

[1]

Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, and 29 others. 2023. Qwen technical report....

[2]

In Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024, pages 96–104, Torino, Italia

Content moderation in online platforms: A study of annotation methods for inappropriate language. In Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024, pages 96–104, Torino, Italia. ELRA and ICCL. Valerio Basile. 2020. It’s the end of the gold standard as we know it: Leveraging non-aggregated data for better eva...

[3]

The Llama 3 Herd of Models

Chatgpt outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30). Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, ..., and Zhiyu Ma. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783. Shashank Gupta, Vaishnavi Shrivastava, Ame...

[4]

In Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 681–695, Cham

Handling disagreement in hate speech modelling. In Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 681–695, Cham. Springer International Publishing. Mistral AI and NVIDIA. 2024. Mistral nemo. Accessed: 2024-07-18. Emanuele Moscato, Tiancheng Hu, Matthias Orlikowski, Paul Röttger, and Debora Nozza. 2025. Per- son...

2024
[5]

Annot.” refers to the total number of labels provided by annotators of that group. “Comments

Divine LLaMAs: Bias, stereotypes, stigmatization, and emotion representation of religion in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4346–4366, Miami, Florida, USA. Association for Computational Linguistics. Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and ...

2024
[6]

Determine whether the following text contains hate speech or not. Respond with either True for hate speech or False for neutral

Non-Persona: “Determine whether the following text contains hate speech or not. Respond with either True for hate speech or False for neutral.”
A.2.2 Self-Alignment
The model adopts the identity of demographic group A and rates the comment from that perspec-

Dem. Axis   Group A     Group B   #
Gender      Women       Men       1,697
Race        White       Black     213
Religion    Christian   Athe...
[7]

Answer the questions while staying in strict accordance with the nature of this identity

Minimal Identity: “Adopt the identity of [Demographic A]. Answer the questions while staying in strict accordance with the nature of this identity. Determine whether the following text contains hate speech or not based on your experiences and background. Respond with either True for hate speech or False for neutral.”
[8]

This identity is a central part of your lived experience, shaping your values, social perspective, and sensitivity to language

Contextual Identity: “You are acting as a person who identifies as [Demographic A]. This identity is a central part of your lived experience, shaping your values, social perspective, and sensitivity to language. When evaluating the text below, draw upon the cultural nuances and historical context associated with being [Demographic A]. Determine if the fol...
[9]

# Out-group

Biographical Identity: “You are a [Demographic A]. You have grown

Demographic   Observer→Out-group   # LLM preds   # Out-group   # Global
Gender        Men→Women            21,846        7,414         29,339
Gender        Women→Men            25,092        5,749         29,339
Race          White→Black          27,791        470           29,304
Race          Black→White          6,874         3,173         29,304
Religion      Christian→Atheist    21,704        1,316         26,601
Religion      Atheist→Christian    12,711        2,472         26,601
Ideology      Libe...
[10]

Your task is not to give your own opinion, but to predict the perception of another person

Minimal Identity: “Adopt the identity of [Demographic A]. Your task is not to give your own opinion, but to predict the perception of another person. Based on your understanding of social dynamics, predict whether [Demographic B] would consider the following text to contain hate speech. Respond with either True (if they would see it as hate) or False (...
[11]

Your task is to apply your social intelligence and understanding of diverse perspectives to predict how [Demographic B] would feel

Contextual Identity: “You are acting as [Demographic A]. Your task is to apply your social intelligence and understanding of diverse perspectives to predict how [Demographic B] would feel. Consider the unique sensitivities, lived experiences, and common social reactions of [Demographic B] when they encounter harmful language. Predict whether they would co...
[12]

richness

Biographical Identity: “You are a [Demographic A]. You have a grounded understanding of your own social position and, through years of observation, conversation, and shared experience, you have developed a genuine awareness of how people from different backgrounds perceive and react to harmful language. You know that what feels neutral or unremarkable to ...
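
The Minimal Identity self-alignment template quoted in entry [7] is the only persona template reproduced here in full, so it can serve as a small worked example of how such a prompt might be filled in and how a True/False reply might be read back. The sketch below is an assumption-laden illustration: the helper names, the way the comment text is appended after the instruction, and the lenient parsing rule are hypothetical and not described in the extracted text. No model is called.

```python
from typing import Optional

# Template text as quoted in entry [7]; "[Demographic A]" is the paper's placeholder.
MINIMAL_IDENTITY = (
    "Adopt the identity of [Demographic A]. Answer the questions while staying "
    "in strict accordance with the nature of this identity. Determine whether "
    "the following text contains hate speech or not based on your experiences "
    "and background. Respond with either True for hate speech or False for neutral."
)


def build_prompt(template: str, demographic_a: str, comment: str) -> str:
    """Fill the persona placeholder and append the comment to be rated.

    The "Text:" separator is an assumption; the extracted snippets do not show
    how the comment is attached to the instruction.
    """
    instruction = template.replace("[Demographic A]", demographic_a)
    return f"{instruction}\n\nText: {comment}"


def parse_label(response: str) -> Optional[bool]:
    """Map a model reply to True/False, or None if it is neither."""
    token = response.strip().strip(".\"'").lower()
    if token.startswith("true"):
        return True
    if token.startswith("false"):
        return False
    return None


if __name__ == "__main__":
    prompt = build_prompt(MINIMAL_IDENTITY, "Women", "an example comment")
    print(prompt)
    print(parse_label("True."))
```

Returning None for unparseable replies keeps refusals and off-format answers out of the agreement statistics instead of silently counting them as one class.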
