From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation
Pith reviewed 2026-06-28 01:49 UTC · model grok-4.3
The pith
Vicarious prompting with Llama 3.1 best matches human patterns of disagreement across demographic groups when annotating hate speech.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
No model consistently captures all three dimensions of human social judgement in hate speech annotation, and performance is highly model-dependent and does not emerge reliably from minimal identity prompts alone. However, vicarious prompting with Llama 3.1 yields the highest cross-group agreement in most demographic axes and provides the closest overall approximation to human disagreement patterns, indicating that this configuration may provide a more reliable setting for automatic annotation aligned with human judgements.
What carries the argument
Three evaluation criteria applied to persona-conditioned LLMs: inter-group disagreement, in-group sensitivity, and vicarious prediction.
If this is right
- Model choice and prompting strategy both matter for producing LLM annotations that track human demographic differences.
- Minimal identity prompts alone do not produce reliable perspective-taking in LLMs.
- Vicarious prompting offers a stronger method than standard persona adoption for aligning automatic annotations with human disagreement patterns.
- Findings are specific to the models and demographic axes tested and do not hold uniformly across all LLMs.
Where Pith is reading between the lines
- Asking a model to predict another group's reaction may draw on broader training data more effectively than asking it to role-play that group directly.
- The same prompting distinction could be tested on other subjective labeling tasks such as political content moderation or cultural offense ratings.
- If vicarious methods prove more robust, annotation pipelines might shift away from identity-adoption prompts toward explicit prediction prompts.
Load-bearing premise
The three measured dimensions of inter-group disagreement, in-group sensitivity, and vicarious prediction are sufficient and valid proxies for whether an LLM has captured human demographic perspective-taking.
What would settle it
Collecting fresh annotations from the same demographic groups and finding that their actual disagreement patterns differ substantially from the outputs produced by vicarious prompting in Llama 3.1.
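One way such a check could be scored is sketched below: compare the model's vicarious predictions against the fresh human labels with a chance-corrected agreement statistic. Cohen's kappa is used here only as a common choice; this page does not say which agreement measure the paper reports, and the function name and toy labels are hypothetical.

```python
from typing import Sequence


def cohens_kappa(human: Sequence[bool], model: Sequence[bool]) -> float:
    """Cohen's kappa for two binary label sequences over the same items."""
    if len(human) != len(model) or not human:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: share of items where both labels match.
    observed = sum(h == m for h, m in zip(human, model)) / n
    # Chance agreement from each rater's marginal rate of "hate" labels.
    p_human = sum(human) / n
    p_model = sum(model) / n
    expected = p_human * p_model + (1 - p_human) * (1 - p_model)
    if expected == 1.0:
        # Degenerate case: both raters gave the same constant label.
        # Kappa is undefined here; returning 1.0 is an assumption of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels invented for illustration (True = "hate speech").
fresh_human_labels = [True, True, False, False]
vicarious_predictions = [True, False, False, False]
kappa = cohens_kappa(fresh_human_labels, vicarious_predictions)
```

A kappa near zero on fresh annotations would indicate that the vicarious outputs agree with the group no better than chance, which is the kind of result that would undercut the claim.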
Original abstract
Hate speech detection is inherently subjective: people from different demographic groups perceive the same content very differently. Collecting enough annotations from multiple demographic groups is costly and difficult to scale. Persona-conditioned Large Language Models (models prompted to adopt a specific demographic identity) have been proposed as a way to simulate diverse perspectives at scale. But do they actually reflect how different groups disagree? We evaluate three aspects of human social judgement: (i) whether personas from different groups disagree in human-like ways (inter-group disagreement), (ii) whether they become more sensitive when content targets their own identity (in-group sensitivity), and (iii) whether they can accurately predict how another group would react (vicarious prediction). Our results show that no model consistently captures all three dimensions, and performance is highly model-dependent and does not emerge reliably from minimal identity prompts alone. However, vicarious prompting with Llama 3.1 yields the highest cross-group agreement in most demographic axes and provides the closest overall approximation to human disagreement patterns, indicating that this configuration may provide a more reliable setting for automatic annotation aligned with human judgements.
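As a rough illustration of the first two criteria, the sketch below computes an inter-group disagreement rate and an in-group sensitivity gap from binary labels. The data layout, function names, and the formulas themselves are assumptions made for this example; the abstract does not state how the paper operationalises either measure.

```python
from typing import Dict, List

# Hypothetical layout: labels[group][i] is True if that group (or persona)
# marked item i as hate speech. This is not the paper's data format.
Labels = Dict[str, List[bool]]


def disagreement_rate(labels: Labels, group_a: str, group_b: str) -> float:
    """Fraction of items on which two groups give different labels."""
    a, b = labels[group_a], labels[group_b]
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def in_group_sensitivity(flags: List[bool], targets_own_group: List[bool]) -> float:
    """Hate rate on items targeting the rater's group minus the rate on other items."""
    own = [f for f, t in zip(flags, targets_own_group) if t]
    other = [f for f, t in zip(flags, targets_own_group) if not t]
    if not own or not other:
        raise ValueError("need at least one item in each subset")
    return sum(own) / len(own) - sum(other) / len(other)


# Toy data invented for illustration only.
toy = {
    "women": [True, True, False, True],
    "men": [True, False, False, False],
}
targets_women = [True, True, False, False]

inter_group = disagreement_rate(toy, "women", "men")
sensitivity = in_group_sensitivity(toy["women"], targets_women)
```

Running the same two functions on human labels and on persona-conditioned model labels would give paired numbers that can be compared per demographic axis.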
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates persona-conditioned LLMs for simulating demographic perspectives in hate speech annotation via three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. It concludes that no model captures all three consistently and that performance is model-dependent, but vicarious prompting with Llama 3.1 yields the highest cross-group agreement in most axes and the closest overall match to human disagreement patterns.
Significance. If the empirical comparisons hold after addressing methodological gaps, the work would usefully demonstrate the limitations of minimal persona prompts and the relative advantage of vicarious prompting for approximating human annotation variability. This could inform scalable alternatives to multi-demographic human labeling, though the absence of validation for the chosen proxies weakens the link to actual perspective-taking.
major comments (2)
- [Abstract] The abstract states comparative results but supplies no information on dataset size, statistical tests, model versions, prompt templates, or human annotation protocol. This prevents verification of whether the reported differences support the central claim about Llama 3.1 vicarious prompting.
- [Abstract / Evaluation criteria] The three dimensions, as introduced in the abstract and methods, are treated as jointly sufficient proxies for demographic perspective-taking, with no reported evidence that they correlate with direct human measures such as explicit other-group prediction tasks or self-reported sensitivity. If they instead reflect prompt artifacts, the model ranking and the 'closest overall approximation' conclusion do not follow.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and the evaluation criteria. We address each major comment below and will revise the manuscript to improve clarity and address methodological concerns where possible.
- Referee: [Abstract] The abstract states comparative results but supplies no information on dataset size, statistical tests, model versions, prompt templates, or human annotation protocol, preventing verification of whether the reported differences support the central claim about Llama 3.1 vicarious prompting.
  Authors: We agree that the abstract would benefit from additional methodological details to support verification of the claims. In the revised version, we will expand the abstract to include the dataset size, the specific model versions evaluated, a summary of the statistical tests used, key elements of the prompt templates, and a brief description of the human annotation protocol.
  Revision: yes
- Referee: [Abstract / Evaluation criteria] The three dimensions, as introduced in the abstract and methods, are treated as jointly sufficient proxies for demographic perspective-taking without reported evidence that they correlate with direct human measures such as explicit other-group prediction tasks or self-reported sensitivity; if they instead reflect prompt artifacts, the model ranking and 'closest overall approximation' conclusion do not follow.
  Authors: These three dimensions are motivated by established findings in the hate speech annotation and social psychology literature on inter-group disagreement, in-group bias, and perspective-taking. We will revise the methods and discussion sections to provide a more explicit theoretical justification drawn from prior work and to acknowledge the lack of direct correlation experiments with other human measures as a limitation of the current study. We do not claim the dimensions are jointly sufficient without further validation, but argue that they offer a useful framework for assessing whether LLMs capture key patterns of human demographic variation in this domain.
  Revision: partial
Circularity Check
No circularity: empirical comparison to external human labels
The paper defines three evaluation dimensions (inter-group disagreement, in-group sensitivity, vicarious prediction) as proxies and reports direct empirical comparisons of LLM outputs against held-out human annotations. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on observable agreement metrics against external data rather than any self-referential reduction.
Appendix excerpts: prompt templates and tables ("..." marks text cut off in the extract)

Non-Persona: “Determine whether the following text contains hate speech or not. Respond with either True for hate speech or False for neutral.”

A.2.2 Self-Alignment
The model adopts the identity of demographic group A and rates the comment from that perspective ...

Dem. Axis | Group A | Group B | #
Gender | Women | Men | 1,697
Race | White | Black | 213
Religion | Christian | Athe...

- Minimal Identity: “Adopt the identity of [Demographic A]. Answer the questions while staying in strict accordance with the nature of this identity. Determine whether the following text contains hate speech or not based on your experiences and background. Respond with either True for hate speech or False for neutral.”
- Contextual Identity: “You are acting as a person who identifies as [Demographic A]. This identity is a central part of your lived experience, shaping your values, social perspective, and sensitivity to language. When evaluating the text below, draw upon the cultural nuances and historical context associated with being [Demographic A]. Determine if the fol...
- Biographical Identity: “You are a [Demographic A]. You have grown ...

Demographic | Observer→Out-group | # LLM preds | # Out-group | # Global
Gender | Men→Women | 21,846 | 7,414 | 29,339
Gender | Women→Men | 25,092 | 5,749 | 29,339
Race | White→Black | 27,791 | 470 | 29,304
Race | Black→White | 6,874 | 3,173 | 29,304
Religion | Christian→Atheist | 21,704 | 1,316 | 26,601
Religion | Atheist→Christian | 12,711 | 2,472 | 26,601
Ideology | Libe...

Prompts asking the persona to predict another group's reaction:
- Minimal Identity: “Adopt the identity of [Demographic A]. Your task is not to give your own opinion, but to predict the perception of another person. Based on your understanding of social dynamics, predict whether [Demographic B] would consider the following text to contain hate speech. Respond with either True (if they would see it as hate) or False (...
- Contextual Identity: “You are acting as [Demographic A]. Your task is to apply your social intelligence and understanding of diverse perspectives to predict how [Demographic B] would feel. Consider the unique sensitivities, lived experiences, and common social reactions of [Demographic B] when they encounter harmful language. Predict whether they would co...
- Biographical Identity: “You are a [Demographic A]. You have a grounded understanding of your own social position and, through years of observation, conversation, and shared experience, you have developed a genuine awareness of how people from different backgrounds perceive and react to harmful language. You know that what feels neutral or unremarkable to ...
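To make the placeholder convention in the prompts above concrete, here is a minimal sketch of how a template might be filled and a True/False reply parsed. Only the two prompts quoted in full above are included. The helper names, the position of the comment text in the prompt, and the parsing rule are assumptions for illustration, not the paper's pipeline.

```python
import re
from typing import Optional

# Templates copied from the excerpts above. Truncated prompts are left out
# on purpose rather than completed by guesswork.
NON_PERSONA = (
    "Determine whether the following text contains hate speech or not. "
    "Respond with either True for hate speech or False for neutral."
)
MINIMAL_SELF = (
    "Adopt the identity of [Demographic A]. Answer the questions while "
    "staying in strict accordance with the nature of this identity. "
    "Determine whether the following text contains hate speech or not "
    "based on your experiences and background. Respond with either True "
    "for hate speech or False for neutral."
)


def build_prompt(template: str, text: str, group_a: Optional[str] = None) -> str:
    """Fill the [Demographic A] placeholder and append the comment to rate."""
    prompt = template
    if group_a is not None:
        prompt = prompt.replace("[Demographic A]", group_a)
    if "[Demographic" in prompt:
        raise ValueError("unfilled demographic placeholder in template")
    # Assumption: the comment follows the instruction; the excerpts do not
    # show where the text is actually placed.
    return f"{prompt}\n\nText: {text}"


def parse_label(response: str) -> Optional[bool]:
    """Return the first True/False word in a reply as a bool, else None."""
    match = re.search(r"\b(true|false)\b", response, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "true"


example_prompt = build_prompt(MINIMAL_SELF, "example comment", group_a="Women")
```

Returning None for replies that contain neither word keeps refusals and off-format answers separate from genuine "False" labels, which matters when comparing hate rates across personas.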