Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science
Pith reviewed 2026-06-30 22:27 UTC · model grok-4.3
The pith
A model that matches aggregate label statistics can still reverse the substantive empirical conclusions a researcher would draw from the data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Three open-source 7B models exhibit distinct social-desirability biases when labeling TweetEval data. Zephyr shows leniency bias, with high false benign rates on offensive and hate content, while Mistral and Qwen show overcorrection, with elevated false alarm rates. All three display neutrality bias on abortion stance, underestimating the prevalence of opposition. None of the tested prompting interventions, including safety framing and chain-of-thought, corrects these failures across models. Zephyr matches the gold hate-speech prevalence exactly through large opposing errors in both directions, showing that a correct aggregate prevalence does not guarantee correct class-conditional labels.
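The cancellation can be made explicit with a short derivation. This is a sketch under assumed definitions, since the summary does not spell them out: FBR is taken to be the share of gold-harmful items labeled benign, FAR the share of gold-benign items labeled harmful, and π (a symbol introduced here) the gold prevalence of the harmful class. The numbers in the last line are hypothetical and are not taken from the paper.

```latex
\begin{align*}
\hat{\pi} &= \pi\,(1 - \mathrm{FBR}) + (1 - \pi)\,\mathrm{FAR} \\
\hat{\pi} - \pi &= (1 - \pi)\,\mathrm{FAR} - \pi\,\mathrm{FBR} \\
\hat{\pi} = \pi \;&\Longleftrightarrow\; \pi\,\mathrm{FBR} = (1 - \pi)\,\mathrm{FAR} \\[4pt]
\text{e.g. } \pi = 0.4,\ \mathrm{FBR} = 0.3,\ \mathrm{FAR} = 0.2:\quad
\hat{\pi} &= 0.4 \cdot 0.7 + 0.6 \cdot 0.2 = 0.40
\end{align*}
```

Under these assumptions the predicted prevalence equals the gold prevalence whenever the two error masses balance, however large each one is.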
What carries the argument
A three-part taxonomy of leniency bias, overcorrection, and neutrality bias, diagnosed through false benign rate (FBR) and false alarm rate (FAR) signatures on gold-labeled samples.
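As an illustration of the two diagnostic rates, here is a minimal sketch for a binary task. It assumes labels are encoded as 1 = harmful and 0 = benign, and that FBR and FAR correspond to the false negative and false positive rates of the harmful class; the function name `fbr_far` and the toy data are hypothetical, not the paper's code.

```python
from typing import Sequence

def fbr_far(gold: Sequence[int], pred: Sequence[int]) -> tuple[float, float]:
    """Return (FBR, FAR) for binary labels where 1 = harmful, 0 = benign.

    FBR: share of gold-harmful items the model labeled benign (assumed definition).
    FAR: share of gold-benign items the model labeled harmful (assumed definition).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    harmful = [p for g, p in zip(gold, pred) if g == 1]
    benign = [p for g, p in zip(gold, pred) if g == 0]
    if not harmful or not benign:
        raise ValueError("need at least one gold item of each class")
    fbr = sum(1 for p in harmful if p == 0) / len(harmful)
    far = sum(1 for p in benign if p == 1) / len(benign)
    return fbr, far

# Toy data (hypothetical): 4 harmful and 6 benign gold items.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
fbr, far = fbr_far(gold, pred)
print(f"FBR={fbr:.3f} FAR={far:.3f}")
```

In the taxonomy's terms, a high FBR paired with a low FAR would read as a leniency signature, and the reverse pattern as overcorrection.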
If this is right
- Researchers cannot treat aggregate accuracy or prevalence match as sufficient validation when using LLMs to estimate label distributions in CSS.
- Common prompting interventions fail to eliminate social-desirability distortions across different base models.
- Stance detection tasks are especially prone to neutrality bias that compresses the reported prevalence of opposing views.
- A lightweight protocol that computes FBR and FAR on a small gold sample is needed before trusting LLM-derived empirical claims.
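The summary does not give the protocol's steps, so the following is only a sketch of what such a gold-sample check might look like. It assumes binary labels (1 = harmful, 0 = benign); the `audit` function, the `max_rate` threshold of 0.10, and the toy sample are hypothetical choices. Wilson score intervals are added because small gold samples give noisy rate estimates.

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n."""
    if n == 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def audit(gold, pred, max_rate=0.10):
    """Summarise a gold sample; max_rate is a hypothetical acceptance threshold."""
    n_pos = sum(gold)
    n_neg = len(gold) - n_pos
    false_benign = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    false_alarm = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fbr, far = false_benign / n_pos, false_alarm / n_neg
    return {
        # Predicted minus gold prevalence of the harmful class.
        "prevalence_gap": sum(pred) / len(pred) - n_pos / len(gold),
        "fbr": fbr, "fbr_ci": wilson_interval(false_benign, n_pos),
        "far": far, "far_ci": wilson_interval(false_alarm, n_neg),
        # Pass only if BOTH class-conditional rates are low.
        "passes": fbr <= max_rate and far <= max_rate,
    }

# Hypothetical sample: prevalence matches exactly, yet both error rates are large.
gold = [1] * 10 + [0] * 30
pred = [1] * 4 + [0] * 6 + [1] * 6 + [0] * 24
report = audit(gold, pred)
print(report)
```

The toy sample is built so that the prevalence gap is zero while the check still fails, which is the situation the bullet above warns about.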
Where Pith is reading between the lines
- CSS papers using LLM annotations may need to report full confusion matrices rather than summary metrics alone.
- The observed cancellation effect suggests that similar audits should be run on other social-media or forum datasets to check whether the same bias signatures appear.
- Model-specific correction factors derived from small gold sets could be tested as a practical mitigation before large-scale annotation.
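One way such a correction factor could be tested is the standard misclassification adjustment for prevalence (known as the Rogan-Gladen estimator in epidemiology). This is not the paper's method; it is a sketch assuming a binary task and assuming that FBR and FAR measured on the gold sample carry over to the unlabeled data. The function name and example numbers are hypothetical.

```python
def corrected_prevalence(observed: float, fbr: float, far: float) -> float:
    """Invert observed = pi * (1 - FBR) + (1 - pi) * FAR to recover pi.

    observed: share of items the model labeled harmful.
    fbr, far: error rates estimated on a small gold sample (assumed stable).
    """
    denom = 1.0 - fbr - far
    if abs(denom) < 1e-9:
        raise ValueError("FBR + FAR is too close to 1; the correction is undefined")
    estimate = (observed - far) / denom
    # Sampling noise can push the estimate outside [0, 1]; clip it.
    return min(1.0, max(0.0, estimate))

# Hypothetical check: true prevalence 0.30 with FBR = 0.5 and FAR = 0.1
# yields an observed rate of 0.30 * 0.5 + 0.70 * 0.1 = 0.22.
print(corrected_prevalence(0.22, fbr=0.5, far=0.1))
```

The adjustment repairs only the aggregate estimate. It does not fix individual labels, so analyses that depend on which items were labeled harmful would still inherit the class-conditional errors.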
Load-bearing premise
The TweetEval gold labels themselves are free of social-desirability artifacts and the six tasks represent the annotation workloads typical in computational social science.
What would settle it
A model or prompt that simultaneously drives both FBR and FAR near zero on the offensive-language and abortion-stance tasks, so that the resulting prevalence estimates deviate by only a few percentage points from the gold distribution.
Original abstract
LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) across six TweetEval tasks under four prompt conditions (72 cells) and find that social-desirability failures do not run in a single direction. Zephyr exhibits leniency bias, systematically under-applying harmful labels (offensive language: false benign rate 0.729, false alarm rate 0.031). Mistral and Qwen exhibit overcorrection, over-applying the same labels (Mistral hate-speech FAR = 0.604). All three models exhibit neutrality bias on abortion stance, underestimating opposition prevalence by 24 to 40 percentage points and inflating the neutral label. None of the four prompting interventions we test (neutral, safety framing, depersonalized, chain-of-thought) corrects these failures across models; safety framing can worsen stance distortion. Strikingly, Zephyr's hate-speech prevalence estimate matches the gold rate exactly while its class-conditional errors are large in both directions, an accidental cancellation that misleads aggregate validation. We translate these patterns into a three-part taxonomy with diagnostic FBR/FAR signatures and a lightweight gold-sample validation protocol. The headline for trustworthy CSS: a model that looks calibrated on aggregate metrics can still flip the substantive empirical conclusion a researcher would report.
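The 72 cells follow from crossing 3 models, 6 tasks, and 4 prompt conditions. The sketch below enumerates that grid. Model and prompt names are taken from the abstract; the task identifiers are placeholders, because the abstract does not list all six tasks.

```python
from itertools import product

MODELS = ["Zephyr", "Mistral-Instruct", "Qwen2.5-Instruct"]
PROMPTS = ["neutral", "safety framing", "depersonalized", "chain-of-thought"]
# Placeholder task identifiers (assumption): the abstract mentions hate speech,
# offensive language, and abortion stance but does not enumerate all six tasks.
TASKS = [f"task_{i}" for i in range(1, 7)]

# One cell per (model, task, prompt) combination.
cells = list(product(MODELS, TASKS, PROMPTS))
print(len(cells))  # 3 * 6 * 4 = 72
```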
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript audits three 7B instruction-tuned LLMs (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) as annotators across six TweetEval tasks and four prompt conditions. It reports model-specific social-desirability biases: Zephyr shows leniency (high false benign rate, e.g., 0.729 on offensive language), Mistral and Qwen show overcorrection (high false alarm rates, e.g., 0.604 on hate speech), and all three show neutrality bias on abortion stance (underestimating opposition by 24-40 percentage points). A key example is Zephyr's hate-speech prevalence matching gold exactly via cancelling class-conditional errors. The paper translates these into a three-part taxonomy with FBR/FAR signatures and proposes a lightweight gold-sample validation protocol, arguing that aggregate calibration can mask errors that flip substantive CSS conclusions.
Significance. If the empirical patterns hold, the work is significant for computational social science because it demonstrates a concrete mechanism by which LLM annotators can preserve aggregate prevalence estimates while distorting class-conditional distributions, directly threatening the validity of downstream research claims. The taxonomy and validation protocol provide actionable diagnostics that go beyond standard accuracy or calibration metrics. The multi-model, multi-task, multi-prompt design adds robustness to the audit.
major comments (1)
- [Methods / Evaluation] The central claims about LLM-specific error rates (Zephyr FBR 0.729, Mistral FAR 0.604, neutrality bias of 24-40 pp) and the accidental cancellation in Zephyr's hate-speech prevalence all treat TweetEval gold labels as unbiased ground truth. The manuscript reports no checks on these labels for the same social-desirability artifacts (e.g., no stratified IAA by topic sensitivity, no comparison to alternative protocols, no rater demographics). This assumption is load-bearing for attributing the observed distortions and the 'misleading aggregate' pattern to the LLMs.
minor comments (2)
- [Abstract] The abstract states 72 cells but does not enumerate the exact task-prompt-model combinations or provide the precise wording of the four prompt conditions.
- [Discussion] The lightweight gold-sample validation protocol is described at a high level; adding pseudocode or a worked example on one task would improve usability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for highlighting this methodological assumption. We address the concern directly below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Methods / Evaluation] The central claims about LLM-specific error rates (Zephyr FBR 0.729, Mistral FAR 0.604, neutrality bias of 24-40 pp) and the accidental cancellation in Zephyr's hate-speech prevalence all treat TweetEval gold labels as unbiased ground truth. The manuscript reports no checks on these labels for the same social-desirability artifacts (e.g., no stratified IAA by topic sensitivity, no comparison to alternative protocols, no rater demographics). This assumption is load-bearing for attributing the observed distortions and the 'misleading aggregate' pattern to the LLMs.
Authors: We agree that the analysis takes TweetEval gold labels as the reference standard without independent validation for social-desirability bias in the human annotations. This is a genuine limitation of the current study. The reported LLM-specific patterns (leniency in Zephyr, overcorrection in Mistral/Qwen, neutrality bias across models) and the accidental cancellation are defined relative to these labels; absolute claims about real-world prevalence would require additional checks on the gold data. However, the core empirical contribution, model-dependent error signatures that can produce misleading aggregate calibration, remains valid as a relative audit against a standard benchmark. We will add a dedicated Limitations subsection that (1) explicitly states the assumption, (2) notes the absence of stratified IAA or rater-demographic data for TweetEval, and (3) recommends future comparisons to alternative annotation protocols. This revision clarifies scope without changing the reported numbers or taxonomy.
Revision: yes
Circularity Check
No circularity: purely empirical comparison to external gold labels
Full rationale
This paper performs an empirical audit of LLM annotators against fixed external TweetEval gold labels across six tasks. All reported quantities (FBR, FAR, prevalence estimates, class-conditional errors) are computed directly from model outputs versus those labels; no parameters are fitted to the target metrics, no derivations reduce to self-definitions or inputs by construction, and no self-citations are invoked as load-bearing uniqueness theorems. The central claim about aggregate calibration masking substantive flips is therefore independent of the paper's own fitted values or prior author work.