Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

Varun Kotte

arXiv: 2606.12426 · v1 · pith:NCD52TRXnew · submitted 2026-05-12 · 💻 cs.CY · cs.CL · cs.LG

Pith reviewed 2026-06-30 22:27 UTC · model grok-4.3

classification 💻 cs.CY · cs.CL · cs.LG

keywords LLM annotators · social-desirability bias · computational social science · TweetEval · annotation errors · aggregate calibration · stance detection · prevalence estimation

The pith

A model that matches aggregate label statistics can still reverse the substantive empirical conclusions a researcher would draw from the data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits three 7B instruction-tuned LLMs on six TweetEval tasks under four prompt conditions to check whether their annotation errors preserve the prevalence estimates and class distributions that matter for computational social science. It identifies model-specific patterns: one model under-applies harmful labels while the other two over-apply them, and all three systematically inflate the neutral label on abortion stance detection, under-counting opposition views by 24 to 40 percentage points. These directional errors can cancel exactly in aggregate metrics, so that overall prevalence matches the gold rate while class-conditional mistakes remain large. None of the four tested prompting strategies removes the distortions consistently across models. The author converts the observed patterns into a taxonomy of bias types diagnosed by false benign and false alarm rates.

Core claim

Three open-source 7B models exhibit distinct social-desirability biases when labeling TweetEval data. Zephyr shows leniency bias with high false benign rates on offensive and hate content, while Mistral and Qwen show overcorrection with elevated false alarm rates. All three display neutrality bias on abortion stance by underestimating opposition prevalence. Safety framing and chain-of-thought prompts do not correct these failures uniformly. Zephyr produces an exact match to gold hate-speech prevalence through large opposing errors in both directions, demonstrating that aggregate calibration does not ensure the class distributions remain faithful.

What carries the argument

A three-part taxonomy of leniency bias, overcorrection, and neutrality bias, diagnosed through false benign rate (FBR) and false alarm rate (FAR) signatures on gold-labeled samples.
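The two diagnostic rates can be sketched in a few lines of Python. This is a minimal illustration under assumed definitions, since the page does not spell them out: FBR is taken to be the share of gold-harmful items the model labels benign, and FAR the share of gold-benign items it labels harmful. The function names, the 0/1 label encoding, the `threshold` value, and the toy data are all hypothetical, and neutrality bias (a multi-class pattern) is not covered here.

```python
from typing import Sequence


def fbr_far(gold: Sequence[int], pred: Sequence[int]) -> tuple[float, float]:
    """Return (FBR, FAR) for binary labels where 1 = harmful and 0 = benign.

    Assumed definitions (not confirmed by the source text):
      FBR = share of gold-harmful items the model labels benign
      FAR = share of gold-benign items the model labels harmful
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    harmful = [p for g, p in zip(gold, pred) if g == 1]
    benign = [p for g, p in zip(gold, pred) if g == 0]
    if not harmful or not benign:
        raise ValueError("need at least one gold item in each class")
    fbr = sum(1 for p in harmful if p == 0) / len(harmful)
    far = sum(1 for p in benign if p == 1) / len(benign)
    return fbr, far


def signature(fbr: float, far: float, threshold: float = 0.3) -> str:
    """Map an (FBR, FAR) pair to a coarse label; the threshold is hypothetical."""
    if fbr >= threshold and far >= threshold:
        return "both error rates high"
    if fbr >= threshold:
        return "leniency"
    if far >= threshold:
        return "overcorrection"
    return "no strong signature"


# Toy data, invented for illustration only.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
fbr, far = fbr_far(gold, pred)
print(fbr, far, signature(fbr, far))
```

Keeping the two rates separate, rather than folding them into accuracy or F1, is what lets the direction of the error show up.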

If this is right

Researchers cannot treat aggregate accuracy or prevalence match as sufficient validation when using LLMs to estimate label distributions in CSS.
Common prompting interventions fail to eliminate social-desirability distortions across different base models.
Stance detection tasks are especially prone to neutrality bias that compresses the reported prevalence of opposing views.
A lightweight protocol that computes FBR and FAR on a small gold sample is needed before trusting LLM-derived empirical claims.
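For the stance item above, the quantity of interest is the per-class gap between predicted and gold prevalence, in percentage points. A minimal sketch follows, assuming string labels; the label names and the toy data are hypothetical and are not drawn from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def prevalence_gap_pp(
    gold: Sequence[Hashable], pred: Sequence[Hashable]
) -> dict[Hashable, float]:
    """Per-class (predicted minus gold) prevalence, in percentage points."""
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and equally long")
    n = len(gold)
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    classes = sorted(set(gold_counts) | set(pred_counts))
    # Counter returns 0 for a class that never appears on one side.
    return {c: 100.0 * (pred_counts[c] - gold_counts[c]) / n for c in classes}


# Toy stance data, invented for illustration only.
gold = ["against"] * 5 + ["favor"] * 2 + ["none"] * 3
pred = ["against"] * 2 + ["none"] * 3 + ["favor"] * 2 + ["none"] * 3
gaps = prevalence_gap_pp(gold, pred)
print(gaps)
```

A negative gap on the opposition class paired with a positive gap on the neutral class is the pattern the page calls neutrality bias.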

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

CSS papers using LLM annotations may need to report full confusion matrices rather than summary metrics alone.
The observed cancellation effect suggests that similar audits should be run on other social-media or forum datasets to check whether the same bias signatures appear.
Model-specific correction factors derived from small gold sets could be tested as a practical mitigation before large-scale annotation.
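The correction-factor idea in the last item could be prototyped with the Rogan-Gladen estimator, a standard misclassification correction from epidemiology. It is swapped in here purely as an illustration; neither the paper nor this page proposes it. The sketch assumes a binary task, the FBR/FAR definitions used above (miss rate on gold-harmful items, false alarm rate on gold-benign items), and that error rates measured on the gold sample carry over to the unlabeled corpus. All names and numbers are hypothetical.

```python
def observed_rate(true_prevalence: float, fbr: float, far: float) -> float:
    """Share of items the model labels harmful, given true prevalence and error rates."""
    return true_prevalence * (1.0 - fbr) + (1.0 - true_prevalence) * far


def corrected_prevalence(observed: float, fbr: float, far: float) -> float:
    """Rogan-Gladen style estimate of true prevalence, clipped to [0, 1]."""
    denom = 1.0 - fbr - far
    if denom <= 0.0:
        # The annotator is no better than chance; the correction is undefined.
        raise ValueError("FBR + FAR must be below 1 for the correction to apply")
    estimate = (observed - far) / denom
    return min(1.0, max(0.0, estimate))


# Hypothetical numbers: true prevalence 0.20, FBR 0.50, FAR 0.10.
q = observed_rate(0.20, fbr=0.50, far=0.10)
print(q, corrected_prevalence(q, fbr=0.50, far=0.10))
```

The estimate becomes unstable as FBR + FAR approaches 1, so the correction is only as good as the gold sample used to measure the two rates.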

Load-bearing premise

The TweetEval gold labels themselves are free of social-desirability artifacts and the six tasks represent the annotation workloads typical in computational social science.

What would settle it

A model or prompt that simultaneously drives both FBR and FAR near zero on the offensive-language and abortion-stance tasks, so that the resulting prevalence estimates deviate by only a few percentage points from the gold distribution.

Original abstract

LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) across six TweetEval tasks under four prompt conditions (72 cells) and find that social-desirability failures do not run in a single direction. Zephyr exhibits leniency bias, systematically under-applying harmful labels (offensive language: false benign rate 0.729, false alarm rate 0.031). Mistral and Qwen exhibit overcorrection, over-applying the same labels (Mistral hate-speech FAR = 0.604). All three models exhibit neutrality bias on abortion stance, underestimating opposition prevalence by 24 to 40 percentage points and inflating the neutral label. None of the four prompting interventions we test (neutral, safety framing, depersonalized, chain-of-thought) corrects these failures across models; safety framing can worsen stance distortion. Strikingly, Zephyr's hate-speech prevalence estimate matches the gold rate exactly while its class-conditional errors are large in both directions, an accidental cancellation that misleads aggregate validation. We translate these patterns into a three-part taxonomy with diagnostic FBR/FAR signatures and a lightweight gold-sample validation protocol. The headline for trustworthy CSS: a model that looks calibrated on aggregate metrics can still flip the substantive empirical conclusion a researcher would report.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLM annotators can match gold prevalence by accident while flipping the actual class distributions a researcher would report, but the TweetEval labels themselves may carry similar biases.

The main thing is that aggregate calibration on LLM outputs can mask large opposing errors, so a model hits the right overall rate but would lead a CSS study to the wrong substantive claim. Zephyr on hate speech is the clearest case: its class-conditional errors are large in both directions, and the missed harmful items and false alarms offset each other to produce an exact match to gold prevalence.
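The cancellation can be written out for a binary task. The derivation below is a sketch under assumptions not stated on this page: π denotes the gold prevalence of the harmful class, FBR is the miss rate on gold-harmful items, and FAR is the false alarm rate on gold-benign items. The numbers in the last line are hypothetical and are not Zephyr's.

```latex
\begin{align}
\hat{\pi} &= \pi\,(1 - \mathrm{FBR}) + (1 - \pi)\,\mathrm{FAR} \\
\hat{\pi} - \pi &= (1 - \pi)\,\mathrm{FAR} - \pi\,\mathrm{FBR} \\
\hat{\pi} = \pi &\iff \pi\,\mathrm{FBR} = (1 - \pi)\,\mathrm{FAR} \\
\text{e.g. } \pi = 0.40,\ \mathrm{FBR} = 0.45,\ \mathrm{FAR} = 0.30 &:\quad
0.40 \times 0.45 = 0.60 \times 0.30 = 0.18
\end{align}
```

In that hypothetical case 45% of harmful items are missed and 30% of benign items are flagged, yet the predicted prevalence equals the gold prevalence of 0.40.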

The paper runs three 7B models across six TweetEval tasks and four prompt styles. It documents Zephyr's leniency on harmful labels, Mistral and Qwen's over-application of them, and a shared neutrality bias on abortion stance that shifts prevalence estimates by 24 to 40 percentage points. Prompt tweaks do not fix the patterns. The FBR/FAR taxonomy and the suggestion to validate on a gold subsample are straightforward and usable.

Those patterns are the useful part. The cancellation example and the model-specific directions give concrete evidence that standard accuracy checks are not enough for prevalence work.

The soft spot is the ground truth. Everything is measured against TweetEval labels, yet the paper does not check whether those labels already embed leniency, overcorrection, or neutrality effects from their original annotators. On sensitive topics that is a real gap; if the gold is already distorted, the reported LLM errors are relative rather than absolute. The abstract also skips statistical tests, split details, and raw matrices, which makes the effect sizes hard to judge.

This is for people who annotate with LLMs in computational social science. A reader running prevalence studies would get direct value from the examples and the protocol.

It deserves peer review. The empirical question is worth referee time even with the ground-truth caveat.

Referee Report

1 major / 2 minor

Summary. The manuscript audits three 7B instruction-tuned LLMs (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) as annotators across six TweetEval tasks and four prompt conditions. It reports model-specific social-desirability biases: Zephyr shows leniency (high false benign rate, e.g., 0.729 on offensive language), Mistral and Qwen show overcorrection (high false alarm rates, e.g., 0.604 on hate speech), and all three show neutrality bias on abortion stance (underestimating opposition by 24-40 percentage points). A key example is Zephyr's hate-speech prevalence matching gold exactly via cancelling class-conditional errors. The paper translates these into a three-part taxonomy with FBR/FAR signatures and proposes a lightweight gold-sample validation protocol, arguing that aggregate calibration can mask errors that flip substantive CSS conclusions.

Significance. If the empirical patterns hold, the work is significant for computational social science because it demonstrates a concrete mechanism by which LLM annotators can preserve aggregate prevalence estimates while distorting class-conditional distributions, directly threatening the validity of downstream research claims. The taxonomy and validation protocol provide actionable diagnostics that go beyond standard accuracy or calibration metrics. The multi-model, multi-task, multi-prompt design adds robustness to the audit.

major comments (1)

[Methods / Evaluation] The central claims about LLM-specific error rates (Zephyr FBR 0.729, Mistral FAR 0.604, neutrality bias of 24-40 pp) and the accidental cancellation in Zephyr's hate-speech prevalence all treat TweetEval gold labels as unbiased ground truth. The manuscript reports no checks on these labels for the same social-desirability artifacts (e.g., no stratified IAA by topic sensitivity, no comparison to alternative protocols, no rater demographics). This assumption is load-bearing for attributing the observed distortions and the 'misleading aggregate' pattern to the LLMs.

minor comments (2)

[Abstract] The abstract states 72 cells but does not enumerate the exact task-prompt-model combinations or provide the precise wording of the four prompt conditions.
[Discussion] The lightweight gold-sample validation protocol is described at a high level; adding pseudocode or a worked example on one task would improve usability.
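One possible shape for the worked example requested in the second minor comment is sketched below. It is not taken from the manuscript. It attaches a Wilson score interval to each error rate so that the size of the gold sample is visible in the result. The function name, the 95% level (`z = 1.96`), and the counts are hypothetical.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error rate of `errors` out of `n` gold items."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical gold sample: 40 harmful items (28 labeled benign by the model)
# and 160 benign items (8 labeled harmful by the model).
fbr_ci = wilson_interval(28, 40)   # interval around FBR = 0.70
far_ci = wilson_interval(8, 160)   # interval around FAR = 0.05
print(fbr_ci, far_ci)

# The same FBR measured on only 10 harmful items gives a much wider interval.
small_ci = wilson_interval(7, 10)
print(small_ci)
```

Because FBR is estimated only from gold-harmful items, a rare harmful class can leave its interval wide even when the overall gold sample looks large.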

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting this methodological assumption. We address the concern directly below and will incorporate revisions to strengthen the manuscript.

Referee: [Methods / Evaluation] The central claims about LLM-specific error rates (Zephyr FBR 0.729, Mistral FAR 0.604, neutrality bias of 24-40 pp) and the accidental cancellation in Zephyr's hate-speech prevalence all treat TweetEval gold labels as unbiased ground truth. The manuscript reports no checks on these labels for the same social-desirability artifacts (e.g., no stratified IAA by topic sensitivity, no comparison to alternative protocols, no rater demographics). This assumption is load-bearing for attributing the observed distortions and the 'misleading aggregate' pattern to the LLMs.

Authors: We agree that the analysis takes TweetEval gold labels as the reference standard without independent validation for social-desirability bias in the human annotations. This is a genuine limitation of the current study. The reported LLM-specific patterns (leniency in Zephyr, overcorrection in Mistral/Qwen, neutrality bias across models) and the accidental cancellation are defined relative to these labels; absolute claims about real-world prevalence would require additional checks on the gold data. However, the core empirical contribution (model-dependent error signatures that can produce misleading aggregate calibration) remains valid as a relative audit against a standard benchmark. We will add a dedicated Limitations subsection that (1) explicitly states the assumption, (2) notes the absence of stratified IAA or rater-demographic data for TweetEval, and (3) recommends future comparisons to alternative annotation protocols. This revision clarifies scope without changing the reported numbers or taxonomy.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison to external gold labels

This paper performs an empirical audit of LLM annotators against fixed external TweetEval gold labels across six tasks. All reported quantities (FBR, FAR, prevalence estimates, class-conditional errors) are computed directly from model outputs versus those labels; no parameters are fitted to the target metrics, no derivations reduce to self-definitions or inputs by construction, and no self-citations are invoked as load-bearing uniqueness theorems. The central claim about aggregate calibration masking substantive flips is therefore independent of the paper's own fitted values or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests entirely on the empirical comparisons to TweetEval gold labels; no free parameters, axioms, or invented entities are introduced.
