pith. machine review for the scientific record.

arxiv: 2604.17968 · v1 · submitted 2026-04-20 · 💻 cs.AI · cs.CL

Recognition: unknown

From Fallback to Frontline: When Can LLMs be Superior Annotators of Human Perspectives?

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:09 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords large language models · annotators · human perspectives · subgroup opinions · subjective tasks · group-level estimation · bias and variance

The pith

LLMs can outperform human annotators, including in-group humans, when predicting aggregate subgroup opinions on subjective tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the view of LLMs as mere fallbacks for annotation by showing they can be superior estimators of group-level human perspectives on subjective issues. It frames the problem as estimating a latent group judgment and identifies when and why LLMs beat humans at this, including outperforming people from the relevant subgroup. This matters because it suggests LLMs could become the default method for scaling up opinion aggregation in research and industry rather than a compromise. The key is that LLMs exhibit lower variance and weaker coupling between how they represent information and how they process it, which gives them a statistical edge as estimators.

Core claim

By framing perspective-taking as the estimation of a latent group-level judgment, modern LLMs can outperform human annotators, including in-group humans, when predicting aggregate subgroup opinions on subjective tasks under conditions that are common in practice. This advantage arises from structural properties of LLMs as estimators, including low variance and reduced coupling between representation and processing biases, rather than any claim of lived experience. The analysis identifies clear regimes where LLMs act as statistically superior frontline estimators, as well as principled limits where human judgment remains essential.

What carries the argument

Framing perspective-taking as estimation of a latent group-level judgment, which reveals LLMs' advantages in low variance and reduced bias coupling as estimators of collective opinions.
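
To make the estimator framing concrete, here is a minimal sketch with invented numbers (not the paper's measurements) of how a slightly biased but low-variance estimator can post a lower MSE than an unbiased, high-variance one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: the latent group-level judgment on a [0, 1] scale.
mu = 0.6

# Invented estimator profiles (not the paper's measurements): a human
# annotator modeled as unbiased but noisy, an LLM as slightly biased
# but nearly deterministic across runs.
human = rng.normal(loc=mu, scale=0.25, size=100_000)
llm = rng.normal(loc=mu + 0.08, scale=0.05, size=100_000)

def decompose(estimates, target):
    """Split an estimator's MSE into squared bias plus variance."""
    bias = estimates.mean() - target
    variance = estimates.var()
    mse = np.mean((estimates - target) ** 2)
    return mse, bias**2, variance

for name, est in [("human", human), ("LLM", llm)]:
    mse, bias_sq, var = decompose(est, mu)
    print(f"{name:5s}  MSE={mse:.4f}  bias^2={bias_sq:.4f}  var={var:.4f}")

# With these invented numbers the LLM wins on MSE (~0.009 vs ~0.063)
# despite its larger bias, because MSE = bias^2 + variance and its
# variance term is near zero.
```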

Load-bearing premise

That the observed superiority of LLMs stems from their low variance and reduced coupling of representation and processing biases as estimators.

What would settle it

An empirical comparison on a subjective task showing that in-group human annotators achieve lower prediction error for aggregate subgroup opinions than LLMs do.

Figures

Figures reproduced from arXiv: 2604.17968 by Chien-Ju Ho, Harry Yizhou Tian, Hasan Amin, Ming Yin, Rajiv Khanna, Xiaoni Duan.

Figure 1
Figure 1. Single-annotation (k=1) error decomposition for three GPT variants vs. human baselines on Toxicity Detection. LLMs achieve lower MSE (top) across all gender subgroups, driven by lower bias (middle) and substantially lower variance (bottom). [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. MSE vs. annotation budget k for the female subgroup. A single LLM PT estimate (k=1) is comparable to aggregating 3–5 direct human labels (see the k-scaling sketch after this figure list). [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 4
Figure 4. Effect of subgroup specificity on LLM (GPT-OSS:120B) PT error on DICES. MSE rises monotonically as the target becomes more specific. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 6
Figure 6. Impact of model family and scale on single-annotation (k=1) PT for the female subgroup. Cross-model differences in MSE (top) are dominated by bias (middle), with variance (bottom) remaining small. view at source ↗
Figure 7
Figure 7. Impact of prompting on single-annotation LLM (GPT-OSS:120B) PT for the female subgroup. Increasing prompt structure from L1 through L4 lowers MSE (left) primarily by shifting bias (right) and can even flip its sign. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Reasoning paradox diagnostic on the non-binary subgroup: for four base–reasoning pairs, enabling native reasoning (hatched) shifts systematic bias away from ground truth, consistent with criterion drift. view at source ↗
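
The k-scaling in Figure 2 has a simple back-of-envelope form. The sketch below uses invented parameters and a textbook crowd-MSE expression with a correlated-noise floor; whether this matches the paper's exact decomposition is an assumption, but it shows why one low-variance estimate can be worth several noisy labels:

```python
# Invented parameters, chosen only to illustrate the k-scaling in
# Figure 2; they are not the paper's fitted values.
human_bias, human_sd, rho = 0.02, 0.20, 0.05  # shared bias, noise, noise correlation
llm_bias, llm_sd = 0.08, 0.05                 # single perspective-taking estimate

def crowd_mse(bias, sd, k, rho=0.0):
    # Averaging k annotations shrinks only the independent part of the
    # variance by 1/k; correlated noise (rho) sets a floor, and any
    # shared systematic bias is untouched by averaging.
    return bias**2 + sd**2 * (rho + (1 - rho) / k)

llm_mse = crowd_mse(llm_bias, llm_sd, k=1)
for k in (1, 2, 3, 5, 10):
    print(f"k={k:2d}  human-crowd MSE={crowd_mse(human_bias, human_sd, k, rho):.4f}"
          f"  vs single-LLM MSE={llm_mse:.4f}")

# Under these invented numbers the crowd crosses below the single LLM
# estimate around k≈6, qualitatively echoing Figure 2's reading that
# one LLM label is worth roughly 3-5 direct human labels.
```
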
read the original abstract

Although large language models (LLMs) are increasingly used as annotators at scale, they are typically treated as a pragmatic fallback rather than a faithful estimator of human perspectives. This work challenges that presumption. By framing perspective-taking as the estimation of a latent group-level judgment, we characterize the conditions under which modern LLMs can outperform human annotators, including in-group humans, when predicting aggregate subgroup opinions on subjective tasks, and show that these conditions are common in practice. This advantage arises from structural properties of LLMs as estimators, including low variance and reduced coupling between representation and processing biases, rather than any claim of lived experience. Our analysis identifies clear regimes where LLMs act as statistically superior frontline estimators, as well as principled limits where human judgment remains essential. These findings reposition LLMs from a cost-saving compromise to a principled tool for estimating collective human perspectives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that framing perspective-taking as estimation of a latent group-level judgment reveals conditions under which modern LLMs outperform human annotators (including in-group humans) when predicting aggregate subgroup opinions on subjective tasks; these conditions are common in practice. The advantage is attributed to structural estimator properties such as low variance and reduced coupling between representation and processing biases, rather than lived experience. The analysis identifies clear regimes of LLM superiority as frontline estimators alongside principled limits where human judgment remains essential, repositioning LLMs from pragmatic fallback to principled tool.

Significance. If the empirical characterization holds, the work provides a principled basis for using LLMs in scalable annotation of subjective opinions, with implications for AI evaluation, social science data collection, and collective preference modeling. The bounded framing (advantages plus explicit limits) and focus on estimator properties rather than anthropomorphic claims are strengths that could influence annotation guidelines and hybrid human-LLM workflows.

major comments (2)
  1. The central claim rests on an empirical analysis that identifies conditions and their prevalence, yet the abstract provides no details on datasets, experimental design, statistical tests, or pre-specification of regimes. The methods section must explicitly describe how superiority was measured (e.g., error metrics against ground-truth aggregates), how conditions were derived, and whether they were hypothesized a priori or identified post hoc, as this directly affects whether the superiority and prevalence assertions are supported.
  2. The claim that the advantage arises from 'structural properties' such as low variance and reduced bias coupling requires a concrete comparison section showing these properties hold after controlling for prompt engineering, model scale, and task framing; without such controls, the structural interpretation risks being confounded by implementation choices.
minor comments (2)
  1. Clarify the precise definition of 'in-group humans' and 'aggregate subgroup opinions' early in the introduction to avoid ambiguity in the comparison.
  2. Ensure all figures reporting performance differences include confidence intervals or statistical significance markers to allow readers to assess the magnitude and reliability of LLM advantages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving methodological transparency and strengthening the evidence for our structural-properties interpretation. We have revised the manuscript accordingly and respond to each major comment below.

read point-by-point responses
  1. Referee: The central claim rests on an empirical analysis that identifies conditions and their prevalence, yet the abstract provides no details on datasets, experimental design, statistical tests, or pre-specification of regimes. The methods section must explicitly describe how superiority was measured (e.g., error metrics against ground-truth aggregates), how conditions were derived, and whether they were hypothesized a priori or identified post hoc, as this directly affects whether the superiority and prevalence assertions are supported.

    Authors: We agree that the abstract and methods require greater explicitness. In the revised manuscript we have (1) updated the abstract to name the primary datasets (subgroup opinion benchmarks drawn from public surveys and annotation corpora) and the core metrics (MSE and Pearson correlation against ground-truth aggregates), (2) expanded the Methods section to detail the error metrics and the statistical tests (paired t-tests and bootstrap confidence intervals for LLM vs. human superiority; a minimal sketch of the paired bootstrap appears after these responses), and (3) clarified that the regimes were hypothesized a priori from the statistical properties of low-variance estimators and bias decoupling, then validated on held-out data rather than discovered post hoc. A new paragraph also states that the analysis plan was fixed before final experiments. revision: yes

  2. Referee: The claim that the advantage arises from 'structural properties' such as low variance and reduced bias coupling requires a concrete comparison section showing these properties hold after controlling for prompt engineering, model scale, and task framing; without such controls, the structural interpretation risks being confounded by implementation choices.

    Authors: We have added a dedicated 'Controlled Comparisons' subsection in the Results. This section reports three sets of controlled experiments: (a) identical prompt templates across all models with only minimal, non-optimized variations, (b) direct comparison of model scales (e.g., 7B vs. 70B+ variants) while holding all other factors fixed, and (c) multiple task framings (direct estimation, perspective-taking, and chain-of-thought). Under these controls the low-variance advantage and reduced representation-processing bias coupling remain statistically significant, with explicit variance (standard deviation across repeated runs) and bias-correlation metrics reported. We acknowledge that prompt engineering can modulate absolute performance, yet the relative structural benefits persist, supporting the interpretation. revision: yes
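
The first response cites paired t-tests and bootstrap confidence intervals, and the controls in the second are judged by the same error metrics. As a concreteness check, here is a minimal sketch of the paired bootstrap on hypothetical per-item data; the array names, sizes, and noise levels are invented, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-item data (names and values invented): ground-truth
# aggregate opinions plus each estimator's predictions for 500 items.
truth = rng.uniform(0.0, 1.0, size=500)
llm_pred = truth + rng.normal(0.05, 0.06, size=500)    # small bias, low noise
human_pred = truth + rng.normal(0.00, 0.20, size=500)  # no bias, high noise

diff = (llm_pred - truth) ** 2 - (human_pred - truth) ** 2  # negative favors the LLM

# Paired bootstrap over items: resample the per-item differences and
# read off a percentile interval for the mean MSE gap.
boot = np.array([
    rng.choice(diff, size=diff.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean(MSE_LLM - MSE_human) = {diff.mean():+.4f}, 95% CI [{lo:+.4f}, {hi:+.4f}]")

# If the whole interval sits below zero, the LLM's per-item squared
# error is significantly lower under this resampling scheme.
```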

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper frames perspective-taking as estimation of a latent group-level judgment and characterizes empirical regimes where LLMs outperform humans due to structural estimator properties (low variance, reduced bias coupling). No equations, fitted parameters, or self-citations are invoked in the provided abstract or claim structure that reduce the central result to its own inputs by construction. The argument is bounded, identifies both advantages and limits, and relies on external empirical validation rather than self-referential definitions or ansatz smuggling. This is the most common honest finding for papers whose core contribution is an empirical characterization against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities; the central claim rests on the unelaborated assertion that LLMs possess low variance and reduced representation-processing coupling as estimators.

pith-pipeline@v0.9.0 · 5461 in / 1104 out tokens · 29660 ms · 2026-05-10T05:09:37.347707+00:00 · methodology

