pith. sign in

arxiv: 2606.12754 · v1 · pith:XR3FQOYInew · submitted 2026-06-10 · 💻 cs.CL · cs.AI

LLMs Can Better Capture Human Judgments--With the Right Prompts

Pith reviewed 2026-06-27 09:24 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM promptinghuman judgment alignmentresponse distributionsmoral scenariosstandard deviation elicitationconfusion ratingsAI-human agreement
0
0 comments X

The pith

Prompting LLMs to report standard deviations and response proportions recovers the full range of human moral judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can better match human response distributions on moral scenarios when given targeted prompts instead of standard ones. Across a U.S. sample of moral dilemmas and an international survey on family and gender roles, models prompted to output standard deviations, response proportions, and clarity assessments align more closely with human variability and confusion levels. The work shows these elicitation methods outperform common prompting approaches while also revealing that models predict human variability better than their own error rates.

Core claim

Prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Ensuring scenarios are clear to humans, as measured by confusion ratings, boosts model alignment, and LLMs can track those human confusion ratings across the two datasets.

What carries the argument

Elicitation prompts that require models to output distributional statistics (standard deviations and response proportions) plus scenario clarity checks.

If this is right

  • LLMs can predict human confusion ratings when scenarios are presented clearly.
  • LLMs' self-estimates of their own error remain poorly calibrated even with these prompts.
  • Models track human response variability more reliably than their own uncertainty.
  • Simple changes in what the model is asked to output improve alignment without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distributional prompting could be tested on factual or preference judgments outside morality.
  • If clarity ratings transfer, they might serve as a filter for which items to trust in LLM-based surveys.
  • Poor self-calibration suggests separate mechanisms are needed for model uncertainty versus human variability.

Load-bearing premise

The two selected datasets of moral scenarios are representative enough that the prompting benefits will hold for other kinds of human judgments.

What would settle it

Run the same prompting strategies on a new set of non-moral judgment tasks and check whether the improvement in matching full human response distributions disappears.

read the original abstract

Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distributions of responses, and that their judgments are unstable across wording variations. We demonstrate simple prompting strategies that mitigate these limitations. Across two datasets--a U.S.-representative set of 144 moral scenarios and 38 moral beliefs from the International Social Survey Programme's Family and Changing Gender Roles module covering 32 countries--we show how simple elicitation techniques help improve AI-human alignment. First, prompting models to report standard deviations and response proportions recovers the full range of human responses better than common strategies. Second, ensuring scenarios are clear to human participants--as reflected in human confusion ratings--boosts model alignment, and LLMs can track human confusion ratings. At the same time, we find that LLMs' estimates of their own error are poorly calibrated, though they can predict human variability relatively well. These results suggest that asking better questions to LLMs can yield better answers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that simple prompting strategies—specifically, instructing LLMs to report standard deviations and response proportions, plus ensuring scenario clarity via human confusion ratings—recover the full range of human responses and enable LLMs to track human confusion ratings better than common strategies. This is shown empirically on two fixed datasets: 144 U.S.-representative moral scenarios and 38 moral beliefs from the ISSP Family and Changing Gender Roles module across 32 countries. The work also reports that LLMs' self-error estimates are poorly calibrated while their predictions of human variability are relatively accurate.

Significance. If the reported improvements hold under the described prompting changes, the work supplies concrete, replicable elicitation techniques that address two frequently cited limitations of LLMs in modeling human judgment distributions. The explicit use of two distinct, multi-country datasets and direct comparison against baseline prompting strategies constitutes a strength; the finding that LLMs can predict human confusion ratings is a falsifiable empirical outcome that could be tested in follow-up work.

major comments (2)
  1. [Results] Results section: the claim that SD/proportion reporting 'recovers the full range of human responses better' requires the specific quantitative metric (e.g., Wasserstein distance, coverage of response bins, or KL divergence) and the associated statistical test with sample sizes and error bars; without these, it is impossible to judge whether the improvement is robust or merely descriptive.
  2. [Methods] Methods: the paper must report the exact number of LLM generations per item, temperature, and model versions used for the proportion and SD elicitation experiments; these parameters directly determine whether the reported gains in distribution matching are reproducible and not artifacts of a single run.
minor comments (2)
  1. [Abstract] The abstract states the datasets but does not give the exact item counts (144 and 38) until later; moving these numbers into the abstract would improve immediate clarity.
  2. [Figures/Tables] Table or figure captions should explicitly state the baseline prompting strategy against which the new techniques are compared so readers can interpret the magnitude of improvement without returning to the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive overall assessment. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.

read point-by-point responses
  1. Referee: [Results] Results section: the claim that SD/proportion reporting 'recovers the full range of human responses better' requires the specific quantitative metric (e.g., Wasserstein distance, coverage of response bins, or KL divergence) and the associated statistical test with sample sizes and error bars; without these, it is impossible to judge whether the improvement is robust or merely descriptive.

    Authors: We agree that the current presentation relies on descriptive comparisons and would be strengthened by explicit distributional metrics. In the revised manuscript we will add Wasserstein distance and KL divergence between LLM-elicited and human response distributions for both datasets, report coverage of response bins, and include bootstrap-based error bars with the exact sample sizes (N=144 scenarios and N=38 beliefs). Statistical significance of the improvement over baseline prompting will also be reported. revision: yes

  2. Referee: [Methods] Methods: the paper must report the exact number of LLM generations per item, temperature, and model versions used for the proportion and SD elicitation experiments; these parameters directly determine whether the reported gains in distribution matching are reproducible and not artifacts of a single run.

    Authors: We acknowledge the omission of these implementation details. The revised Methods section will explicitly state the number of generations per item (10 independent samples), temperature settings (0.7 for proportion/SD elicitation), and the precise model versions (GPT-4-0613, Claude-3-Opus, Llama-3-70B) used in each experiment, along with the prompting templates. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports direct empirical comparisons of prompting techniques against external human response data on two fixed datasets. No equations, fitted parameters, derivations, or self-citations are invoked as load-bearing steps for the central claims. The results (improved alignment via SD/proportion reporting and clarity filtering) are presented as experimental outcomes rather than inferences that reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields minimal ledger entries; the central claim rests on the representativeness of the two moral datasets and the validity of human confusion ratings as a clarity measure.

axioms (1)
  • domain assumption The U.S. moral scenarios and ISSP module are representative of human moral judgments.
    The paper uses these datasets to demonstrate the prompting effects.

pith-pipeline@v0.9.1-grok · 5728 in / 1111 out tokens · 24260 ms · 2026-06-27T09:24:18.526458+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    Person X rescued a tree from an injured kitten

    Observed human response distributions and model-estimated response distributions for the first four ISSP family values items for Llama and GPT. Each panel shows the proportion of responses assigned to each ordered response option. Lines compare the human distribution with distributions estimated using Repeated sampling, Proportions, Moments: Mean + SD, Mo...

  2. [2]

    Language Models (Mostly) Know What They Know

    GESIS Data Archive, Cologne. ZA10000 Data file Version 2.0.0, https://doi.org/10.4232/5.ZA10000.2.0.0 Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., et al. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221. Kambhatla, G., Gautam, S., Zhang, A., Liu, A., Srinivasan, R., Li, J. J., and Lease, M...

  3. [3]

    Teaching Models to Express Their Uncertainty in Words

    73–95. Lin, S., Hilton, J., and Evans, O. (2022). Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334. Liu, Y., Kaneko, M., and Chu, C. (2026). On the alignment of large language models with global human opinion. Proc. AAAI Conf. Artif. Intell. 40, 37673–37681. Meister, N., Guestrin, C., and Hashimoto, T. B. (2025). Benc...