arxiv: 2604.02512 · v1 · submitted 2026-04-02 · 💻 cs.CL · cs.AI

Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting

Pith reviewed 2026-05-13 21:23 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · social inferences · pragmatic prompting · calibration metrics · effect size ratio · magnitude calibration · numerical imprecision · speaker knowledge

The pith

Large language models reproduce the structure of human social inferences but differ substantially in their magnitude calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs approximate human social meaning both qualitatively and quantitatively. It introduces two metrics, the Effect Size Ratio and Calibration Deviation Score, to separate structural fidelity from magnitude accuracy. Using a case study on numerical imprecision, it applies prompting conditions drawn from pragmatic assumptions about linguistic alternatives and speaker knowledge states. All three models tested match the qualitative human patterns reliably, yet their quantitative strength varies, and only the combined prompting intervention improves every calibration metric, though not to full human alignment. This shows LLMs capture inferential form while distorting inferential force to varying degrees.

Core claim

All models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, while prompting for alternative-awareness tends to amplify exaggeration. Combining both components is the only intervention that improves all calibration-sensitive metrics across all models, though fine-grained magnitude calibration remains only partially resolved.

What carries the argument

The Effect Size Ratio (ESR) and Calibration Deviation Score (CDS) metrics, which separate structural fidelity from magnitude calibration, together with prompting conditions derived from reasoning over linguistic alternatives and inferring speaker knowledge states and communicative motives.
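
The exact formulas for ESR and CDS are not reproduced on this page, so the following is only a minimal sketch of how a ratio-style and a deviation-style metric could separate structure from magnitude. It assumes ESR is the model's effect divided by the human effect, and it uses a mean absolute deviation as a hypothetical stand-in for CDS. The function names and all numbers are illustrative assumptions, not the paper's definitions or data.

```python
from statistics import mean


def effect_size_ratio(model_effect: float, human_effect: float) -> float:
    """ASSUMED form of ESR: model effect divided by human effect.

    A value near 1 means similar magnitude, above 1 means the model
    exaggerates the effect, and a negative value means the sign flipped.
    """
    if human_effect == 0:
        raise ValueError("ratio is undefined when the human effect is zero")
    return model_effect / human_effect


def mean_abs_deviation(model_means, human_means) -> float:
    """HYPOTHETICAL stand-in for CDS: mean |M - H| on the rating scale."""
    model_means, human_means = list(model_means), list(human_means)
    if len(model_means) != len(human_means):
        raise ValueError("inputs must have the same length")
    return mean(abs(m - h) for m, h in zip(model_means, human_means))


# Made-up mean ratings on a 7-point scale, for illustration only.
human = {"precise": 5.0, "approximate": 4.0}
model = {"precise": 6.0, "approximate": 3.5}

human_effect = human["precise"] - human["approximate"]
model_effect = model["precise"] - model["approximate"]

esr = effect_size_ratio(model_effect, human_effect)
dev = mean_abs_deviation(model.values(), human.values())
print(f"ratio = {esr}, mean absolute deviation = {dev}")
```

In this toy case the model's effect has the same direction as the human effect (structure preserved) but is 2.5 times as large (magnitude exaggerated), which is the kind of dissociation the page describes.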

If this is right

  • Pragmatic prompting that combines alternative-awareness with knowledge and motive reasoning improves calibration metrics across models.
  • LLMs capture inferential structure more reliably than inferential strength.
  • Magnitude calibration remains only partially resolved even with the best prompting intervention tested.
  • Fine-grained adjustments to inference strength require interventions beyond current pragmatic prompting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structural match may appear in other social domains such as politeness or implicature.
  • Better magnitude calibration could make LLMs more reliable in applications requiring calibrated social advice or negotiation.
  • Larger models or different training regimes might reduce the remaining calibration gap without extra prompting.

Load-bearing premise

The ESR and CDS metrics accurately measure human-like social meaning and the numerical precision case study generalizes to other domains of social inference.

What would settle it

A direct comparison of model outputs versus human judgments on a non-numerical social inference task that shows mismatched qualitative structure.

Figures

Figures reproduced from arXiv: 2604.02512 by Roland Mühlenbernd.

Figure 1: Human vs. model mean ratings across all conditions (scenarios, contexts, utterance forms, and ...
Figure 2: Effect Size Ratios (ESR) per model, prompting condition, and benchmark effect. Rows are ...

Original abstract

Large language models (LLMs) increasingly exhibit human-like patterns of pragmatic and social reasoning. This paper addresses two related questions: do LLMs approximate human social meaning not only qualitatively but also quantitatively, and can prompting strategies informed by pragmatic theory improve this approximation? To address the first, we introduce two calibration-focused metrics distinguishing structural fidelity from magnitude calibration: the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS). To address the second, we derive prompting conditions from two pragmatic assumptions: that social meaning arises from reasoning over linguistic alternatives, and that listeners infer speaker knowledge states and communicative motives. Applied to a case study on numerical (im)precision across three frontier LLMs, we find that all models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, while prompting for alternative-awareness tends to amplify exaggeration. Combining both components is the only intervention that improves all calibration-sensitive metrics across all models, though fine-grained magnitude calibration remains only partially resolved. LLMs thus capture inferential structure while variably distorting inferential strength, and pragmatic theory provides a useful but incomplete handle for improving that approximation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs reproduce the qualitative structure of human social inferences (via a numerical (im)precision case study across three frontier models) but differ substantially in magnitude calibration. It introduces ESR and CDS metrics to separate structural fidelity from magnitude calibration, derives pragmatic prompting conditions from assumptions about linguistic alternatives and speaker knowledge/motives, and reports that only the joint intervention improves all calibration-sensitive metrics, though fine-grained magnitude calibration remains only partially resolved.

Significance. If the ESR/CDS metrics prove robust, the work offers a useful distinction between inferential structure and strength in LLM social reasoning and shows that pragmatic-theory-informed prompting provides a partial but actionable handle for calibration. The cross-model consistency on qualitative reproduction and the finding that isolated prompting components can amplify exaggeration are potentially valuable for both theoretical accounts of LLM pragmatics and practical prompt engineering.

major comments (3)
  1. [§3] §3 (Metrics definition): ESR and CDS are computed by direct comparison to human effect sizes and means drawn from prior numerical-precision literature without fresh human data, inter-rater reliability, or item-level variance reported for the exact stimuli; this is load-bearing for the diagnosis of “substantial magnitude deviation” and for the claim that only the combined prompting intervention improves all calibration metrics.
  2. [Results] Results section (and abstract): The manuscript reports consistent directional findings across models but provides no sample sizes, statistical tests, exact formulas for ESR/CDS, or controls for prompt sensitivity; without these, it is unclear whether the data support the magnitude-calibration claims or the assertion that the joint intervention is uniquely effective.
  3. [§5] §5 (Prompting evaluation): The claim that “prompting for alternative-awareness tends to amplify exaggeration” while knowledge/motive reasoning reduces deviation rests on the chosen reference baselines; if those baselines contain high context-dependent variance, the differential effects of the two prompting components could be artifacts rather than model properties.
minor comments (2)
  1. [§3] The exact algebraic definitions of ESR and CDS should appear in the main text (not only appendix) with a worked numerical example for one stimulus.
  2. [Figures/Tables] Figure captions and tables should explicitly state the number of items, models, and prompt variants underlying each bar or cell.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, indicating where revisions will be made to improve clarity, transparency, and robustness.

point-by-point responses
  1. Referee: [§3] §3 (Metrics definition): ESR and CDS are computed by direct comparison to human effect sizes and means drawn from prior numerical-precision literature without fresh human data, inter-rater reliability, or item-level variance reported for the exact stimuli; this is load-bearing for the diagnosis of “substantial magnitude deviation” and for the claim that only the combined prompting intervention improves all calibration metrics.

    Authors: We used established human benchmarks from the numerical-precision literature cited in §2 because the study focuses on how LLMs approximate well-documented human patterns rather than re-collecting human judgments. We will revise §3 to include the exact formulas for ESR and CDS, report any item-level statistics available from the source papers, and explicitly discuss the limitations of relying on secondary human data (including the lack of new inter-rater reliability for our exact stimuli). This addresses the load-bearing concern while preserving the case-study design. revision: partial

  2. Referee: [Results] Results section (and abstract): The manuscript reports consistent directional findings across models but provides no sample sizes, statistical tests, exact formulas for ESR/CDS, or controls for prompt sensitivity; without these, it is unclear whether the data support the magnitude-calibration claims or the assertion that the joint intervention is uniquely effective.

    Authors: We agree that greater methodological transparency is needed. In the revised manuscript we will report exact sample sizes (number of generations per condition), provide the mathematical definitions of ESR and CDS in §3, include statistical tests comparing conditions (e.g., paired t-tests on CDS scores), and add controls for prompt sensitivity by reporting results across multiple prompt phrasings and temperature settings. These additions will directly support the claims regarding the joint intervention. revision: yes

  3. Referee: [§5] §5 (Prompting evaluation): The claim that “prompting for alternative-awareness tends to amplify exaggeration” while knowledge/motive reasoning reduces deviation rests on the chosen reference baselines; if those baselines contain high context-dependent variance, the differential effects of the two prompting components could be artifacts rather than model properties.

    Authors: The baselines were standard zero-shot and chain-of-thought prompts chosen to isolate the pragmatic components. The directional patterns held consistently across three frontier models, which argues against a pure artifact. Nevertheless, we will add an explicit analysis of baseline variance and include additional baseline variants in the revision to further demonstrate that the differential effects are not driven by context-dependent noise in the reference conditions. revision: partial
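
The paired comparison mentioned in response 2 ("paired t-tests on CDS scores") could look roughly like the sketch below. It assumes the pairing unit is the stimulus item, with one deviation score per item under each of two prompting conditions. The scores are made-up placeholders, and `paired_t` is a hypothetical helper that returns only the t statistic and degrees of freedom, not a p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t, df) for a paired t-test on equal-length samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are equal")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up per-item deviation scores (lower is better), illustration only.
baseline_scores = [1.0, 2.0, 3.0, 4.0]
combined_scores = [0.5, 1.0, 2.5, 3.0]

t_stat, df = paired_t(baseline_scores, combined_scores)
print(f"t({df}) = {t_stat:.3f}")
```

A positive t here means the baseline deviations are larger than those under the combined prompt. Turning t into a p-value requires a t-distribution table or a statistics library.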

standing simulated objections not resolved
  • Collection of new human data with inter-rater reliability for the precise stimuli used in the LLM experiments

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces ESR and CDS as new metrics defined via direct comparison of model outputs against external human effect sizes and means drawn from prior numerical-precision literature; these definitions are independent of the model results or prompting interventions being evaluated. Prompting conditions are derived from stated pragmatic assumptions about alternatives, speaker knowledge, and motives without any fitting or self-referential adjustment to the target data. No equations, self-citations, or ansatzes are shown to reduce the reported qualitative-structure findings or prompting improvements to the inputs by construction. The central claims therefore remain evaluable against independently sourced baselines and do not collapse into tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on the assumption that social meaning is generated by reasoning over alternatives and by inferring speaker knowledge/motives, plus the validity of the new ESR/CDS metrics for capturing human behavior.

axioms (2)
  • domain assumption Social meaning arises from reasoning over linguistic alternatives
    First pragmatic assumption stated in abstract as basis for prompting conditions.
  • domain assumption Listeners infer speaker knowledge states and communicative motives
    Second pragmatic assumption stated in abstract as basis for prompting conditions.

pith-pipeline@v0.9.0 · 5511 in / 1288 out tokens · 51753 ms · 2026-05-13T21:23:46.768985+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1]

    Introduction. Large language models (LLMs) increasingly exhibit sophisticated forms of pragmatic and social reasoning. Recent work has shown that they can recover conversational implicatures (Ruis et al., 2023; Sravanthi et al., 2024; Scherrer et al., 2024), reason pragmatically about scalar expressions (Cho and Kim, 2024), and produce context-sensitive soci...

  2. [2]

    Theoretical Background. Many instances of social meaning are not directly encoded in linguistic form but emerge from inferential processes listeners apply when interpreting speakers’ utterances (Acton, 2019; Beltrama, 2020). These inferences often concern social attributes of the speaker, including competence, knowledgeability, and communicative intent...

  3. [3]

    The bicycle cost $500

    Behavioral Baseline: Social Inferences from (Im)Precision. We ground our LLM evaluation in Experiment 1 of Solt et al. (2025), which investigates how the choice of numerical precision level conveys social meaning about the speaker, and how this meaning is modulated by the pragmatic requirements of the utterance context. The study’s central question is whet...

  4. [4]

    LLM Evaluation. Models and protocol. We evaluated three frontier LLMs accessed via API: GPT (gpt-4o-mini), Claude (claude-sonnet-4-20250514), and Gemini (gemini-2.5-pro). For each combination of scenario, context, utterance form, and social attribute, models were prompted to rate the speaker on the given attribute using the identical 7-point scale as in the...

  5. [5]

    We assess alignment at three levels

    Evaluation Metrics. Let H and M denote the human and model mean ratings, respectively, for a given attribute, context, and utterance form. We assess alignment at three levels. Global pattern similarity. For each model–prompting condition pair, we measure overall alignment across all H–M pairs using three complementary metrics. The Spearman rank correlation ...

  6. [6]

    Results. Universal Structure, Variable Calibration. Structural alignment is uniformly high across all models and conditions: DAS and ISS equal 1.0 for all attributes with non-zero human effects, indicating perfect reproduction of both main effect polarity and form × context interaction directions. Spearman ρ values range from 0.829 to 0.946, confirming str...

  7. [7]

    Discussion. Structure Without Calibration. Our results demonstrate a systematic dissociation between structural and quantitative alignment. All models achieve perfect directional agreement (DAS = ISS = 1.0) and high rank correlations across all prompting conditions, yet CCC values fall consistently below Spearman ρ, and CDS reveals substantial magnitude deviations. Thi...

  8. [8]

    Conclusion. We investigated whether frontier LLMs approximate human social meaning not only qualitatively but also quantitatively, grounding evaluation in experimentally measured human effect sizes. Across three models and four prompting conditions, all models reliably reproduce the directional structure of human social inference, a finding that is robust a...

  9. [9]

    Ye-eun Cho and Seong mook Kim

    Pragmatic reasoning through semantic inference. Semantics and Pragmatics, 9. Ye-eun Cho and Seong mook Kim. 2024. Pragmatic inference of scalar implicature by LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 10–20, Bangkok, Thailand. Association for Computati...

  10. [10]

    How much did the bicycle cost? I’ll start the paperwork right away

    Academic Press, New York. Luke Hewitt, Ashwini Ashokkumar, Isaias Ghezae, and Robb Willer. 2024. Predicting results of social science experiments using large language models. Working paper, New York University. Jessica Hullman, David Broska, Huaman Sun, and Aaron Shaw. 2025. Validating LLM simulations as behavioral evidence. Preprint. Stephen C. Levinso...
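
The Evaluation Metrics and Discussion excerpts above mention the Spearman rank correlation and report that CCC values fall below Spearman ρ. The sketch below illustrates why that can happen. It assumes CCC refers to Lin's concordance correlation coefficient (the excerpts do not expand the abbreviation), uses population (divide-by-n) moments, and runs on made-up ratings rather than the paper's data.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient (ASSUMED meaning of CCC)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


# Made-up mean ratings: same ordering, but the model doubles the spread.
human = [3.0, 4.0, 5.0]
model = [2.0, 4.0, 6.0]

rho = spearman_rho(human, model)
ccc = lin_ccc(human, model)
print(f"Spearman rho = {rho:.3f}, CCC = {ccc:.3f}")
```

Because the ordering is identical, the rank correlation is perfect, while the concordance coefficient is penalized for the exaggerated spread. A rank-based metric tracks structure, and an agreement-based metric is also sensitive to magnitude.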