Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting
Pith reviewed 2026-05-13 21:23 UTC · model grok-4.3
The pith
Large language models reproduce the structure of human social inferences but differ substantially in their magnitude calibration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
All models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, while prompting for alternative-awareness tends to amplify exaggeration. Combining both components is the only intervention that improves all calibration-sensitive metrics across all models, though fine-grained magnitude calibration remains only partially resolved.
What carries the argument
The Effect Size Ratio (ESR) and Calibration Deviation Score (CDS) metrics, which separate structural fidelity from magnitude calibration, together with prompting conditions derived from reasoning over linguistic alternatives and inferring speaker knowledge states and communicative motives.
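This page does not reproduce the paper's formulas for ESR and CDS, so the following is only an illustrative sketch under assumed definitions: ESR is taken to be the model's effect size divided by the human effect size (values above 1 would indicate exaggeration), and CDS is taken to be the mean absolute gap between model and human mean ratings. The function names and all ratings are hypothetical.

```python
from statistics import mean


def effect_size_ratio(model_effect: float, human_effect: float) -> float:
    """Assumed ESR: model effect divided by human effect (not the paper's confirmed formula)."""
    if human_effect == 0:
        raise ValueError("ESR is undefined when the human effect is zero")
    return model_effect / human_effect


def calibration_deviation(model_means, human_means) -> float:
    """Assumed CDS: mean absolute gap between model and human mean ratings."""
    if len(model_means) != len(human_means):
        raise ValueError("rating lists must have the same length")
    return mean(abs(m - h) for m, h in zip(model_means, human_means))


# Hypothetical mean ratings on a 7-point scale: [precise form, approximate form].
human = [5.0, 4.0]
model = [6.0, 3.0]

human_effect = human[0] - human[1]  # difference between the two forms for humans
model_effect = model[0] - model[1]  # same difference for the model

esr = effect_size_ratio(model_effect, human_effect)
cds = calibration_deviation(model, human)
print(f"ESR = {esr:.2f}, CDS = {cds:.2f}")
```

In this made-up case the model orders the two forms the same way humans do (structure preserved) but triples the size of the effect, which is the kind of dissociation the two metrics are meant to separate.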
If this is right
- Pragmatic prompting that combines alternative-awareness with knowledge and motive reasoning improves calibration metrics across models.
- LLMs capture inferential structure more reliably than inferential strength.
- Magnitude calibration remains only partially resolved even with the best prompting intervention tested.
- Fine-grained adjustments to inference strength require interventions beyond current pragmatic prompting.
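The gap between structure and strength can be illustrated with the two global measures named in the extracted paper text further down, Spearman's rank correlation and CCC. The sketch below assumes CCC refers to Lin's concordance correlation coefficient computed with population moments; the ratings are invented, and the helper names are hypothetical.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient (population moments)."""
    mx, my = mean(x), mean(y)
    n = len(x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return 2 * sxy / (vx + vy + (mx - my) ** 2)


# Hypothetical mean ratings: the model keeps the human ordering
# but doubles every deviation from the midpoint.
human = [3.0, 4.0, 5.0]
model = [2.0, 4.0, 6.0]

rho = spearman(human, model)
ccc = lin_ccc(human, model)
print(f"Spearman rho = {rho:.3f}, CCC = {ccc:.3f}")
```

Here the rank correlation is perfect while the concordance coefficient drops to about 0.8, because CCC penalizes the difference in spread even when the ordering is identical.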
Where Pith is reading between the lines
- The same structural match may appear in other social domains such as politeness or implication.
- Better magnitude calibration could make LLMs more reliable in applications requiring calibrated social advice or negotiation.
- Larger models or different training regimes might reduce the remaining calibration gap without extra prompting.
Load-bearing premise
The ESR and CDS metrics accurately measure human-like social meaning and the numerical precision case study generalizes to other domains of social inference.
What would settle it
A direct comparison of model outputs versus human judgments on a non-numerical social inference task that shows mismatched qualitative structure.
Original abstract
Large language models (LLMs) increasingly exhibit human-like patterns of pragmatic and social reasoning. This paper addresses two related questions: do LLMs approximate human social meaning not only qualitatively but also quantitatively, and can prompting strategies informed by pragmatic theory improve this approximation? To address the first, we introduce two calibration-focused metrics distinguishing structural fidelity from magnitude calibration: the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS). To address the second, we derive prompting conditions from two pragmatic assumptions: that social meaning arises from reasoning over linguistic alternatives, and that listeners infer speaker knowledge states and communicative motives. Applied to a case study on numerical (im)precision across three frontier LLMs, we find that all models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, while prompting for alternative-awareness tends to amplify exaggeration. Combining both components is the only intervention that improves all calibration-sensitive metrics across all models, though fine-grained magnitude calibration remains only partially resolved. LLMs thus capture inferential structure while variably distorting inferential strength, and pragmatic theory provides a useful but incomplete handle for improving that approximation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs reproduce the qualitative structure of human social inferences (via a numerical (im)precision case study across three frontier models) but differ substantially in magnitude calibration. It introduces ESR and CDS metrics to separate structural fidelity from magnitude calibration, derives pragmatic prompting conditions from assumptions about linguistic alternatives and speaker knowledge/motives, and reports that only the joint intervention improves all calibration-sensitive metrics, though fine-grained magnitude calibration remains only partially resolved.
Significance. If the ESR/CDS metrics prove robust, the work offers a useful distinction between inferential structure and strength in LLM social reasoning and shows that pragmatic-theory-informed prompting provides a partial but actionable handle for calibration. The cross-model consistency on qualitative reproduction and the finding that isolated prompting components can amplify exaggeration are potentially valuable for both theoretical accounts of LLM pragmatics and practical prompt engineering.
major comments (3)
- [§3] §3 (Metrics definition): ESR and CDS are computed by direct comparison to human effect sizes and means drawn from prior numerical-precision literature without fresh human data, inter-rater reliability, or item-level variance reported for the exact stimuli; this is load-bearing for the diagnosis of “substantial magnitude deviation” and for the claim that only the combined prompting intervention improves all calibration metrics.
- [Results] Results section (and abstract): The manuscript reports consistent directional findings across models but provides no sample sizes, statistical tests, exact formulas for ESR/CDS, or controls for prompt sensitivity; without these, it is unclear whether the data support the magnitude-calibration claims or the assertion that the joint intervention is uniquely effective.
- [§5] §5 (Prompting evaluation): The claim that “prompting for alternative-awareness tends to amplify exaggeration” while knowledge/motive reasoning reduces deviation rests on the chosen reference baselines; if those baselines contain high context-dependent variance, the differential effects of the two prompting components could be artifacts rather than model properties.
minor comments (2)
- [§3] The exact algebraic definitions of ESR and CDS should appear in the main text (not only appendix) with a worked numerical example for one stimulus.
- [Figures/Tables] Figure captions and tables should explicitly state the number of items, models, and prompt variants underlying each bar or cell.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below, indicating where revisions will be made to improve clarity, transparency, and robustness.
- Referee: [§3] §3 (Metrics definition): ESR and CDS are computed by direct comparison to human effect sizes and means drawn from prior numerical-precision literature without fresh human data, inter-rater reliability, or item-level variance reported for the exact stimuli; this is load-bearing for the diagnosis of “substantial magnitude deviation” and for the claim that only the combined prompting intervention improves all calibration metrics.
  Authors: We used established human benchmarks from the numerical-precision literature cited in §2 because the study focuses on how LLMs approximate well-documented human patterns rather than re-collecting human judgments. We will revise §3 to include the exact formulas for ESR and CDS, report any item-level statistics available from the source papers, and explicitly discuss the limitations of relying on secondary human data (including the lack of new inter-rater reliability for our exact stimuli). This addresses the load-bearing concern while preserving the case-study design.
  revision: partial
- Referee: [Results] Results section (and abstract): The manuscript reports consistent directional findings across models but provides no sample sizes, statistical tests, exact formulas for ESR/CDS, or controls for prompt sensitivity; without these, it is unclear whether the data support the magnitude-calibration claims or the assertion that the joint intervention is uniquely effective.
  Authors: We agree that greater methodological transparency is needed. In the revised manuscript we will report exact sample sizes (number of generations per condition), provide the mathematical definitions of ESR and CDS in §3, include statistical tests comparing conditions (e.g., paired t-tests on CDS scores), and add controls for prompt sensitivity by reporting results across multiple prompt phrasings and temperature settings. These additions will directly support the claims regarding the joint intervention.
  revision: yes
- Referee: [§5] §5 (Prompting evaluation): The claim that “prompting for alternative-awareness tends to amplify exaggeration” while knowledge/motive reasoning reduces deviation rests on the chosen reference baselines; if those baselines contain high context-dependent variance, the differential effects of the two prompting components could be artifacts rather than model properties.
  Authors: The baselines were standard zero-shot and chain-of-thought prompts chosen to isolate the pragmatic components. The directional patterns held consistently across three frontier models, which argues against a pure artifact. Nevertheless, we will add an explicit analysis of baseline variance and include additional baseline variants in the revision to further demonstrate that the differential effects are not driven by context-dependent noise in the reference conditions.
  revision: partial
- Collection of new human data with inter-rater reliability for the precise stimuli used in the LLM experiments
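The paired comparison the authors propose in their response to the Results comment (paired t-tests on CDS scores) could be sketched as follows. This is a minimal illustration, not the authors' analysis: the per-item CDS values are invented, the function name is hypothetical, and only the t statistic and degrees of freedom are computed, since the standard library has no t-distribution for a p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Paired t statistic and degrees of freedom for matched samples."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two matched samples with at least two pairs")
    diffs = [b - a for b, a in zip(before, after)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are identical")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-item CDS values for the same items under two conditions.
baseline_cds = [1.0, 2.0, 3.0, 2.0]
combined_cds = [0.5, 1.0, 1.5, 1.0]

t_stat, df = paired_t(baseline_cds, combined_cds)
print(f"t({df}) = {t_stat:.3f}")
```

A positive statistic here means CDS is lower under the combined prompt than under the baseline for the matched items; whether that difference is reliable would still depend on the sample sizes the referee asks the authors to report.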
Circularity Check
No significant circularity in derivation chain
The paper introduces ESR and CDS as new metrics defined via direct comparison of model outputs against external human effect sizes and means drawn from prior numerical-precision literature; these definitions are independent of the model results or prompting interventions being evaluated. Prompting conditions are derived from stated pragmatic assumptions about alternatives, speaker knowledge, and motives without any fitting or self-referential adjustment to the target data. No equations, self-citations, or ansatzes are shown to reduce the reported qualitative-structure findings or prompting improvements to the inputs by construction. The central claims therefore remain evaluable against independently sourced baselines and do not collapse into tautology.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Social meaning arises from reasoning over linguistic alternatives
- domain assumption: Listeners infer speaker knowledge states and communicative motives
Lean theorems connected to this paper
- IndisputableMonolith/Cost · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  We introduce two calibration-focused metrics distinguishing structural fidelity from magnitude calibration: the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS).
- IndisputableMonolith/Foundation/ArithmeticFromLogic · absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Prompting conditions derived from reasoning over linguistic alternatives and speaker knowledge states (RSA framework).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Introduction. Large language models (LLMs) increasingly exhibit sophisticated forms of pragmatic and social reasoning. Recent work has shown that they can recover conversational implicatures (Ruis et al., 2023; Sravanthi et al., 2024; Scherrer et al., 2024), reason pragmatically about scalar expressions (Cho and Kim, 2024), and produce context-sensitive soci...
- [2] Theoretical Background. Many instances of social meaning are not directly encoded in linguistic form but emerge from inferential processes listeners apply when interpreting speakers’ utterances (Acton, 2019; Beltrama, 2020). These inferences often concern social attributes of the speaker, including competence, knowledgeability, and communicative intent...
- [3] Behavioral Baseline: Social Inferences from (Im)Precision. We ground our LLM evaluation in Experiment 1 of Solt et al. (2025), which investigates how the choice of numerical precision level conveys social meaning about the speaker, and how this meaning is modulated by the pragmatic requirements of the utterance context. The study’s central question is whet...
- [4] LLM Evaluation. Models and protocol. We evaluated three frontier LLMs accessed via API: GPT (gpt-4o-mini), Claude (claude-sonnet-4-20250514), and Gemini (gemini-2.5-pro). For each combination of scenario, context, utterance form, and social attribute, models were prompted to rate the speaker on the given attribute using the identical 7-point scale as in the...
- [5] We assess alignment at three levels
  Evaluation Metrics. Let H and M denote the human and model mean ratings, respectively, for a given attribute, context, and utterance form. We assess alignment at three levels. Global pattern similarity. For each model–prompting condition pair, we measure overall alignment across all H–M pairs using three complementary metrics. The Spearman rank correlation ...
- [6] Results. Universal Structure, Variable Calibration. Structural alignment is uniformly high across all models and conditions: DAS and ISS equal 1.0 for all attributes with non-zero human effects, indicating perfect reproduction of both main effect polarity and form × context interaction directions. Spearman ρ values range from 0.829 to 0.946, confirming str...
- [7] Discussion. Structure Without Calibration. Our results demonstrate a systematic dissociation between structural and quantitative alignment. All models achieve perfect directional agreement (DAS = ISS = 1.0) and high rank correlations across all prompting conditions, yet CCC values fall consistently below Spearman ρ, and CDS reveals substantial magnitude deviations. Thi...
- [8] Conclusion. We investigated whether frontier LLMs approximate human social meaning not only qualitatively but also quantitatively, grounding evaluation in experimentally measured human effect sizes. Across three models and four prompting conditions, all models reliably reproduce the directional structure of human social inference, a finding that is robust a...
- [9] Pragmatic reasoning through semantic inference. Semantics and Pragmatics, 9. Ye-eun Cho and Seong mook Kim. 2024. Pragmatic inference of scalar implicature by LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 10–20, Bangkok, Thailand. Association for Computati...
- [10] How much did the bicycle cost? I’ll start the paperwork right away
  Academic Press, New York. Luke Hewitt, Ashwini Ashokkumar, Isaias Ghezae, and Robb Willer. 2024. Predicting results of social science experiments using large language models. Working paper, New York University. Jessica Hullman, David Broska, Huaman Sun, and Aaron Shaw. 2025. Validating LLM simulations as behavioral evidence. Preprint. Stephen C. Levinso...