Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting

Roland Mühlenbernd

arxiv: 2604.02512 · v1 · submitted 2026-04-02 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-13 21:23 UTC · model grok-4.3

keywords: large language models, social inferences, pragmatic prompting, calibration metrics, effect size ratio, magnitude calibration, numerical imprecision, speaker knowledge

The pith

Large language models reproduce the structure of human social inferences but differ substantially in their magnitude calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs approximate human social meaning both qualitatively and quantitatively. It introduces two metrics, the Effect Size Ratio and Calibration Deviation Score, to separate structural fidelity from magnitude accuracy. Using a case study on numerical imprecision, it applies prompting conditions drawn from pragmatic assumptions about linguistic alternatives and speaker knowledge states. All three models tested match the qualitative human patterns reliably, yet their quantitative strength varies, and only the combined prompting intervention improves every calibration metric, though not to full human alignment. This shows LLMs capture inferential form while distorting inferential force to varying degrees.

Core claim

All models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, while prompting for alternative-awareness tends to amplify exaggeration. Combining both components is the only intervention that improves all calibration-sensitive metrics across all models, though fine-grained magnitude calibration remains only partially resolved.
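The four prompting conditions can be pictured as a two-by-two crossing of the two pragmatic components. The sketch below assumes the conditions are a plain baseline, alternative-awareness only, knowledge and motive reasoning only, and both combined. The instruction wording, the `build_prompt` helper, and the condition names are all hypothetical and are not taken from the paper.

```python
# Hypothetical instruction texts; the paper's actual prompts are not shown here.
BASE = "Rate the speaker on the attribute '{attribute}' on a scale from 1 to 7."
ALTERNATIVES = "Consider what else the speaker could have said instead."
KNOWLEDGE = "Consider what the speaker knows and why they chose this wording."


def build_prompt(attribute, use_alternatives, use_knowledge):
    """Prepend the selected pragmatic components to the rating instruction."""
    parts = []
    if use_alternatives:
        parts.append(ALTERNATIVES)
    if use_knowledge:
        parts.append(KNOWLEDGE)
    parts.append(BASE.format(attribute=attribute))
    return " ".join(parts)


# Assumed 2x2 design: (use_alternatives, use_knowledge) -> condition name.
conditions = {
    (False, False): "baseline",
    (True, False): "alternatives",
    (False, True): "knowledge",
    (True, True): "combined",
}

prompts = {
    name: build_prompt("competent", *flags)
    for flags, name in conditions.items()
}

for name, text in prompts.items():
    print(f"{name}: {text}")
```

Keeping the components as separate strings makes it possible to attribute any change in calibration to one component or to their combination.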

What carries the argument

The Effect Size Ratio (ESR) and Calibration Deviation Score (CDS) metrics, which separate structural fidelity from magnitude calibration, together with prompting conditions derived from reasoning over linguistic alternatives and inferring speaker knowledge states and communicative motives.
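This page does not reproduce the formulas for ESR or CDS, so the following is only a sketch under assumed definitions: ESR is taken to be the model's effect divided by the human effect, and CDS the mean absolute gap between model and human mean ratings. The ratings are invented for illustration and the function names are hypothetical.

```python
from statistics import mean

# Invented mean ratings on a 7-point scale for one attribute.
# Keys are utterance forms; values are mean ratings of the speaker.
human = {"precise": 5.0, "round": 4.0}
model = {"precise": 6.5, "round": 3.5}


def effect(ratings):
    """Effect of precision: mean rating for the precise form minus the round form."""
    return ratings["precise"] - ratings["round"]


def effect_size_ratio(model_ratings, human_ratings):
    """ASSUMED definition: model effect divided by human effect.

    A positive value means the direction matches the human effect;
    a value above 1 means the model exaggerates it.
    """
    human_effect = effect(human_ratings)
    if human_effect == 0:
        raise ValueError("ratio is undefined when the human effect is zero")
    return effect(model_ratings) / human_effect


def calibration_deviation(model_ratings, human_ratings):
    """ASSUMED definition: mean absolute model-human gap across cells."""
    return mean(abs(model_ratings[k] - human_ratings[k]) for k in human_ratings)


esr = effect_size_ratio(model, human)
cds = calibration_deviation(model, human)
print(f"ESR = {esr:.2f}, CDS = {cds:.2f}")
```

In this toy case the model gets the direction right (positive ratio) but triples the size of the effect, which is the kind of split between structure and magnitude the metrics are meant to expose.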

If this is right

Pragmatic prompting that combines alternative-awareness with knowledge and motive reasoning improves calibration metrics across models.
LLMs capture inferential structure more reliably than inferential strength.
Magnitude calibration remains only partially resolved even with the best prompting intervention tested.
Fine-grained adjustments to inference strength require interventions beyond current pragmatic prompting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structural match may appear in other social domains such as politeness or implication.
Better magnitude calibration could make LLMs more reliable in applications requiring calibrated social advice or negotiation.
Larger models or different training regimes might reduce the remaining calibration gap without extra prompting.

Load-bearing premise

The ESR and CDS metrics accurately measure human-like social meaning and the numerical precision case study generalizes to other domains of social inference.

What would settle it

A direct comparison of model outputs versus human judgments on a non-numerical social inference task that shows mismatched qualitative structure.

Figures

Figures reproduced from arXiv: 2604.02512 by Roland Mühlenbernd.

**Figure 2.** Effect Size Ratios (ESR) per model, prompting condition, and benchmark effect. Rows are ...

Original abstract

Large language models (LLMs) increasingly exhibit human-like patterns of pragmatic and social reasoning. This paper addresses two related questions: do LLMs approximate human social meaning not only qualitatively but also quantitatively, and can prompting strategies informed by pragmatic theory improve this approximation? To address the first, we introduce two calibration-focused metrics distinguishing structural fidelity from magnitude calibration: the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS). To address the second, we derive prompting conditions from two pragmatic assumptions: that social meaning arises from reasoning over linguistic alternatives, and that listeners infer speaker knowledge states and communicative motives. Applied to a case study on numerical (im)precision across three frontier LLMs, we find that all models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, while prompting for alternative-awareness tends to amplify exaggeration. Combining both components is the only intervention that improves all calibration-sensitive metrics across all models, though fine-grained magnitude calibration remains only partially resolved. LLMs thus capture inferential structure while variably distorting inferential strength, and pragmatic theory provides a useful but incomplete handle for improving that approximation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLMs match the direction of human social inferences on numerical precision but miscalibrate the strength, and the paper's ESR and CDS metrics plus theory-derived prompts give a workable way to measure and partially correct that split.

The paper's core finding is that three frontier LLMs reproduce the qualitative structure of human inferences about numerical (im)precision but differ in how strongly they apply it. ESR tracks whether the effect sizes line up with human data, while CDS measures overall calibration deviation. Prompting for speaker knowledge and motives reduces the magnitude gap most reliably, alternative-awareness prompting can increase exaggeration, and only the combination improves every calibration metric across models. Fine-grained calibration still falls short.

This split between structure and magnitude is the useful distinction the work draws. The prompting conditions come directly from two pragmatic assumptions about alternatives and knowledge states, which keeps the interventions interpretable rather than ad hoc. The cross-model consistency on the combined prompt is a clear, replicable pattern worth having on record.

The main limitation is that both metrics are scored against human baselines taken from existing numerical-precision studies. No new human data, item-level variance, or reliability checks are reported for the exact stimuli, so the size of the reported deviations and the size of the prompting gains could move if those reference numbers are context-sensitive or noisy. The abstract also omits sample sizes, statistical tests, and prompt-sensitivity controls, which leaves the strength of the cross-model claims hard to gauge.

This work is aimed at researchers who evaluate or tune LLMs for pragmatic and social reasoning tasks. Anyone building dialogue systems or decision-support tools that need reliable implied meaning would find the metrics and the prompting results practical. It deserves a serious referee because the metrics are new, the theoretical grounding is explicit, and the directional results are consistent, even though the human baseline validation and experimental details will need tightening in revision.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs reproduce the qualitative structure of human social inferences (via a numerical (im)precision case study across three frontier models) but differ substantially in magnitude calibration. It introduces ESR and CDS metrics to separate structural fidelity from magnitude calibration, derives pragmatic prompting conditions from assumptions about linguistic alternatives and speaker knowledge/motives, and reports that only the joint intervention improves all calibration-sensitive metrics, though fine-grained magnitude calibration remains only partially resolved.

Significance. If the ESR/CDS metrics prove robust, the work offers a useful distinction between inferential structure and strength in LLM social reasoning and shows that pragmatic-theory-informed prompting provides a partial but actionable handle for calibration. The cross-model consistency on qualitative reproduction and the finding that isolated prompting components can amplify exaggeration are potentially valuable for both theoretical accounts of LLM pragmatics and practical prompt engineering.

major comments (3)

[§3] §3 (Metrics definition): ESR and CDS are computed by direct comparison to human effect sizes and means drawn from prior numerical-precision literature without fresh human data, inter-rater reliability, or item-level variance reported for the exact stimuli; this is load-bearing for the diagnosis of “substantial magnitude deviation” and for the claim that only the combined prompting intervention improves all calibration metrics.
[Results] Results section (and abstract): The manuscript reports consistent directional findings across models but provides no sample sizes, statistical tests, exact formulas for ESR/CDS, or controls for prompt sensitivity; without these, it is unclear whether the data support the magnitude-calibration claims or the assertion that the joint intervention is uniquely effective.
[§5] §5 (Prompting evaluation): The claim that “prompting for alternative-awareness tends to amplify exaggeration” while knowledge/motive reasoning reduces deviation rests on the chosen reference baselines; if those baselines contain high context-dependent variance, the differential effects of the two prompting components could be artifacts rather than model properties.

minor comments (2)

[§3] The exact algebraic definitions of ESR and CDS should appear in the main text (not only appendix) with a worked numerical example for one stimulus.
[Figures/Tables] Figure captions and tables should explicitly state the number of items, models, and prompt variants underlying each bar or cell.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, indicating where revisions will be made to improve clarity, transparency, and robustness.

Referee: [§3] §3 (Metrics definition): ESR and CDS are computed by direct comparison to human effect sizes and means drawn from prior numerical-precision literature without fresh human data, inter-rater reliability, or item-level variance reported for the exact stimuli; this is load-bearing for the diagnosis of “substantial magnitude deviation” and for the claim that only the combined prompting intervention improves all calibration metrics.

Authors: We used established human benchmarks from the numerical-precision literature cited in §2 because the study focuses on how LLMs approximate well-documented human patterns rather than re-collecting human judgments. We will revise §3 to include the exact formulas for ESR and CDS, report any item-level statistics available from the source papers, and explicitly discuss the limitations of relying on secondary human data (including the lack of new inter-rater reliability for our exact stimuli). This addresses the load-bearing concern while preserving the case-study design.
revision: partial

Referee: [Results] Results section (and abstract): The manuscript reports consistent directional findings across models but provides no sample sizes, statistical tests, exact formulas for ESR/CDS, or controls for prompt sensitivity; without these, it is unclear whether the data support the magnitude-calibration claims or the assertion that the joint intervention is uniquely effective.

Authors: We agree that greater methodological transparency is needed. In the revised manuscript we will report exact sample sizes (number of generations per condition), provide the mathematical definitions of ESR and CDS in §3, include statistical tests comparing conditions (e.g., paired t-tests on CDS scores), and add controls for prompt sensitivity by reporting results across multiple prompt phrasings and temperature settings. These additions will directly support the claims regarding the joint intervention.
revision: yes

Referee: [§5] §5 (Prompting evaluation): The claim that “prompting for alternative-awareness tends to amplify exaggeration” while knowledge/motive reasoning reduces deviation rests on the chosen reference baselines; if those baselines contain high context-dependent variance, the differential effects of the two prompting components could be artifacts rather than model properties.

Authors: The baselines were standard zero-shot and chain-of-thought prompts chosen to isolate the pragmatic components. The directional patterns held consistently across three frontier models, which argues against a pure artifact. Nevertheless, we will add an explicit analysis of baseline variance and include additional baseline variants in the revision to further demonstrate that the differential effects are not driven by context-dependent noise in the reference conditions.
revision: partial
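The paired t-test on CDS scores mentioned in the second response could be sketched as follows. The per-item deviation scores are invented, the `paired_t` helper is hypothetical, and only the test statistic and degrees of freedom are computed, since the Python standard library has no t-distribution for the p-value.

```python
from statistics import mean, stdev

# Invented per-item deviation scores for the same five items under
# two prompting conditions (lower means closer to the human baseline).
baseline = [1.2, 0.9, 1.5, 1.1, 1.3]
combined = [0.8, 0.7, 1.0, 0.9, 0.8]


def paired_t(a, b):
    """Return the paired t statistic and degrees of freedom for a minus b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = stdev(diffs) / n ** 0.5  # sample SD of the differences
    return mean(diffs) / standard_error, n - 1


t_stat, dof = paired_t(baseline, combined)
print(f"t({dof}) = {t_stat:.3f}")
```

A paired design fits here because each item is scored under both conditions, so item-level difficulty cancels out of the differences.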

standing simulated objections not resolved

Collection of new human data with inter-rater reliability for the precise stimuli used in the LLM experiments

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces ESR and CDS as new metrics defined via direct comparison of model outputs against external human effect sizes and means drawn from prior numerical-precision literature; these definitions are independent of the model results or prompting interventions being evaluated. Prompting conditions are derived from stated pragmatic assumptions about alternatives, speaker knowledge, and motives without any fitting or self-referential adjustment to the target data. No equations, self-citations, or ansatzes are shown to reduce the reported qualitative-structure findings or prompting improvements to the inputs by construction. The central claims therefore remain evaluable against independently sourced baselines and do not collapse into tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on the assumption that social meaning is generated by reasoning over alternatives and by inferring speaker knowledge/motives, plus the validity of the new ESR/CDS metrics for capturing human behavior.

axioms (2)

domain assumption Social meaning arises from reasoning over linguistic alternatives
First pragmatic assumption stated in abstract as basis for prompting conditions.
domain assumption Listeners infer speaker knowledge states and communicative motives
Second pragmatic assumption stated in abstract as basis for prompting conditions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

We introduce two calibration-focused metrics distinguishing structural fidelity from magnitude calibration: the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS).

IndisputableMonolith/Foundation/ArithmeticFromLogic absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear

Prompting conditions derived from reasoning over linguistic alternatives and speaker knowledge states (RSA framework).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

[1]

Introduction Large language models (LLMs) increasingly exhibit sophisticated forms of pragmatic and social reasoning. Recent work has shown that they can recover conversational implicatures (Ruis et al., 2023; Sravanthi et al., 2024; Scherrer et al., 2024), reason pragmatically about scalar expressions (Cho and Kim, 2024), and produce context-sensitive soci...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Theoretical Background Many instances of social meaning are not directly encoded in linguistic form but emerge from inferential processes listeners apply when interpreting speakers’ utterances (Acton, 2019; Beltrama, 2020). These inferences often concern social attributes of the speaker, including competence, knowledgeability, and communicative intent...

work page 2019
[3]

The bicycle cost $500

Behavioral Baseline: Social Inferences from (Im)Precision We ground our LLM evaluation in Experiment 1 of Solt et al. (2025), which investigates how the choice of numerical precision level conveys social meaning about the speaker, and how this meaning is modulated by the pragmatic requirements of the utterance context. The study’s central question is whet...

work page 2025
[4]

LLM Evaluation Models and protocol. We evaluated three frontier LLMs accessed via API: • GPT (gpt-4o-mini) • Claude (claude-sonnet-4-20250514) • Gemini (gemini-2.5-pro) For each combination of scenario, context, utterance form, and social attribute, models were prompted to rate the speaker on the given attribute using the identical 7-point scale as in the...

work page 2022
[5]

We assess alignment at three levels

Evaluation Metrics Let H and M denote the human and model mean ratings, respectively, for a given attribute, context, and utterance form. We assess alignment at three levels. Global pattern similarity. For each model–prompting condition pair, we measure overall alignment across all H–M pairs using three complementary metrics. The Spearman rank correlation ...

work page 1989
[6]

Results Universal Structure, Variable Calibration. Structural alignment is uniformly high across all models and conditions: DAS and ISS equal 1.0 for all attributes with non-zero human effects, indicating perfect reproduction of both main effect polarity and form × context interaction directions. Spearman ρ values range from 0.829 to 0.946, confirming str...

work page
[7]

Discussion Structure Without Calibration. Our results demonstrate a systematic dissociation between structural and quantitative alignment. All models achieve perfect directional agreement (DAS = ISS = 1.0) and high rank correlations across all prompting conditions, yet CCC values fall consistently below Spearman ρ, and CDS reveals substantial magnitude deviations. Thi...

work page 2012
[8]

Conclusion We investigated whether frontier LLMs approximate human social meaning not only qualitatively but also quantitatively, grounding evaluation in experimentally measured human effect sizes. Across three models and four prompting conditions, all models reliably reproduce the directional structure of human social inference, a finding that is robust a...

work page 2025
[9]

Ye-eun Cho and Seong mook Kim

Pragmatic reasoning through semantic inference. Semantics and Pragmatics, 9. Ye-eun Cho and Seong mook Kim. 2024. Pragmatic inference of scalar implicature by LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 10–20, Bangkok, Thailand. Association for Computati...

work page 2024
[10]

How much did the bicycle cost? I’ll start the paperwork right away

Academic Press, New York. Luke Hewitt, Ashwini Ashokkumar, Isaias Ghezae, and Robb Willer. 2024. Predicting results of social science experiments using large language models. Working paper, New York University. Jessica Hullman, David Broska, Huaman Sun, and Aaron Shaw. 2025. Validating LLM simulations as behavioral evidence. Preprint. Stephen C. Levinso...

work page 2024
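The excerpts from the Evaluation Metrics and Discussion sections note that CCC falls below Spearman ρ. A minimal sketch of why that can happen, assuming CCC denotes Lin's concordance correlation coefficient (the excerpt is truncated before defining it): rank correlation ignores shifts and rescaling, while concordance penalizes both. The ratings are invented, and the helper functions are plain-Python stand-ins rather than the paper's code.

```python
from statistics import mean


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / (sx * sy) ** 0.5


def spearman(x, y):
    """Spearman rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def ccc(x, y):
    """Lin's concordance correlation coefficient (population moments)."""
    mx, my = mean(x), mean(y)
    n = len(x)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)


human = [3.0, 4.0, 5.0, 6.0]
model = [1.0, 3.0, 5.0, 7.0]  # same ordering, but doubled spread and shifted mean

rho = spearman(human, model)
concordance = ccc(human, model)
print(f"Spearman rho = {rho:.3f}, CCC = {concordance:.3f}")
```

Here the model preserves the human ordering exactly, so ρ is at its maximum, while the exaggerated spread pulls the concordance down.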
