arxiv: 2605.09042 · v1 · submitted 2026-05-09 · 💻 cs.CL

Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity

Pith reviewed 2026-05-12 03:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords pragmatic reasoning · large language models · scalar diversity · scalar implicature · evaluation methods · prompting strategies · probability distributions · metalinguistic judgments

The pith

Pragmatic reasoning in large language models arises from the interplay between internal probability distributions and specific evaluation prompts rather than from any fixed underlying competence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs exhibit pragmatic inference by measuring how they handle scalar terms like 'some' versus 'all' under two different evaluation approaches: reading off probabilities directly from the model and asking the model to judge sentences explicitly. It finds that neither approach reliably produces better results than the other and that clear patterns of scalar diversity, the graded variation in how readily different scalar terms give rise to implicatures, appear only in particular models under particular prompting conditions. This matters for anyone trying to assess what LLMs actually understand about language because it shows that apparent pragmatic ability is not a stable property of the model but depends heavily on how the test is set up. If the findings hold, claims about LLM pragmatics must be qualified by the exact measurement method used.

Core claim

Comparing direct probability measurement with metalinguistic prompting across multiple models and settings shows that neither method consistently outperforms the other. Pragmatic behavior varies substantially by model family, prompting strategy, and task structure, with scalar diversity gradients emerging only in specific combinations. These results indicate that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence that any single evaluation paradigm can capture.
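
As a rough illustration of the two paradigms contrasted above, the sketch below scores a single item both ways. It is not the paper's code: the log-probability table, the example sentences, and the function names (`direct_choice`, `direct_probability`, `metalinguistic_choice`) are hypothetical stand-ins for a real model's token log-probabilities and generated replies.

```python
import math

# Hypothetical log-probabilities a model might assign to two continuations
# of "Some of the students passed. So, ..." (values invented for illustration).
TOY_LOGPROBS = {
    "not all of the students passed.": -2.1,
    "all of the students passed.": -3.4,
}


def direct_choice(logprobs):
    """Direct measurement: pick the continuation with the highest log-probability."""
    return max(logprobs, key=logprobs.get)


def direct_probability(logprobs, target):
    """Renormalise over the listed alternatives and return P(target)."""
    total = sum(math.exp(value) for value in logprobs.values())
    return math.exp(logprobs[target]) / total


def metalinguistic_choice(reply):
    """Metalinguistic prompting: parse an explicit yes/no judgment from generated text.

    Returns True for "yes", False for "no", and None when the reply is unparseable.
    """
    answer = reply.strip().lower()
    if answer.startswith("yes"):
        return True
    if answer.startswith("no"):
        return False
    return None
```

The point of keeping the two paths separate is that they can disagree on the same item: the direct path never sees an instruction, while the metalinguistic path depends on how the question is worded and on how the reply is parsed.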

What carries the argument

Scalar diversity, used as a graded diagnostic that tracks how strongly models favor stronger scalar expressions over weaker ones in contexts that license implicature.
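
One way to make a "gradient" concrete is a rank correlation between per-scale inference rates from a model and a reference ordering of the same scales. The sketch below is an assumption about how such a check could look, not the paper's metric: the scale names, every rate, and the helpers `ranks`, `pearson`, `spearman` and `gradient_strength` are invented for illustration.

```python
# Invented per-scale inference rates (fraction of items where the implicature is drawn).
REFERENCE_RATES = {"some/all": 0.9, "possible/certain": 0.6, "warm/hot": 0.3, "good/excellent": 0.1}
MODEL_RATES = {"some/all": 0.8, "possible/certain": 0.7, "warm/hot": 0.2, "good/excellent": 0.1}


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation computed as Pearson on ranks."""
    return pearson(ranks(x), ranks(y))


def gradient_strength(model_rates, reference_rates):
    """Rank agreement between a model's per-scale rates and the reference rates."""
    scales = sorted(reference_rates)
    return spearman([model_rates[s] for s in scales], [reference_rates[s] for s in scales])
```

A model whose rates are identical across scales has no ranking to compare, so this sketch reports 0.0 for it rather than raising an error; a real analysis might prefer to flag such cases separately.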

If this is right

  • Evaluation design must be treated as a central variable when interpreting any observed pragmatic abilities in LLMs.
  • Direct probability measures and metalinguistic prompts can produce divergent pictures of the same model's behavior.
  • Pragmatic patterns are not uniform across model families or task formats.
  • Scalar diversity is observable only under particular model-condition pairings rather than as a general property.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future tests of LLM pragmatics may need to combine probability readout with controlled prompting to separate internal knowledge from surface task effects.
  • The findings suggest that training data statistics alone do not guarantee stable pragmatic behavior once prompting changes.
  • Similar graded diagnostics could be applied to other pragmatic phenomena to check whether the same interaction between representation and task appears.

Load-bearing premise

That the degree of scalar diversity reliably separates genuine pragmatic inference from behavior produced merely by the evaluation task itself.

What would settle it

A replication in which scalar diversity gradients appear at comparable strength across every tested model family and both probability-based and prompting-based measurement methods.
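
The criterion above can be phrased as a simple check over a grid of model and method cells. The following is a minimal sketch under stated assumptions: each cell already carries one number summarising gradient strength, and the thresholds `min_strength` and `max_spread`, the model names, and all values are hypothetical rather than taken from the paper.

```python
def gradient_is_uniform(strengths, min_strength=0.5, max_spread=0.2):
    """True if every (model, method) cell shows a gradient of comparable strength.

    `strengths` maps (model, method) pairs to a gradient-strength score.
    Both thresholds are illustrative defaults, not values from the paper.
    """
    values = list(strengths.values())
    strong_everywhere = min(values) >= min_strength
    similar_everywhere = max(values) - min(values) <= max_spread
    return strong_everywhere and similar_everywhere


# Invented scores: one cell lacks a gradient, so the pattern is condition-specific.
CONDITION_SPECIFIC = {
    ("model_a", "direct"): 0.62,
    ("model_a", "metalinguistic"): 0.58,
    ("model_b", "direct"): 0.66,
    ("model_b", "metalinguistic"): 0.12,
}

# Invented scores: every cell shows a gradient of similar size.
UNIFORM = {
    ("model_a", "direct"): 0.62,
    ("model_a", "metalinguistic"): 0.58,
    ("model_b", "direct"): 0.66,
    ("model_b", "metalinguistic"): 0.60,
}
```

Under these toy numbers only the second grid would count as the kind of replication described above; the first matches the condition-specific pattern the paper reports.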

Figures

Figures reproduced from arXiv: 2605.09042 by Ye-eun Cho.

Figure 1: Overview of the two evaluation paradigms
Figure 2: Overall accuracy of scalar inference predictions across models and prompting conditions for Experiments
Figure 3: Item-level accuracy across scalar items for each model and evaluation condition in Experiment B
Figure 4: Item-level accuracy across scalar items for each model and evaluation condition in Experiment A

Original abstract

Evaluating pragmatic reasoning in large language models (LLMs) remains challenging because model behavior can vary depending on evaluation methods. Previous studies suggest that prompt-based judgments may diverge from models' internal probability distributions, raising questions about whether observed performance reflects underlying competence or task-induced behavior. This study examines this issue using scalar diversity as a graded diagnostic for pragmatic inference. Following Hu & Levy (2023), this study compares direct probability measurement and metalinguistic prompting across multiple models and experimental settings. The results show that neither evaluation method consistently outperforms the other and that pragmatic behavior varies substantially across model families, prompting strategies, and task structures. Moreover, scalar diversity gradients emerge only in specific model-condition combinations, suggesting that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence captured by a single evaluation paradigm. These findings highlight the central role of evaluation design in interpreting pragmatic abilities in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates pragmatic reasoning in LLMs using scalar diversity as a graded diagnostic. Following Hu & Levy (2023), it compares direct probability measurement against metalinguistic prompting across multiple models, prompting strategies, and task structures. The central results are that neither method consistently outperforms the other, pragmatic behavior varies substantially across model families and conditions, and scalar diversity gradients appear only in specific model-condition combinations. The author concludes that pragmatic reasoning reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable, method-independent competence.

Significance. If the empirical patterns hold after addressing missing details and alternative explanations, the work would usefully demonstrate the sensitivity of LLM pragmatic assessments to evaluation design. It extends prior scalar-diversity diagnostics to a multi-method comparison and provides evidence that single-paradigm evaluations may mischaracterize underlying abilities. This could encourage more careful, multi-method protocols in future studies of linguistic competence in LLMs.

major comments (2)
  1. [Abstract] The claim that scalar diversity gradients emerge only in specific model-condition combinations (and that this supports an interaction interpretation) is presented at a high level without sample sizes, statistical tests, or data exclusion criteria. This prevents verification of whether the reported condition-specific patterns are robust or driven by the factors the author highlights.
  2. [Results/Discussion] The central interpretation that pragmatic behavior reflects an interaction between internal representations and task-induced prompting (rather than stable competence) assumes scalar diversity specifically isolates pragmatic inference. The manuscript provides no indication of controls for alternative factors such as token frequency, training-data co-occurrence statistics, or uniform prompt-induced probability shifts that could produce similar graded patterns on scalar terms.
minor comments (1)
  1. [Methods] Ensure all model versions, exact prompt templates, and the precise definition of scalar diversity (including any adaptations from Hu & Levy 2023) are reported in full in the methods section to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address each major point below, indicating planned revisions where appropriate.

  1. Referee: [Abstract] The claim that scalar diversity gradients emerge only in specific model-condition combinations (and that this supports an interaction interpretation) is presented at a high level without sample sizes, statistical tests, or data exclusion criteria. This prevents verification of whether the reported condition-specific patterns are robust or driven by the factors the author highlights.

    Authors: We agree that the abstract presents the central claim concisely without methodological specifics. The main text details the experimental scope (multiple models, prompting strategies, and task structures), reports statistical tests for gradient significance and interactions, and specifies data exclusion rules. To improve verifiability from the abstract alone, we will revise it to briefly reference the statistical support for the condition-specific patterns while remaining within length limits. revision: yes

  2. Referee: [Results/Discussion] The central interpretation that pragmatic behavior reflects an interaction between internal representations and task-induced prompting (rather than stable competence) assumes scalar diversity specifically isolates pragmatic inference. The manuscript provides no indication of controls for alternative factors such as token frequency, training-data co-occurrence statistics, or uniform prompt-induced probability shifts that could produce similar graded patterns on scalar terms.

    Authors: This concern is well-taken. Our design follows the scalar-diversity paradigm of Hu & Levy (2023) to probe graded pragmatic inference. Although we did not include explicit controls for token frequency or co-occurrence, the key finding that scalar diversity gradients appear only under certain method-model combinations (and not uniformly) is difficult to attribute solely to stable factors like frequency, which would be constant across direct-probability and metalinguistic conditions. We will add a paragraph in the Discussion explicitly acknowledging these alternative explanations and describing how future studies could control for them (e.g., frequency-matched item sets or prompt-ablation experiments). revision: yes
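
The frequency-matched item sets mentioned in the response above could be built roughly as follows. This is a hedged sketch, not a procedure from the paper: it assumes each item comes with a log-frequency estimate, pairs items greedily within a tolerance, and uses invented words and numbers. The function name `frequency_matched_pairs` and the `tolerance` parameter are hypothetical.

```python
def frequency_matched_pairs(group_a, group_b, tolerance=0.5):
    """Greedily pair items whose log-frequencies differ by at most `tolerance`.

    `group_a` and `group_b` map item -> log-frequency. Each item in `group_b`
    is used at most once; items without a close enough partner are dropped.
    """
    remaining = dict(group_b)
    pairs = []
    for item_a, freq_a in sorted(group_a.items(), key=lambda kv: kv[1]):
        best = None  # (item_b, gap) of the closest remaining candidate
        for item_b, freq_b in remaining.items():
            gap = abs(freq_a - freq_b)
            if gap <= tolerance and (best is None or gap < best[1]):
                best = (item_b, gap)
        if best is not None:
            pairs.append((item_a, best[0]))
            del remaining[best[0]]
    return pairs


# Invented log-frequencies for two hypothetical item groups.
GROUP_A = {"some": 6.0, "warm": 4.2, "possible": 5.1}
GROUP_B = {"many": 5.8, "cool": 4.0, "rare": 2.0}
```

Greedy matching is simple but not optimal; with larger item sets an assignment-based matching would waste fewer items, at the cost of more machinery.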

Circularity Check

0 steps flagged

Empirical comparison study with no derivations or self-referential reductions

The paper conducts an experimental comparison of two evaluation methods (direct probability measurement and metalinguistic prompting) for pragmatic reasoning in LLMs, measuring scalar diversity gradients across models and conditions. It adopts scalar diversity as an external diagnostic and follows Hu & Levy (2023) for the method comparison, without re-deriving either or fitting them to the current data. No equations, ansatzes, uniqueness theorems, or parameter fits are present that would reduce any claimed result to the inputs by construction. Conclusions rest on observed empirical differences rather than any self-definitional or load-bearing self-citation chain, rendering the analysis self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that scalar diversity is a reliable diagnostic for pragmatic inference; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Scalar diversity serves as a graded diagnostic for pragmatic inference in LLMs
    Stated in the abstract, which also follows Hu & Levy (2023) for the comparison of evaluation methods

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

  24. [26]

    Ye-eun Cho. 2025. Prompting strategies of generative AI for Korean pragmatic inference. Korean Journal of Linguistics, 50(2):423--455

  25. [27]

    Ye-eun Cho and Seong mook Kim. 2024. Pragmatic inference of scalar implicature by LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 10--20

  26. [28]

    Ye-eun Cho and Yunho Maeng. 2025. Can vision-language models infer speaker's ignorance? The role of visual and linguistic cues. In Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025), pages 298--308, Suzhou, China. Association for Computational Linguistics

  27. [29]

    Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press

  28. [30]

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1--53

  29. [31]

    Yan Cong. 2024. Manner implicatures in large language models. Scientific Reports, 14(1):29113

  30. [32]

    Judith Degen and Michael K Tanenhaus. 2015. Processing scalar implicature: A constraint-based approach. Cognitive science, 39(4):667--710

  31. [33]

    Herbert P Grice. 1975. Logic and conversation. In Speech acts, pages 41--58. Brill

  32. [34]

    Laurence Robert Horn. 1972. On the semantic properties of logical operators in English. University of California, Los Angeles

  33. [35]

    Jennifer Hu and Michael C Frank. 2024. Auxiliary task demands mask the capabilities of smaller language models. arXiv preprint arXiv:2404.02418

  34. [36]

    Jennifer Hu and Roger Levy. 2023. Prompting is not a substitute for probability measurements in large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5040--5060

  35. [37]

    Stephen C Levinson. 2000. Presumptive meanings: The theory of generalized conversational implicature. MIT press

  36. [38]

    Raja Marjieh, Ilia Sucholutsky, Pol van Rijn, Nori Jacoby, and Thomas L Griffiths. 2024. Large language models predict human sensory judgments across six modalities. Scientific Reports, 14(1):21445

  37. [39]

    Sabrina J Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. 2022. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857--872

  38. [40]

    OpenAI. 2024. Hello GPT-4o. OpenAI. https://openai.com/index/hello-gpt-4o

  39. [41]

    Elizabeth Pankratz and Bob Van Tiel. 2021. The role of relevance for scalar diversity: a usage-based approach. Language and Cognition, 13(4):562--594

  40. [42]

    Dojun Park, Jiwoo Lee, Hyeyun Jeong, Seohyun Park, and Sungeun Lee. 2024. Pragmatic competence evaluation of large language models for the Korean language. In Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation, pages 256--266

  41. [43]

    Eszter Ronai and Ming Xiang. 2021. Pragmatic inferences are QUD-sensitive: An experimental study. Journal of Linguistics, 57(4):841--870

  42. [44]

    Eszter Ronai and Ming Xiang. 2024. What could have been said? alternatives and variability in pragmatic inferences. Journal of Memory and Language, 136:104507

  43. [45]

    Valery Shulginov, Hasan Berkcan Şimşek, Sergei Kudriashov, Renata Randautsova, and Sofya A Shevela. 2025. Evaluating the pragmatic competence of large language models in detecting mitigated and unmitigated types of disagreement. In Proceedings of the International Conference “Dialogue”, volume 2025

  44. [46]

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. 2023. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952--74965

  45. [47]

    Bob Van Tiel, Emiel Van Miltenburg, Natalia Zevakhina, and Bart Geurts. 2016. Scalar diversity. Journal of semantics, 33(1):137--175

  46. [48]

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32

  47. [49]

    Albert Webson and Ellie Pavlick. 2022. Do prompt-based models really understand the meaning of their prompts? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300--2344

  48. [50]

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, et al. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671