Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity
Pith reviewed 2026-05-12 03:02 UTC · model grok-4.3
The pith
Pragmatic reasoning in large language models arises from the interplay between internal probability distributions and specific evaluation prompts rather than from any fixed underlying competence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Comparing direct probability measurement with metalinguistic prompting across multiple models and settings shows that neither method consistently outperforms the other. Pragmatic behavior varies substantially by model family, prompting strategy, and task structure, with scalar diversity gradients emerging only in specific combinations. These results indicate that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence that any single evaluation paradigm can capture.
What carries the argument
Scalar diversity, used as a graded diagnostic that tracks how strongly models favor stronger scalar expressions over weaker ones in contexts that license implicature.
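As a rough illustration of what a direct probability readout could look like, the sketch below renormalizes the log-probabilities of two competing continuations (one consistent with the implicature, one with the literal reading) into a single preference score per scale. The function name, the two-alternative setup, and all numbers are assumptions made for illustration; the paper's actual scoring procedure is not described on this page.

```python
import math


def implicature_preference(logp_pragmatic: float, logp_literal: float) -> float:
    """Renormalize two continuation log-probabilities into a score in [0, 1]."""
    top = max(logp_pragmatic, logp_literal)  # subtract the max for numerical stability
    p = math.exp(logp_pragmatic - top)
    q = math.exp(logp_literal - top)
    return p / (p + q)


# Hypothetical summed log-probabilities (pragmatic reading, literal reading) per scale.
# These numbers are placeholders, not results from the paper.
example_logprobs = {
    "some/all": (-1.2, -2.3),
    "warm/hot": (-2.0, -1.9),
}

direct_scores = {
    scale: implicature_preference(lp_prag, lp_lit)
    for scale, (lp_prag, lp_lit) in example_logprobs.items()
}

for scale, score in direct_scores.items():
    print(f"{scale}: {score:.3f}")
```

A score above 0.5 means the model assigns more probability to the implicature-consistent continuation. A metalinguistic prompt would instead ask the model for a judgment in text, so the two readouts need not agree for the same item.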
If this is right
- Evaluation design must be treated as a central variable when interpreting any observed pragmatic abilities in LLMs.
- Direct probability measures and metalinguistic prompts can produce divergent pictures of the same model's behavior.
- Pragmatic patterns are not uniform across model families or task formats.
- Scalar diversity is observable only under particular model-condition pairings rather than as a general property.
Where Pith is reading between the lines
- Future tests of LLM pragmatics may need to combine probability readout with controlled prompting to separate internal knowledge from surface task effects.
- The findings suggest that training data statistics alone do not guarantee stable pragmatic behavior once prompting changes.
- Similar graded diagnostics could be applied to other pragmatic phenomena to check whether the same interaction between representation and task appears.
Load-bearing premise
That the degree of scalar diversity reliably separates genuine pragmatic inference from behavior produced merely by the evaluation task itself.
What would settle it
A replication in which scalar diversity gradients appear at comparable strength across every tested model family and both probability-based and prompting-based measurement methods.
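One way such a replication could be summarized is a rank correlation between per-scale model scores and reference implicature rates, computed separately for every model and method pairing. The sketch below assumes that a "gradient" is operationalized as a Spearman correlation, which this page does not confirm; the condition labels and all values are invented placeholders.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder reference rates for four scales (not data from the paper).
reference_rates = [0.9, 0.6, 0.3, 0.1]

# Placeholder per-scale scores for hypothetical model/method conditions.
condition_scores = {
    ("model_A", "direct"): [0.8, 0.7, 0.4, 0.2],
    ("model_A", "metalinguistic"): [0.5, 0.6, 0.5, 0.55],
    ("model_B", "direct"): [0.4, 0.5, 0.45, 0.5],
    ("model_B", "metalinguistic"): [0.7, 0.5, 0.4, 0.3],
}

gradients = {
    condition: spearman(scores, reference_rates)
    for condition, scores in condition_scores.items()
}

for condition, rho in gradients.items():
    print(condition, round(rho, 3))
```

Under this reading, the settling outcome would be correlations of similar size in every row, rather than strong values in some conditions and weak ones in others.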
Original abstract
Evaluating pragmatic reasoning in large language models (LLMs) remains challenging because model behavior can vary depending on evaluation methods. Previous studies suggest that prompt-based judgments may diverge from models' internal probability distributions, raising questions about whether observed performance reflects underlying competence or task-induced behavior. This study examines this issue using scalar diversity as a graded diagnostic for pragmatic inference. Following Hu & Levy (2023), this study compares direct probability measurement and metalinguistic prompting across multiple models and experimental settings. The results show that neither evaluation method consistently outperforms the other and that pragmatic behavior varies substantially across model families, prompting strategies, and task structures. Moreover, scalar diversity gradients emerge only in specific model-condition combinations, suggesting that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence captured by a single evaluation paradigm. These findings highlight the central role of evaluation design in interpreting pragmatic abilities in LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates pragmatic reasoning in LLMs using scalar diversity (following Hu & Levy 2023) as a graded diagnostic. It compares direct probability measurement against metalinguistic prompting across multiple models, prompting strategies, and task structures. The central results are that neither method consistently outperforms the other, pragmatic behavior varies substantially across model families and conditions, and scalar diversity gradients appear only in specific model-condition combinations. The authors conclude that pragmatic reasoning reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable, method-independent competence.
Significance. If the empirical patterns hold after addressing missing details and alternative explanations, the work would usefully demonstrate the sensitivity of LLM pragmatic assessments to evaluation design. It extends prior scalar-diversity diagnostics to a multi-method comparison and provides evidence that single-paradigm evaluations may mischaracterize underlying abilities. This could encourage more careful, multi-method protocols in future studies of linguistic competence in LLMs.
Major comments (2)
- [Abstract] The claim that scalar diversity gradients emerge only in specific model-condition combinations (and that this supports an interaction interpretation) is presented at a high level without sample sizes, statistical tests, or data exclusion criteria. This prevents verification of whether the reported condition-specific patterns are robust or driven by the factors the authors highlight.
- [Results/Discussion] The central interpretation that pragmatic behavior reflects an interaction between internal representations and task-induced prompting (rather than stable competence) assumes scalar diversity specifically isolates pragmatic inference. The manuscript provides no indication of controls for alternative factors such as token frequency, training-data co-occurrence statistics, or uniform prompt-induced probability shifts that could produce similar graded patterns on scalar terms.
Minor comments (1)
- [Methods] Ensure all model versions, exact prompt templates, and the precise definition of scalar diversity (including any adaptations from Hu & Levy 2023) are reported in full in the methods section to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments. We address each major point below, indicating planned revisions where appropriate.
- Referee: [Abstract] The claim that scalar diversity gradients emerge only in specific model-condition combinations (and that this supports an interaction interpretation) is presented at a high level without sample sizes, statistical tests, or data exclusion criteria. This prevents verification of whether the reported condition-specific patterns are robust or driven by the factors the authors highlight.
  Authors: We agree that the abstract presents the central claim concisely without methodological specifics. The main text details the experimental scope (multiple models, prompting strategies, and task structures), reports statistical tests for gradient significance and interactions, and specifies data exclusion rules. To improve verifiability from the abstract alone, we will revise it to briefly reference the statistical support for the condition-specific patterns while remaining within length limits.
  Revision: yes
- Referee: [Results/Discussion] The central interpretation that pragmatic behavior reflects an interaction between internal representations and task-induced prompting (rather than stable competence) assumes scalar diversity specifically isolates pragmatic inference. The manuscript provides no indication of controls for alternative factors such as token frequency, training-data co-occurrence statistics, or uniform prompt-induced probability shifts that could produce similar graded patterns on scalar terms.
  Authors: This concern is well-taken. Our design follows the scalar-diversity paradigm of Hu & Levy (2023) to probe graded pragmatic inference. Although we did not include explicit controls for token frequency or co-occurrence, the key finding that scalar diversity gradients appear only under certain method-model combinations (and not uniformly) is difficult to attribute solely to stable factors like frequency, which would be constant across direct-probability and metalinguistic conditions. We will add a paragraph in the Discussion explicitly acknowledging these alternative explanations and describing how future studies could control for them (e.g., frequency-matched item sets or prompt-ablation experiments).
  Revision: yes
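The frequency control mentioned in the rebuttal could take several forms. One simple option, sketched here as an assumption rather than as the authors' planned analysis, is to regress log frequency out of both the model scores and the reference rates and then correlate the residuals (a partial correlation). All variable names and values below are hypothetical.

```python
from statistics import mean


def residualize(y, x):
    """Return residuals of y after a least-squares line fit on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def partial_correlation(scores, reference, control):
    """Correlate scores and reference after removing a linear effect of control."""
    return pearson(residualize(scores, control), residualize(reference, control))


# Placeholder values for five scales (not data from the paper).
log_frequency = [5.1, 4.2, 3.9, 3.0, 2.4]
model_scores = [0.82, 0.64, 0.70, 0.35, 0.30]
reference_rates = [0.90, 0.55, 0.75, 0.30, 0.20]

raw = pearson(model_scores, reference_rates)
controlled = partial_correlation(model_scores, reference_rates, log_frequency)
print(f"raw r = {raw:.3f}, frequency-controlled r = {controlled:.3f}")
```

If a gradient shrinks sharply once frequency is partialled out, frequency becomes a plausible competing explanation for that condition; if it survives, the pragmatic reading is harder to dismiss.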
Circularity Check
Empirical comparison study with no derivations or self-referential reductions
The paper conducts an experimental comparison of two evaluation methods (direct probability measurement and metalinguistic prompting) for pragmatic reasoning in LLMs, measuring scalar diversity gradients across models and conditions. It explicitly follows the definition of scalar diversity from Hu & Levy (2023) as an external diagnostic without re-deriving or fitting it to the current data. No equations, ansatzes, uniqueness theorems, or parameter fits are present that would reduce any claimed result to the inputs by construction. Conclusions rest on observed empirical differences rather than any self-definitional or load-bearing self-citation chain, rendering the analysis self-contained.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Scalar diversity serves as a graded diagnostic for pragmatic inference in LLMs.
Reference graph
Works this paper leans on
- [26] Ye-eun Cho. 2025. Prompting strategies of generative AI for Korean pragmatic inference. Korean Journal of Linguistics, 50(2):423–455.
- [27] Ye-eun Cho and Seong mook Kim. 2024. Pragmatic inference of scalar implicature by LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 10–20.
- [28] Ye-eun Cho and Yunho Maeng. 2025. Can vision-language models infer speaker's ignorance? The role of visual and linguistic cues. In Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025), pages 298–308, Suzhou, China. Association for Computational Linguistics.
- [29] Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press.
- [30] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.
- [31] Yan Cong. 2024. Manner implicatures in large language models. Scientific Reports, 14(1):29113.
- [32] Judith Degen and Michael K Tanenhaus. 2015. Processing scalar implicature: A constraint-based approach. Cognitive Science, 39(4):667–710.
- [33] Herbert P Grice. 1975. Logic and conversation. In Speech Acts, pages 41–58. Brill.
- [34] Laurence Robert Horn. 1972. On the semantic properties of logical operators in English. University of California, Los Angeles.
- [35]
- [36] Jennifer Hu and Roger Levy. 2023. Prompting is not a substitute for probability measurements in large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5040–5060.
- [37] Stephen C Levinson. 2000. Presumptive meanings: The theory of generalized conversational implicature. MIT Press.
- [38] Raja Marjieh, Ilia Sucholutsky, Pol van Rijn, Nori Jacoby, and Thomas L Griffiths. 2024. Large language models predict human sensory judgments across six modalities. Scientific Reports, 14(1):21445.
- [39] Sabrina J Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. 2022. Reducing conversational agents' overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872.
- [40] OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o
- [41] Elizabeth Pankratz and Bob Van Tiel. 2021. The role of relevance for scalar diversity: A usage-based approach. Language and Cognition, 13(4):562–594.
- [42] Dojun Park, Jiwoo Lee, Hyeyun Jeong, Seohyun Park, and Sungeun Lee. 2024. Pragmatic competence evaluation of large language models for the Korean language. In Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation, pages 256–266.
- [43] Eszter Ronai and Ming Xiang. 2021. Pragmatic inferences are QUD-sensitive: An experimental study. Journal of Linguistics, 57(4):841–870.
- [44] Eszter Ronai and Ming Xiang. 2024. What could have been said? Alternatives and variability in pragmatic inferences. Journal of Memory and Language, 136:104507.
- [45] Valery Shulginov, Hasan Berkcan Şimşek, Sergei Kudriashov, Renata Randautsova, and Sofya A Shevela. 2025. Evaluating the pragmatic competence of large language models in detecting mitigated and unmitigated types of disagreement. In Proceedings of the International Conference "Dialogue", volume 2025.
- [46] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. 2023. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965.
- [47] Bob Van Tiel, Emiel Van Miltenburg, Natalia Zevakhina, and Bart Geurts. 2016. Scalar diversity. Journal of Semantics, 33(1):137–175.
- [48] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32.
- [49] Albert Webson and Ellie Pavlick. 2022. Do prompt-based models really understand the meaning of their prompts? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300–2344.
- [50] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, et al. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.