Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity

arxiv: 2605.09042 · v1 · submitted 2026-05-09 · 💻 cs.CL

Ye-eun Cho

Pith reviewed 2026-05-12 03:02 UTC · model grok-4.3

classification 💻 cs.CL

keywords pragmatic reasoning · large language models · scalar diversity · scalar implicature · evaluation methods · prompting strategies · probability distributions · metalinguistic judgments

The pith

Pragmatic reasoning in large language models arises from the interplay between internal probability distributions and specific evaluation prompts rather than from any fixed underlying competence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs exhibit pragmatic inference by measuring how they handle scalar terms like 'some' versus 'all' under two different evaluation approaches: reading off probabilities directly from the model and asking the model to judge sentences explicitly. It finds that neither approach reliably produces better results than the other and that clear patterns of scalar diversity, where the strength of the inference varies from one scalar expression to another, appear only in particular models under particular prompting conditions. This matters for anyone trying to assess what LLMs actually understand about language because it shows that apparent pragmatic ability is not a stable property of the model but depends heavily on how the test is set up. If the findings hold, claims about LLM pragmatics must be qualified by the exact measurement method used.

Core claim

Comparing direct probability measurement with metalinguistic prompting across multiple models and settings shows that neither method consistently outperforms the other. Pragmatic behavior varies substantially by model family, prompting strategy, and task structure, with scalar diversity gradients emerging only in specific combinations. These results indicate that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence that any single evaluation paradigm can capture.
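The contrast between the two readouts can be sketched in a few lines. This is a minimal illustration, assuming hypothetical log-probabilities in place of a real model call; the function names, the yes/no framing, and the numbers are illustrative assumptions and are not the paper's prompts or scoring procedure.

```python
def direct_readout(logp_pragmatic, logp_literal):
    """Direct measurement: compare the log-probabilities the model assigns
    to the pragmatic and the literal continuation of the same context."""
    return "pragmatic" if logp_pragmatic > logp_literal else "literal"


def metalinguistic_readout(logp_yes, logp_no):
    """Metalinguistic prompting: the model is asked an explicit question
    and the log-probabilities of its 'Yes' and 'No' answers are compared."""
    return "pragmatic" if logp_yes > logp_no else "literal"


# Hypothetical scores for a single test item (not taken from the paper).
item = {
    "logp_pragmatic": -2.1,
    "logp_literal": -2.9,
    "logp_yes": -1.6,
    "logp_no": -0.4,
}

direct = direct_readout(item["logp_pragmatic"], item["logp_literal"])
meta = metalinguistic_readout(item["logp_yes"], item["logp_no"])
methods_agree = direct == meta
```

With these toy numbers the two readouts disagree on the same item, which is the kind of divergence the comparison is designed to expose.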

What carries the argument

Scalar diversity, the observation that different scalar expressions give rise to implicatures at different rates, used as a graded diagnostic for pragmatic inference.
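One way to picture a graded diagnostic of this kind is to compute, for each scale, the share of responses consistent with the implicature and then look at how much those shares differ. The sketch below assumes a hypothetical list of (scale, response) pairs; the scale labels and responses are invented for illustration and do not come from the paper's item set.

```python
from collections import defaultdict


def implicature_rates(responses):
    """Return the share of 'pragmatic' responses for each scale."""
    counts = defaultdict(lambda: [0, 0])  # scale -> [pragmatic, total]
    for scale, label in responses:
        counts[scale][0] += label == "pragmatic"
        counts[scale][1] += 1
    return {scale: p / n for scale, (p, n) in counts.items()}


# Hypothetical responses: four trials per scale.
responses = [
    ("some/all", "pragmatic"), ("some/all", "pragmatic"),
    ("some/all", "pragmatic"), ("some/all", "literal"),
    ("warm/hot", "pragmatic"), ("warm/hot", "literal"),
    ("warm/hot", "literal"), ("warm/hot", "literal"),
    ("good/excellent", "pragmatic"), ("good/excellent", "pragmatic"),
    ("good/excellent", "literal"), ("good/excellent", "literal"),
]

rates = implicature_rates(responses)
ranked = sorted(rates, key=rates.get, reverse=True)   # strongest to weakest
spread = max(rates.values()) - min(rates.values())    # size of the gradient
```

A flat profile (spread near zero) would mean the model treats all scales alike; a wide spread with a stable ordering is what a diversity gradient looks like in this toy setting.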

If this is right

Evaluation design must be treated as a central variable when interpreting any observed pragmatic abilities in LLMs.
Direct probability measures and metalinguistic prompts can produce divergent pictures of the same model's behavior.
Pragmatic patterns are not uniform across model families or task formats.
Scalar diversity is observable only under particular model-condition pairings rather than as a general property.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future tests of LLM pragmatics may need to combine probability readout with controlled prompting to separate internal knowledge from surface task effects.
The findings suggest that training data statistics alone do not guarantee stable pragmatic behavior once prompting changes.
Similar graded diagnostics could be applied to other pragmatic phenomena to check whether the same interaction between representation and task appears.

Load-bearing premise

That the degree of scalar diversity reliably separates genuine pragmatic inference from behavior produced merely by the evaluation task itself.

What would settle it

A replication in which scalar diversity gradients appear at comparable strength across every tested model family and both probability-based and prompting-based measurement methods.
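A check of this kind could be sketched as follows, assuming that gradient strength is measured as a Spearman rank correlation between each model-method cell's per-scale rates and a reference profile. The reference values, the cell names, the rates, and the 0.8 threshold are all hypothetical choices made for illustration; the paper may quantify gradients differently.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical reference rates for four scales.
reference = [0.9, 0.6, 0.4, 0.1]

# Hypothetical per-scale rates for each (model, method) cell.
cells = {
    ("model_a", "direct"): [0.8, 0.7, 0.3, 0.2],
    ("model_a", "metalinguistic"): [0.5, 0.6, 0.4, 0.45],
    ("model_b", "direct"): [0.3, 0.2, 0.6, 0.7],
    ("model_b", "metalinguistic"): [0.85, 0.5, 0.45, 0.2],
}

gradients = {cell: spearman(reference, rates) for cell, rates in cells.items()}

# The settling outcome would be every cell clearing the (arbitrary) threshold.
uniform = all(rho >= 0.8 for rho in gradients.values())
```

In this toy data two cells reproduce the reference ordering and two do not, so `uniform` is false, mirroring the condition-specific pattern the paper reports.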

Figures

Figures reproduced from arXiv: 2605.09042 by Ye-eun Cho.

**Figure 2.** Overall accuracy of scalar inference predictions across models and prompting conditions for Experiments

**Figure 3.** Item-level accuracy across scalar items for each model and evaluation condition in Experiment B

**Figure 4.** Item-level accuracy across scalar items for each model and evaluation condition in Experiment A

Original abstract

Evaluating pragmatic reasoning in large language models (LLMs) remains challenging because model behavior can vary depending on evaluation methods. Previous studies suggest that prompt-based judgments may diverge from models' internal probability distributions, raising questions about whether observed performance reflects underlying competence or task-induced behavior. This study examines this issue using scalar diversity as a graded diagnostic for pragmatic inference. Following Hu & Levy (2023), this study compares direct probability measurement and metalinguistic prompting across multiple models and experimental settings. The results show that neither evaluation method consistently outperforms the other and that pragmatic behavior varies substantially across model families, prompting strategies, and task structures. Moreover, scalar diversity gradients emerge only in specific model-condition combinations, suggesting that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence captured by a single evaluation paradigm. These findings highlight the central role of evaluation design in interpreting pragmatic abilities in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper extends Hu & Levy by comparing probability and prompting methods across models and finds scalar diversity gradients only in specific combinations, but lacks controls that would separate pragmatic inference from simpler lexical or frequency effects.

The main takeaway is that neither probability measurement nor metalinguistic prompting gives a consistent picture of pragmatic reasoning in LLMs, and scalar diversity effects appear only under particular model and task conditions. The authors read this as evidence that apparent pragmatic ability is really an interaction between internal probabilities and prompting rather than a stable competence that one method can capture reliably.

That pattern is the concrete new result here. It builds directly on the earlier work by running the same diagnostic across more models and varying the setups, which turns up where the gradients do and do not show up. That is useful for anyone who has to choose an evaluation method and wants to see that the choice matters. The paper keeps the claims tied to the observed differences and does not overstate the scope.

The soft spot is exactly the one the stress-test flags. Scalar diversity is treated as a graded diagnostic for pragmatic inference, yet the description gives no indication of controls for token frequency, training co-occurrence, or prompt-driven probability shifts that could produce similar patterns without any implicature reasoning. If those alternatives are not ruled out, the interaction claim rests on an assumption rather than a direct contrast. The abstract is also light on sample sizes, statistical tests, and exclusion criteria, so it is difficult to judge how stable the reported condition-specific gradients actually are.

This is for researchers working on LLM evaluation in pragmatics or linguistic probing more generally. A reader who needs to know that single-paradigm tests can mislead will get something concrete from the variability findings, even if they will want tighter controls before treating the results as settled.

I would send it to peer review. The comparison is straightforward and the question about method dependence is worth referee time, though the paper will need clearer reporting and additional checks to support the stronger interpretation.

Referee Report

2 major / 1 minor

Summary. The paper evaluates pragmatic reasoning in LLMs using scalar diversity as a graded diagnostic. Following Hu & Levy (2023), it compares direct probability measurement against metalinguistic prompting across multiple models, prompting strategies, and task structures. The central results are that neither method consistently outperforms the other, pragmatic behavior varies substantially across model families and conditions, and scalar diversity gradients appear only in specific model-condition combinations. The authors conclude that pragmatic reasoning reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable, method-independent competence.

Significance. If the empirical patterns hold after addressing missing details and alternative explanations, the work would usefully demonstrate the sensitivity of LLM pragmatic assessments to evaluation design. It extends prior scalar-diversity diagnostics to a multi-method comparison and provides evidence that single-paradigm evaluations may mischaracterize underlying abilities. This could encourage more careful, multi-method protocols in future studies of linguistic competence in LLMs.

major comments (2)

[Abstract] The claim that scalar diversity gradients emerge only in specific model-condition combinations (and that this supports an interaction interpretation) is presented at a high level without sample sizes, statistical tests, or data exclusion criteria. This prevents verification of whether the reported condition-specific patterns are robust or driven by the factors the authors highlight.
[Results/Discussion] The central interpretation that pragmatic behavior reflects an interaction between internal representations and task-induced prompting (rather than stable competence) assumes scalar diversity specifically isolates pragmatic inference. The manuscript provides no indication of controls for alternative factors such as token frequency, training-data co-occurrence statistics, or uniform prompt-induced probability shifts that could produce similar graded patterns on scalar terms.
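The frequency concern could be probed with a partial correlation: correlate the model's per-scale rates with a reference profile after removing the linear effect of log frequency. The sketch below implements Pearson and first-order partial correlation by hand; every number in it is hypothetical, and nothing here is an analysis the manuscript is known to contain.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical per-scale values for five scales.
model_rates = [0.82, 0.64, 0.41, 0.30, 0.15]
reference_rates = [0.90, 0.55, 0.50, 0.20, 0.10]
log_freq = [5.1, 4.0, 4.4, 3.2, 2.9]  # assumed log frequency of the weak term

raw = pearson(model_rates, reference_rates)
controlled = partial_corr(model_rates, reference_rates, log_freq)
```

If `controlled` falls well below `raw`, much of the apparent gradient is shared with frequency; if the two stay close, frequency alone is a less plausible explanation.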

minor comments (1)

[Methods] Ensure all model versions, exact prompt templates, and the precise definition of scalar diversity (including any adaptations from Hu & Levy 2023) are reported in full in the methods section to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address each major point below, indicating planned revisions where appropriate.

Referee: [Abstract] The claim that scalar diversity gradients emerge only in specific model-condition combinations (and that this supports an interaction interpretation) is presented at a high level without sample sizes, statistical tests, or data exclusion criteria. This prevents verification of whether the reported condition-specific patterns are robust or driven by the factors the authors highlight.

Authors: We agree that the abstract presents the central claim concisely without methodological specifics. The main text details the experimental scope (multiple models, prompting strategies, and task structures), reports statistical tests for gradient significance and interactions, and specifies data exclusion rules. To improve verifiability from the abstract alone, we will revise it to briefly reference the statistical support for the condition-specific patterns while remaining within length limits.

Revision: yes

Referee: [Results/Discussion] The central interpretation that pragmatic behavior reflects an interaction between internal representations and task-induced prompting (rather than stable competence) assumes scalar diversity specifically isolates pragmatic inference. The manuscript provides no indication of controls for alternative factors such as token frequency, training-data co-occurrence statistics, or uniform prompt-induced probability shifts that could produce similar graded patterns on scalar terms.

Authors: This concern is well-taken. Our design uses scalar diversity to probe graded pragmatic inference and follows Hu & Levy (2023) in comparing direct-probability and metalinguistic readouts. Although we did not include explicit controls for token frequency or co-occurrence, the key finding that scalar diversity gradients appear only under certain method-model combinations (and not uniformly) is difficult to attribute solely to stable factors like frequency, which would be constant across direct-probability and metalinguistic conditions. We will add a paragraph in the Discussion explicitly acknowledging these alternative explanations and describing how future studies could control for them (e.g., frequency-matched item sets or prompt-ablation experiments).

Revision: yes
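A frequency-matched item set of the kind mentioned in the response could be built with a simple greedy matcher. This is a sketch under stated assumptions: the words, their log-frequency values, and the 0.25 tolerance are hypothetical, and greedy matching is one simple option, not a procedure the authors describe.

```python
def match_by_frequency(targets, controls, tolerance=0.25):
    """Greedily pair each target with the closest unused control whose
    log frequency differs by at most `tolerance`; unmatched targets are dropped."""
    unused = dict(controls)
    pairs = []
    for name, freq in sorted(targets.items(), key=lambda kv: kv[1]):
        if not unused:
            break
        best = min(unused, key=lambda c: abs(unused[c] - freq))
        if abs(unused[best] - freq) <= tolerance:
            pairs.append((name, best))
            del unused[best]
    return pairs


# Hypothetical log frequencies for scalar terms and candidate control words.
targets = {"some": 6.2, "warm": 4.8, "good": 6.0}
controls = {"these": 6.1, "soft": 4.7, "rare": 3.0, "many": 5.7}

pairs = match_by_frequency(targets, controls)
```

Here "some" ends up unmatched because its nearest remaining control lies outside the tolerance, which shows the usual trade-off: tighter matching yields cleaner comparisons but fewer usable items.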

Circularity Check

0 steps flagged

Empirical comparison study with no derivations or self-referential reductions

The paper conducts an experimental comparison of two evaluation methods (direct probability measurement and metalinguistic prompting) for pragmatic reasoning in LLMs, measuring scalar diversity gradients across models and conditions. It adopts scalar diversity as an external diagnostic and follows Hu & Levy (2023) for the comparison of methods, without re-deriving or fitting either to the current data. No equations, ansatzes, uniqueness theorems, or parameter fits are present that would reduce any claimed result to the inputs by construction. Conclusions rest on observed empirical differences rather than any self-definitional or load-bearing self-citation chain, rendering the analysis self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that scalar diversity is a reliable diagnostic for pragmatic inference; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Scalar diversity serves as a graded diagnostic for pragmatic inference in LLMs
Explicitly follows Hu & Levy (2023) as stated in the abstract

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

[1] Ye-eun Cho. 2025. Prompting strategies of generative AI for Korean pragmatic inference. Korean Journal of Linguistics, 50(2):423–455.

[2] Ye-eun Cho and Seong mook Kim. 2024. Pragmatic inference of scalar implicature by LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 10–20.

[3] Ye-eun Cho and Yunho Maeng. 2025. Can vision-language models infer speaker's ignorance? The role of visual and linguistic cues. In Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025), pages 298–308, Suzhou, China. Association for Computational Linguistics.

[4] Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press.

[5] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.

[6] Yan Cong. 2024. Manner implicatures in large language models. Scientific Reports, 14(1):29113.

[7] Judith Degen and Michael K Tanenhaus. 2015. Processing scalar implicature: A constraint-based approach. Cognitive Science, 39(4):667–710.

[8] Herbert P Grice. 1975. Logic and conversation. In Speech acts, pages 41–58. Brill.

[9] Laurence Robert Horn. 1972. On the semantic properties of logical operators in English. University of California, Los Angeles.

[10] Jennifer Hu and Michael C Frank. 2024. Auxiliary task demands mask the capabilities of smaller language models. arXiv preprint arXiv:2404.02418.

[11] Jennifer Hu and Roger Levy. 2023. Prompting is not a substitute for probability measurements in large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5040–5060.

[12] Stephen C Levinson. 2000. Presumptive meanings: The theory of generalized conversational implicature. MIT Press.

[13] Raja Marjieh, Ilia Sucholutsky, Pol van Rijn, Nori Jacoby, and Thomas L Griffiths. 2024. Large language models predict human sensory judgments across six modalities. Scientific Reports, 14(1):21445.

[14] Sabrina J Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. 2022. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872.

[15] OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o

[16] Elizabeth Pankratz and Bob Van Tiel. 2021. The role of relevance for scalar diversity: a usage-based approach. Language and Cognition, 13(4):562–594.

[17] Dojun Park, Jiwoo Lee, Hyeyun Jeong, Seohyun Park, and Sungeun Lee. 2024. Pragmatic competence evaluation of large language models for the Korean language. In Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation, pages 256–266.

[18] Eszter Ronai and Ming Xiang. 2021. Pragmatic inferences are QUD-sensitive: An experimental study. Journal of Linguistics, 57(4):841–870.

[19] Eszter Ronai and Ming Xiang. 2024. What could have been said? Alternatives and variability in pragmatic inferences. Journal of Memory and Language, 136:104507.

[20] Valery Shulginov, Hasan Berkcan Şimşek, Sergei Kudriashov, Renata Randautsova, and Sofya A Shevela. 2025. Evaluating the pragmatic competence of large language models in detecting mitigated and unmitigated types of disagreement. In Proceedings of the International Conference “Dialogue”, volume 2025.

[21] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. 2023. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965.

[22] Bob Van Tiel, Emiel Van Miltenburg, Natalia Zevakhina, and Bart Geurts. 2016. Scalar diversity. Journal of Semantics, 33(1):137–175.

[23] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32.

[24] Albert Webson and Ellie Pavlick. 2022. Do prompt-based models really understand the meaning of their prompts? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300–2344.

[25] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, et al. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.
