pith. machine review for the scientific record.

arxiv: 2604.15873 · v1 · submitted 2026-04-17 · 💻 cs.CL

Recognition: unknown

How Hypocritical Is Your LLM Judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords pragmatic · language · llms · models · linguistic · competence · evaluation · generation

The pith

LLMs perform substantially better as pragmatic listeners judging language than as speakers generating it, revealing weak alignment between the two roles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are tested for pragmatic competence, which involves understanding implied meanings in conversation beyond literal words. Researchers set up three different pragmatic scenarios and had models play two roles. In the listener role, models judged whether a given response fit the social context appropriately. In the speaker role, models had to create their own appropriate responses for the same contexts. Across multiple open and proprietary models, performance was consistently higher when judging than when generating. The gap appeared robust and was not limited to one type of model. This suggests that the skills needed to evaluate pragmatic language do not automatically transfer to producing it. The authors argue that separate testing of these abilities misses important inconsistencies in how models handle real communication.
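Concretely, the protocol pairs the two roles on identical items. Below is a minimal sketch in Python, assuming a hypothetical query_model() helper; the prompts, label set, and scoring here are illustrative stand-ins, not the authors' exact materials.

```python
# Sketch of the paired listener/speaker protocol. query_model() is a
# hypothetical stand-in for any chat-API call; prompts and scoring are
# illustrative, not reproduced from the paper.

def query_model(model: str, prompt: str) -> str:
    """Hypothetical helper: send a prompt to a model, return its reply."""
    raise NotImplementedError

def evaluate_item(model: str, item: dict) -> dict:
    # Listener role: judge a given response against the social context,
    # using a fixed label set (the paper's tasks use labels such as A/N/U).
    listener_prompt = (
        f"Context: {item['context']}\n"
        f"Response: {item['candidate_response']}\n"
        "Label the response: Appropriate (A), Not appropriate (N), or Unsure (U)."
    )
    judgment = query_model(model, listener_prompt).strip()

    # Speaker role: generate a response for the very same context.
    speaker_prompt = f"Context: {item['context']}\nWrite an appropriate response."
    generation = query_model(model, speaker_prompt).strip()

    return {
        "listener_correct": judgment.startswith(item["gold_label"]),
        "speaker_correct": item["score_generation"](generation),  # task-specific
    }
```

The key design point is that both roles see the same context, so any accuracy gap between them reflects the role, not the items.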

Core claim

We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers.

Load-bearing premise

That the three chosen pragmatic settings and the specific judgment/generation tasks accurately measure pragmatic competence without introducing systematic biases that favor listening over speaking.
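Operationally, the claim reduces to comparing two per-model accuracies over the same items. A small sketch of that comparison, continuing the hypothetical per-item result format from the sketch above (the paper's own aggregation and statistics may differ):

```python
from statistics import mean

def role_accuracies(results: list[dict]) -> tuple[float, float]:
    """Per-model accuracy in each role, given per-item results as above."""
    listener = mean(r["listener_correct"] for r in results)
    speaker = mean(r["speaker_correct"] for r in results)
    return listener, speaker

def listener_speaker_gap(results: list[dict]) -> float:
    # Positive gap: the model judges pragmatic language better than it produces it.
    listener, speaker = role_accuracies(results)
    return listener - speaker
```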

Figures

Figures reproduced from arXiv: 2604.15873 by Judith Sieker and Sina Zarrieß.

Figure 1
Figure 1. Example prompts for each task. False Presuppositions and Antipresuppositions prompts are originally in German. …We extend this setup with a pragmatic listener condition: models are presented with the original prompt, the explicitly stated false presupposition, and a model-generated response, and are instructed to judge the response using the same labels (A, N, U) and guidelines as in the orig… view at source ↗
Figure 2
Figure 2. Speaker–Listener accuracy across the three pragmatic tasks. Each panel shows speaker accuracy on the x-axis and listener accuracy on the y-axis. Each point is one model; colors indicate model families. The diagonal indicates equal speaker and listener accuracy, so points above the line correspond to models that are better in the listener task than in the speaker task. …we examine the conditional relationshi… view at source ↗
Figure 3
Figure 3. Speaker vs. Listener accuracy split by condition for the Antipresuppositions task. DEF = MP! demands a definite determiner, INDEF = MP! demands an indefinite determiner, BOTH = MP! demands the quantifier 'both' (MP! is the Maximize Presupposition! principle). Each point is one model; colors indicate model families. The diagonal marks equal speaker and listener accuracy, so points above the line correspond to models that are better in the listener task than in th… view at source ↗
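The scatter layout of Figures 2 and 3 is straightforward to reproduce for one's own evaluations. A hedged matplotlib sketch with invented example numbers (not the paper's results), showing the equal-accuracy diagonal described in the captions:

```python
import matplotlib.pyplot as plt

# Invented illustrative numbers, not the paper's results:
# (model name, speaker accuracy, listener accuracy)
points = [("model-a", 0.42, 0.71), ("model-b", 0.55, 0.68), ("model-c", 0.61, 0.60)]

fig, ax = plt.subplots(figsize=(4, 4))
for name, speaker, listener in points:
    ax.scatter(speaker, listener)
    ax.annotate(name, (speaker, listener), xytext=(4, 4), textcoords="offset points")

# Equal-accuracy diagonal: points above it are better listeners than speakers.
ax.plot([0, 1], [0, 1], linestyle="--", color="gray")
ax.set(xlim=(0, 1), ylim=(0, 1),
       xlabel="Speaker accuracy", ylabel="Listener accuracy")
plt.show()
```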
Original abstract

Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role aligns with success in the other. In this paper, we address this question for pragmatic competence by comparing LLMs' performance as pragmatic listeners, judging the appropriateness of linguistic outputs, and as pragmatic speakers, generating pragmatically appropriate language. We evaluate multiple open-weight and proprietary LLMs across three pragmatic settings. We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers. Our results suggest that pragmatic judging and pragmatic generation are only weakly aligned in current LLMs, calling for more integrated evaluation practices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivation chain or self-referential structure

full rationale

The paper is a direct empirical comparison of LLM performance in pragmatic listener (judgment) versus speaker (generation) roles across three settings. It reports observed asymmetries from model evaluations without any equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations. The central claim rests on experimental results rather than any closed logical loop or renaming of prior findings. This matches the default case of a self-contained benchmarking study against external model outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical evaluation study. No free parameters, mathematical axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5454 in / 821 out tokens · 37694 ms · 2026-05-10T08:39:35.097029+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 33 canonical work pages · 4 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, and 8 others. 2024. https://arxiv.org/abs/2412.08905 Phi-4 technic...

  2. [2]

    Manar Ali, Judith Sieker, Sina Zarrieß, and Hendrik Buschmeier. 2026. https://arxiv.org/abs/2601.07820 Reference games as a testbed for the alignment of model uncertainty and clarification requests . Preprint, arXiv:2601.07820

  3. [3]

Anthropic. 2025. Introducing Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5. Accessed: 2025-12-22

  4. [4]

    Nicholas Asher and Alex Lascarides. 2003. Logics of conversation

  5. [5]

Raha Askari, Sina Zarrieß, Özge Alacam, and Judith Sieker. 2025. https://doi.org/10.18653/v1/2025.babylm-main.4 Are BabyLMs deaf to Gricean maxims? A pragmatic evaluation of sample-efficient language models. In Proceedings of the First BabyLM Workshop, pages 52--65, Suzhou, China. Association for Computational Linguistics

  6. [6]

Tara Azin, Daniel Dumitrescu, Diana Inkpen, and Raj Singh. 2025. https://caiac.pubpub.org/pub/keh8ij01 Let's CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition. Proceedings of the Canadian Conference on Artificial Intelligence

  7. [7]

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. 2025. https:...

  8. [8]


    Marc H. Bornstein and Charleene Hendricks. 2012. https://doi.org/10.1017/S0305000911000407 Basic language comprehension and production in > 100,000 young children from sixteen developing nations . Journal of Child Language, 39(4):899–918

  9. [9]

Nitay Calderon, Roi Reichart, and Rotem Dror. 2025. https://doi.org/10.18653/v1/2025.acl-long.782 The alternative annotator test for LLM-as-a-judge: How to statistically justify replacing human annotators with LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16051--16081, Vi...

  10. [10]


    Tyler A. Chang and Benjamin K. Bergen. 2023. https://arxiv.org/abs/2303.11504 Language model behavior: A comprehensive survey . Preprint, arXiv:2303.11504

  11. [11]


    Fernanda Ferreira and Victor S. Ferreira. 2024. https://oecs.mit.edu/pub/y1uhdz0y Psycholinguistics . MIT Press

  12. [12]

    Suzanne Flynn. 1986. https://doi.org/10.1017/S0272263100006057 Production vs. comprehension: Differences in underlying competences . Studies in Second Language Acquisition, 8(2):135–164

  13. [13]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, and 1 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3 herd of models . Preprint, arXiv:2407.21783

  14. [14]

Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, and 24 others. 2024. https://doi.org/10.18653/v1/2024.acl-long.841 OLMo: Acceleratin...

  15. [15]

Irene Heim. 1991. https://doi.org/10.1515/9783110126969.7.487 Artikel und Definitheit [Articles and definiteness], pages 487--535. De Gruyter Mouton, Berlin, New York

  16. [16]

    Jennifer Hu, Sammy Floyd, Olessia Jouravlev, Evelina Fedorenko, and Edward Gibson. 2023. https://doi.org/10.18653/v1/2023.acl-long.230 A fine-grained comparison of pragmatic language understanding in humans and language models . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4194--...

  17. [17]

    Jennifer Hu and Roger Levy. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.306 Prompting is not a substitute for probability measurements in large language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5040--5060, Singapore. Association for Computational Linguistics

  18. [18]


    Mingyue Jian and N. Siddharth. 2024. https://arxiv.org/abs/2411.01562 Are llms good pragmatic speakers? Preprint, arXiv:2411.01562

  19. [19]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://arxiv.org/abs/2310.0...

  20. [20]

    Jad Kabbara and Jackie Chi Kit Cheung. 2022. https://aclanthology.org/2022.coling-1.65/ Investigating the performance of transformer-based NLI models on presuppositional inferences . In Proceedings of the 29th International Conference on Computational Linguistics, pages 779--785, Gyeongju, Republic of Korea. International Committee on Computational Linguistics

  21. [21]

Clara Lachenmaier, Judith Sieker, and Sina Zarrieß. 2025. https://doi.org/10.18653/v1/2025.acl-long.728 Can LLMs ground when they (don't) know: A study on direct and loaded political questions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14956--14975, Vienna, Austria. Associ...

  22. [22]

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. https://arxiv.org/abs/2412.05579 Llms-as-judges: A comprehensive survey on llm-based evaluation methods . Preprint, arXiv:2412.05579

  23. [23]

    Bolei Ma, Yuting Li, Wei Zhou, Ziwei Gong, Yang Janet Liu, Katja Jasinskaja, Annemarie Friedrich, Julia Hirschberg, Frauke Kreuter, and Barbara Plank. 2025. https://doi.org/10.18653/v1/2025.acl-long.425 Pragmatics in the era of large language models: A survey on datasets, evaluation, opportunities and challenges . In Proceedings of the 63rd Annual Meeting...

  24. [24]


    Antje S. Meyer, Falk Huettig, and Willem J.M. Levelt. 2016. https://doi.org/10.1016/j.jml.2016.03.002 Same, different, or closely related: What is the relationship between language production and comprehension? Journal of Memory and Language, 89:1--7. Speaking and Listening: Relationships Between Language Production and Comprehension

  25. [25]

Mistral AI. 2023. Mixtral of experts. https://mistral.ai/news/mixtral-of-experts/. Accessed: 2025-12-22

  26. [26]

    Philipp Mondorf and Barbara Plank. 2024. https://doi.org/10.18653/v1/2024.acl-long.508 Comparing inferential strategies of humans and large language models in deductive reasoning . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9370--9402, Bangkok, Thailand. Association for Computa...

  27. [27]

OpenAI. 2024. GPT-4o. https://openai.com/de-DE/index/hello-gpt-4o. Accessed: 2025-12-22

  28. [28]

OpenAI. 2025a. GPT-4.1. https://platform.openai.com/docs/models/gpt-4.1. Accessed: 2025-12-22

  29. [29]

OpenAI. 2025b. GPT-5. https://platform.openai.com/docs/models/gpt-5. Accessed: 2025-12-22

  30. [30]

Walter Paci, Alessandro Panunzi, and Sandro Pezzelle. 2025. https://doi.org/10.18653/v1/2025.findings-acl.804 They want to pretend not to understand: The limits of current LLMs in interpreting implicit content of political discourse. In Findings of the Association for Computational Linguistics: ACL 2025, pages 15569--15593, Vienna, Austria. Association ...

  31. [31]

Dojun Park, Jiwoo Lee, Seohyun Park, Hyeyun Jeong, Youngeun Koo, Soonha Hwang, Seonwoo Park, and Sungeun Lee. 2024. https://doi.org/10.18653/v1/2024.genbench-1.7 MultiPragEval: Multilingual pragmatic evaluation of large language models. In Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP, pages 96--119, Miami, Florida...

  32. [32]

Orin Percus. 2006. https://semanticsarchive.net/Archive/GI3YzhlM/AntipresuppositionsVersion1.pdf Antipresuppositions. Theoretical and Empirical Studies of Reference and Anaphora: Toward the establishment of generative grammar as an empirical science, pages 52--73

  33. [33]

    Paloma Piot, David Otero, Patricia Martín-Rodilla, and Javier Parapar. 2025. https://arxiv.org/abs/2512.09662 Can llms evaluate what they cannot annotate? revisiting llm reliability in hate speech detection . Preprint, arXiv:2512.09662

  34. [34]

    José Pombal, Dongkeun Yoon, Patrick Fernandes, Ian Wu, Seungone Kim, Ricardo Rei, Graham Neubig, and André F. T. Martins. 2025. https://arxiv.org/abs/2504.04953 M-prometheus: A suite of open multilingual llm judges . Preprint, arXiv:2504.04953

  35. [35]


    Linlu Qiu, Cedegao E. Zhang, Joshua B. Tenenbaum, Yoon Kim, and Roger P. Levy. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.1008 On the same wavelength? evaluating pragmatic reasoning in language models across broad concepts . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19924--19946, Suzhou, China....

  36. [36]

Qwen Team. 2025. Qwen3: Think Deeper, Act Faster. https://qwenlm.github.io/blog/qwen3/. Accessed: 2025-12-22

  37. [37]

    Cosima Schneider, Carolin Schonard, Michael Franke, Gerhard Jäger, and Markus Janczyk. 2019. https://doi.org/10.1016/j.cognition.2019.104024 Pragmatic processing: An investigation of the (anti-)presuppositions of determiners using mouse-tracking . Cognition, 193:104024

  38. [38]

Judith Sieker, Oliver Bott, Torgrim Solstad, and Sina Zarrieß. 2023. https://doi.org/10.18653/v1/2023.inlg-main.15 Beyond the bias: Unveiling the quality of implicit causality prompt continuations in language models. In Proceedings of the 16th International Natural Language Generation Conference, pages 206--220, Prague, Czechia. Association for Computati...

  39. [39]

Judith Sieker, Clara Lachenmaier, and Sina Zarrieß. 2025. https://escholarship.org/uc/item/4932r1hx LLMs struggle to reject false presuppositions when misinformation stakes are high. Proceedings of the Annual Meeting of the Cognitive Science Society, 47

  40. [40]

Judith Sieker and Sina Zarrieß. 2023. https://doi.org/10.18653/v1/2023.blackboxnlp-1.14 When your language model cannot even do determiners right: Probing for anti-presuppositions and the maximize presupposition! principle. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 180--198, Singapore. Asso...

  41. [41]

    Damien Sileo, Philippe Muller, Tim Van de Cruys, and Camille Pradel. 2022. https://aclanthology.org/2022.lrec-1.255/ A pragmatics-centered evaluation framework for natural language understanding . In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2382--2394, Marseille, France. European Language Resources Association

  42. [42]

Settaluri Sravanthi, Meet Doshi, Pavan Tankala, Rudra Murthy, Raj Dabre, and Pushpak Bhattacharyya. 2024. https://doi.org/10.18653/v1/2024.findings-acl.719 PUB: A pragmatics understanding benchmark for assessing LLMs' pragmatics capabilities. In Findings of the Association for Computational Linguistics: ACL 2024, pages 12075--12097, Bangkok, Thailand. ...

  43. [43]

    Robert Stalnaker. 1973. https://doi.org/10.1007/bf00262951 Presuppositions . Journal of Philosophical Logic, 2(4):447--457

  44. [44]

    Robert Stalnaker. 1978. Assertion. Syntax and Semantics (New York Academic Press), 9:315--332

  45. [45]

Andreas Stephan, Dawei Zhu, Matthias Aßenmacher, Xiaoyu Shen, and Benjamin Roth. 2025. https://aclanthology.org/2025.gem-1.65/ From calculation to adjudication: Examining LLM judges on mathematical reasoning tasks. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM), pages 759--773, Vienna, Austria and virtual meeting. Ass...

  46. [46]

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. 2025. https://aclanthology.org/2025.gem-1.33/ Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM), pages 404--430, Vienna, Austria and v...

  47. [47]


Jean-Baptiste Van der Henst, Yingrui Yang, and P.N. Johnson-Laird. 2002. https://doi.org/10.1207/s15516709cog2604_2 Strategies in sentential reasoning. Cognitive Science, 26(4):425--468

  48. [48]

    Shengguang Wu, Shusheng Yang, Zhenglun Chen, and Qi Su. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1258 Rethinking pragmatics in large language models: Towards open-ended evaluation and preference tuning . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22583--22599, Miami, Florida, USA. Association ...

  49. [49]

    Kefan Yu, Qingcheng Zeng, Weihao Xuan, Wanxin Li, Jingyi Wu, and Rob Voigt. 2025. https://arxiv.org/abs/2505.18497 The pragmatic mind of machines: Tracing the emergence of pragmatic competence in large language models . Preprint, arXiv:2505.18497
