
arxiv: 2605.21299 · v1 · pith:Y4KOYDJVnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

Tracing the ongoing emergence of human-like reasoning in Large Language Models

Pith reviewed 2026-05-21 05:01 UTC · model grok-4.3

keywords: large language models · pragmatic reasoning · conditional inferences · semantic operators · human-AI comparison · pragmatic enrichments · multilingual reasoning · inference tasks

The pith

Large language models accurately process literal meanings of conditionals but fail to make the pragmatic inferences that humans routinely add.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models interpret conditional statements the way humans do by going beyond literal logic to add implied meanings. A population-matching experiment compared responses from twenty-five LLMs to equal numbers of human participants across four languages on sentences such as promises that depend on an action versus statements that hold regardless of a condition. Humans consistently enriched the logic with pragmatic assumptions that vary by context. The models split into two patterns: some followed the strict logical truth table and ignored implications, while others applied one fixed reading across all cases. This shows LLMs function well as semantic processors but have not acquired the flexible pragmatic layer of human reasoning, and the gap does not depend on model openness, training focus, or architecture.

Core claim

Humans enrich logical reasoning through pragmatic inferences when processing conditional sentences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but ignore pragmatic inferences, while others deviate from the truth-table by adhering to a single interpretation across the board. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. LLM accuracy is neither predicted nor boosted by open versus closed status, training orientation, or architecture type.
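The contrast between the literal truth table and the two enriched readings can be made concrete with a small sketch. The function names `material`, `perfected`, and `biscuit` are labels chosen here for illustration, not terms defined by the paper; the sketch assumes the lawn-mowing example maps to a biconditional reading and the pizza example to a reading where the consequent holds regardless of the antecedent.

```python
from itertools import product


def material(p, q):
    """Literal truth-table reading: false only when p is true and q is false."""
    return (not p) or q


def perfected(p, q):
    """Enriched (biconditional) reading: 'q if and only if p'."""
    return p == q


def biscuit(p, q):
    """Biscuit reading: the consequent holds whatever the antecedent is."""
    return q


# Hypothetical registry of the three readings, for illustration only.
READINGS = {"material": material, "perfected": perfected, "biscuit": biscuit}


def truth_table(reading):
    """Map every (antecedent, consequent) pair to the reading's verdict."""
    rule = READINGS[reading]
    return {(p, q): rule(p, q) for p, q in product([True, False], repeat=2)}


for name in READINGS:
    print(name, truth_table(name))
```

The readings come apart when the antecedent is false: the literal reading accepts both rows, the biconditional reading accepts only the row where the consequent is also false, and the biscuit reading accepts only the row where the consequent is true.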

What carries the argument

Population-matching experiment that compares LLM and human responses to conditional inference questions in four languages.

If this is right

  • LLMs can be treated as reliable for literal logical processing of conditionals.
  • Pragmatic enrichment is not yet produced by current architectures or training regimes.
  • Human-like task performance can coexist with non-human underlying reasoning mechanisms.
  • Pragmatic reasoning remains an emerging rather than established capability in LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dialogue systems built on current LLMs may miss implied meanings that affect user expectations.
  • Targeted training on pragmatic inference examples could be tested as a way to close the observed gap.
  • The same experiment design could be applied to other linguistic constructions to map additional divergences.
  • Scaling model size alone may not automatically produce the missing pragmatic layer.

Load-bearing premise

The chosen conditional sentences and inference questions capture human pragmatic reasoning, and LLM output patterns reflect internal processes rather than surface response strategies.

What would settle it

New LLMs tested on the same conditional sentences and questions producing response distributions that match human pragmatic patterns across languages would challenge the central claim.

Figures

Figures reproduced from arXiv: 2605.21299 by Elena Pagliarini, Evelina Leivada, Fritz Günther, Nikoleta Pantelidou, Paolo Morosi.

Figure 1: Accuracy across conditions, agents, and languages. Colored triangles indicate mean LLM accuracy per language, while colored circles indicate mean human accuracy per language. Black triangles and circles indicate overall accuracy for models and humans respectively. Vertical bars represent SE.

Figure 2: Mean accuracy in experimental trials by model and language. The black line represents model mean accuracy across languages.

Figure 3: Mean accuracy across languages for all models in critical Standard and critical Biscuit conditions. The black line represents mean accuracy across the two conditions.
Original abstract

Humans effortlessly go beyond literal meanings: "If you mow the lawn, I will give you fifty dollars" is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas "If you are hungry, there is pizza in the oven" implies that pizza is available regardless of the hearer's hunger. Large Language Models (LLMs) show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twentyfive LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but they ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a population-matching experiment in which 25 LLMs and an equal number of human participants per language were tested on conditional inference tasks across four languages. Humans are shown to enrich logical conditionals with pragmatic inferences (e.g., biconditional or availability readings), while LLMs display heterogeneous behavior: some models adhere closely to classical truth-table semantics and ignore pragmatic enrichments, whereas others converge on a single interpretation regardless of context. The authors conclude that current LLMs act as accurate semantic operators but have not yet acquired the pragmatic reasoning component characteristic of human inference, and that this limitation is independent of model openness, training regime, or architecture.

Significance. If the empirical patterns are robust, the work supplies concrete cross-linguistic evidence that pragmatic enrichment remains an open challenge for LLMs, thereby sharpening the distinction between semantic competence and human-like reasoning. The finding that neither scale nor architectural type reliably predicts pragmatic performance is a useful negative result for model development. The population-matching design itself is a strength, as it directly compares model output distributions to human response distributions rather than relying on aggregate accuracy scores.

major comments (3)
  1. [Methods] Methods: The manuscript provides no explicit description of the exact prompt templates, paraphrases, or consistency checks used with the LLMs. Without these details it is impossible to determine whether the reported deviations from human pragmatic patterns reflect genuine reasoning differences or sensitivity to surface phrasing, as the skeptic note correctly flags.
  2. [Results] Results / population-matching analysis: No statistical tests, confidence intervals, or measures of inter-participant / inter-model variability are reported. The central claim that LLMs 'fail to capture pragmatic enrichments' therefore rests on qualitative descriptions of model behavior whose reliability cannot be assessed from the given information.
  3. [Discussion] Discussion: The assertion that LLM accuracy is 'neither predicted nor boosted by open vs. closed status, training orientation, or architecture type' is presented without the supporting regression or correlation analysis that would make the claim falsifiable.
minor comments (2)
  1. [Abstract] The abstract states 'twentyfive LLMs' without a hyphen; correct to 'twenty-five'.
  2. [Methods] Clarify whether the same set of conditional sentences was used for all languages or whether stimuli were translated and back-translated; this affects the cross-linguistic comparison.
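
The quantitative support requested in major comment 2 could take several forms. The following is a minimal sketch, assuming responses are coded into discrete inference types; it computes a Pearson chi-squared statistic between two response distributions, the Shannon entropy of one distribution, and a percentile bootstrap interval for a proportion. All function names, category labels, and counts are hypothetical and are not taken from the paper.

```python
import math
import random


def chi_squared_statistic(counts_a, counts_b):
    """Pearson chi-squared statistic for a 2 x k table of response counts."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        if column == 0:
            continue  # a response type nobody produced contributes nothing
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat


def response_entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def bootstrap_ci(responses, target, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the proportion of `target` responses."""
    rng = random.Random(seed)
    n = len(responses)
    props = sorted(
        sum(1 for r in rng.choices(responses, k=n) if r == target) / n
        for _ in range(n_boot)
    )
    lower = props[int((alpha / 2) * n_boot)]
    upper = props[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Made-up counts of (enriched, literal) responses, for illustration only.
humans = [40, 10]
models = [10, 40]
print(chi_squared_statistic(humans, models))
print(response_entropy(humans))
print(bootstrap_ci(["enriched"] * 40 + ["literal"] * 10, "enriched"))
```

Turning the statistic into a p-value requires the chi-squared distribution with k - 1 degrees of freedom for a 2 x k table, which a statistics library would supply; the sketch stops at the statistic to stay dependency-free.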

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional detail and quantitative support will strengthen the manuscript. We address each major point below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Methods] Methods: The manuscript provides no explicit description of the exact prompt templates, paraphrases, or consistency checks used with the LLMs. Without these details it is impossible to determine whether the reported deviations from human pragmatic patterns reflect genuine reasoning differences or sensitivity to surface phrasing, as the skeptic note correctly flags.

    Authors: We agree that full transparency regarding prompt construction is necessary to evaluate whether observed differences arise from reasoning or from surface-form sensitivity. The revised manuscript will include the complete set of prompt templates for each language, all paraphrases employed, and the consistency checks applied across models. These materials will be placed in a dedicated appendix with example inputs and outputs.

    Revision: yes

  2. Referee: [Results] Results / population-matching analysis: No statistical tests, confidence intervals, or measures of inter-participant / inter-model variability are reported. The central claim that LLMs 'fail to capture pragmatic enrichments' therefore rests on qualitative descriptions of model behavior whose reliability cannot be assessed from the given information.

    Authors: The population-matching design was chosen to enable direct visual comparison of response distributions rather than aggregate accuracy. Nevertheless, we recognize that the absence of formal statistical support limits the strength of the claims. In the revision we will add chi-squared tests comparing response-type distributions between humans and models, report inter-participant and inter-model variability (e.g., standard deviations and response entropy), and include bootstrap confidence intervals for the proportions of each inference type.

    Revision: yes

  3. Referee: [Discussion] Discussion: The assertion that LLM accuracy is 'neither predicted nor boosted by open vs. closed status, training orientation, or architecture type' is presented without the supporting regression or correlation analysis that would make the claim falsifiable.

    Authors: The statement reflects the lack of any systematic advantage visible across the sampled models from different categories. To render the claim testable, the revised manuscript will include a supplementary regression analysis in which model-level accuracy on pragmatic enrichment is regressed on binary indicators for openness, training regime, and architecture family, with appropriate controls for model size.

    Revision: yes
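
A regression of the kind described in response 3 might look roughly like the sketch below. It is an ordinary least squares fit solved through the normal equations, which is a simplification: accuracy is a proportion, so a logistic or mixed-effects model may be more appropriate in practice. The three 0/1 indicators, their coding, and the accuracy values are invented for illustration; a model-size control would simply be one more column.

```python
from itertools import product


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients; an intercept column is prepended."""
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical indicators per model: (is_open, is_moe, training_flag).
rows = list(product([0, 1], repeat=3))
# Made-up accuracies built from known coefficients so the fit can be checked.
accuracy = [0.5 + 0.1 * o + 0.0 * m + 0.2 * t for o, m, t in rows]
coefficients = ols(rows, accuracy)
print(coefficients)  # intercept, then one coefficient per indicator
```

A coefficient near zero for an indicator, together with its uncertainty estimate (not computed here), is what would support the claim that the corresponding design feature does not predict accuracy.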

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of LLM and human responses

Full rationale

The paper reports results from a population-matching experiment that directly compares LLM outputs on conditional inferences to human responses across four languages. No mathematical derivations, fitted parameters, predictions, or self-referential steps are present in the abstract or described methodology. The central claim that LLMs act as accurate semantic operators but miss pragmatic enrichments rests on observed behavioral patterns rather than any reduction to inputs by construction. This is a standard empirical study with no load-bearing self-citations or ansatzes that could introduce circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen linguistic tasks measure pragmatic reasoning in a comparable way for humans and LLMs, drawn from standard linguistic theory.

axioms (1)
  • domain assumption: Humans enrich logical reasoning through pragmatic inferences based on context and social norms.
    Invoked to interpret human performance as the target benchmark for human-like reasoning.

pith-pipeline@v0.9.0 · 5767 in / 1204 out tokens · 41310 ms · 2026-05-21T05:01:41.490963+00:00 · methodology


