Tracing the ongoing emergence of human-like reasoning in Large Language Models

Elena Pagliarini; Evelina Leivada; Fritz Günther; Nikoleta Pantelidou; Paolo Morosi

arxiv: 2605.21299 · v1 · pith:Y4KOYDJVnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

Paolo Morosi, Nikoleta Pantelidou, Fritz Günther, Elena Pagliarini, Evelina Leivada

Pith reviewed 2026-05-21 05:01 UTC · model grok-4.3

classification: cs.CL · cs.AI

keywords: large language models · pragmatic reasoning · conditional inferences · semantic operators · human-AI comparison · pragmatic enrichments · multilingual reasoning · inference tasks

The pith

Large language models accurately process literal meanings of conditionals but fail to make the pragmatic inferences that humans routinely add.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models interpret conditional statements the way humans do by going beyond literal logic to add implied meanings. A population-matching experiment compared responses from twenty-five LLMs to equal numbers of human participants across four languages on sentences such as promises that depend on an action versus statements that hold regardless of a condition. Humans consistently enriched the logic with pragmatic assumptions that vary by context. The models split into two patterns: some followed the strict logical truth table and ignored implications, while others applied one fixed reading across all cases. This shows LLMs function well as semantic processors but have not acquired the flexible pragmatic layer of human reasoning, and the gap does not depend on model openness, training focus, or architecture.

Core claim

Humans enrich logical reasoning through pragmatic inferences when processing conditional sentences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but ignore pragmatic inferences, while others deviate from the truth-table by adhering to a single interpretation across the board. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. LLM accuracy is neither predicted nor boosted by open versus closed status, training orientation, or architecture type.
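The contrast between the truth-table reading and the enriched readings can be made concrete with a small sketch. It assumes a textbook formalization that this summary does not itself spell out: the literal reading as the material conditional, the promise-type enrichment as a biconditional, and the "regardless of the condition" reading as asserting the consequent alone. All function names are hypothetical and are not taken from the paper.

```python
from itertools import product


def material(p: bool, q: bool) -> bool:
    """Literal reading: 'if p then q' is false only when p is true and q is false."""
    return (not p) or q


def biconditional(p: bool, q: bool) -> bool:
    """Enriched promise reading (assumed): 'q if and only if p'."""
    return p == q


def consequent_only(p: bool, q: bool) -> bool:
    """Enriched 'regardless' reading (assumed): q holds whatever p is."""
    return q


def diverging_cases(reading_a, reading_b):
    """List the (p, q) assignments on which two readings disagree."""
    return [
        (p, q)
        for p, q in product([True, False], repeat=2)
        if reading_a(p, q) != reading_b(p, q)
    ]


# Lawn not mowed, money paid anyway: allowed literally, ruled out pragmatically.
print(diverging_cases(material, biconditional))
# Hearer not hungry, no pizza in the oven: allowed literally, ruled out pragmatically.
print(diverging_cases(material, consequent_only))
```

Under these assumptions the readings disagree only when the antecedent is false, which is where a model that follows the truth table and a human who enriches it would be expected to give different answers.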

What carries the argument

Population-matching experiment that compares LLM and human responses to conditional inference questions in four languages.
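As a rough illustration of how such a comparison might be summarized, the sketch below groups scored responses by agent type, language, and condition, then reports mean accuracy with a standard error. The record layout, the condition labels, and every value in `toy_records` are made up for illustration; the paper's actual data format and scoring are not described in this summary.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Hypothetical toy records: (agent_type, language, condition, correct).
# These values are invented and are NOT results from the paper.
toy_records = [
    ("human", "en", "standard", 1),
    ("human", "en", "standard", 1),
    ("human", "en", "standard", 0),
    ("human", "en", "standard", 1),
    ("llm", "en", "standard", 1),
    ("llm", "en", "standard", 0),
    ("llm", "en", "standard", 0),
    ("llm", "en", "standard", 1),
]


def accuracy_by_group(rows):
    """Return {(agent, language, condition): (mean accuracy, standard error, n)}."""
    groups = defaultdict(list)
    for agent, language, condition, correct in rows:
        groups[(agent, language, condition)].append(correct)

    summary = {}
    for key, values in groups.items():
        n = len(values)
        # Standard error of the mean needs at least two observations.
        se = stdev(values) / sqrt(n) if n > 1 else float("nan")
        summary[key] = (mean(values), se, n)
    return summary


for key, (acc, se, n) in accuracy_by_group(toy_records).items():
    print(key, f"accuracy={acc:.2f}", f"SE={se:.3f}", f"n={n}")
```

Keeping the group sizes equal for humans and models, as the population-matching design does, means the two standard errors are computed over the same number of observations.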

If this is right

LLMs can be treated as reliable for literal logical processing of conditionals.
Pragmatic enrichment is not yet produced by current architectures or training regimes.
Human-like task performance can coexist with non-human underlying reasoning mechanisms.
Pragmatic reasoning remains an emerging rather than established capability in LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Dialogue systems built on current LLMs may miss implied meanings that affect user expectations.
Targeted training on pragmatic inference examples could be tested as a way to close the observed gap.
The same experiment design could be applied to other linguistic constructions to map additional divergences.
Scaling model size alone may not automatically produce the missing pragmatic layer.

Load-bearing premise

The chosen conditional sentences and inference questions capture human pragmatic reasoning, and LLM output patterns reflect internal processes rather than surface response strategies.

What would settle it

New LLMs tested on the same conditional sentences and questions producing response distributions that match human pragmatic patterns across languages would challenge the central claim.
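One way to make "response distributions that match human pragmatic patterns" operational is to compute a distance between the two distributions of answers for a given item. The sketch below uses Jensen-Shannon divergence, which is an illustrative choice made here and not a metric named in the paper; the response labels and counts are hypothetical.

```python
from math import log2


def normalize(counts):
    """Turn a dict of response counts into a dict of proportions."""
    total = sum(counts.values())
    return {label: value / total for label, value in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    labels = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in labels}

    def kl(a):
        # Terms with zero probability contribute nothing to the sum.
        return sum(
            a.get(k, 0.0) * log2(a.get(k, 0.0) / mixture[k])
            for k in labels
            if a.get(k, 0.0) > 0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical counts for one item; NOT data from the paper.
human_counts = {"yes": 4, "no": 21}
model_counts = {"yes": 20, "no": 5}
print(js_divergence(normalize(human_counts), normalize(model_counts)))
```

A value near zero on every item and in every language would be the kind of evidence that challenges the central claim; a value near one indicates that the two populations answer in opposite ways.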

Figures

Figures reproduced from arXiv: 2605.21299 by Elena Pagliarini, Evelina Leivada, Fritz Günther, Nikoleta Pantelidou, Paolo Morosi.

**Figure 1.** Accuracy across conditions, agents, and languages. Colored triangles indicate mean LLM accuracy per language, while colored circles indicate mean human accuracy per language. Black triangles and circles indicate overall accuracy for models and humans respectively. Vertical bars represent SE.

**Figure 2.** Mean accuracy in experimental trials by model and language. The black line represents model mean accuracy across languages.

**Figure 3.** Mean accuracy across languages for all models in critical Standard and critical Biscuit conditions. The black line represents mean accuracy across the two conditions.

Original abstract

Humans effortlessly go beyond literal meanings: "If you mow the lawn, I will give you fifty dollars" is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas "If you are hungry, there is pizza in the oven" implies that pizza is available regardless of the hearer's hunger. Large Language Models (LLMs) show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twentyfive LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but they ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a population-matching experiment in which 25 LLMs and an equal number of human participants per language were tested on conditional inference tasks across four languages. Humans are shown to enrich logical conditionals with pragmatic inferences (e.g., biconditional or availability readings), while LLMs display heterogeneous behavior: some models adhere closely to classical truth-table semantics and ignore pragmatic enrichments, whereas others converge on a single interpretation regardless of context. The authors conclude that current LLMs act as accurate semantic operators but have not yet acquired the pragmatic reasoning component characteristic of human inference, and that this limitation is independent of model openness, training regime, or architecture.

Significance. If the empirical patterns are robust, the work supplies concrete cross-linguistic evidence that pragmatic enrichment remains an open challenge for LLMs, thereby sharpening the distinction between semantic competence and human-like reasoning. The finding that openness, training orientation, and architecture type all fail to predict pragmatic performance is a useful negative result for model development. The population-matching design itself is a strength, as it directly compares model output distributions to human response distributions rather than relying on aggregate accuracy scores.

major comments (3)

[Methods] Methods: The manuscript provides no explicit description of the exact prompt templates, paraphrases, or consistency checks used with the LLMs. Without these details it is impossible to determine whether the reported deviations from human pragmatic patterns reflect genuine reasoning differences or sensitivity to surface phrasing, as the skeptic note correctly flags.
[Results] Results / population-matching analysis: No statistical tests, confidence intervals, or measures of inter-participant / inter-model variability are reported. The central claim that LLMs 'fail to capture pragmatic enrichments' therefore rests on qualitative descriptions of model behavior whose reliability cannot be assessed from the given information.
[Discussion] Discussion: The assertion that LLM accuracy is 'neither predicted nor boosted by open vs. closed status, training orientation, or architecture type' is presented without the supporting regression or correlation analysis that would make the claim falsifiable.

minor comments (2)

[Abstract] The abstract states 'twentyfive LLMs' without a hyphen; correct to 'twenty-five'.
[Methods] Clarify whether the same set of conditional sentences was used for all languages or whether stimuli were translated and back-translated; this affects the cross-linguistic comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional detail and quantitative support will strengthen the manuscript. We address each major point below and commit to revisions that directly respond to the concerns raised.

point-by-point responses

Referee: [Methods] Methods: The manuscript provides no explicit description of the exact prompt templates, paraphrases, or consistency checks used with the LLMs. Without these details it is impossible to determine whether the reported deviations from human pragmatic patterns reflect genuine reasoning differences or sensitivity to surface phrasing, as the skeptic note correctly flags.

Authors: We agree that full transparency regarding prompt construction is necessary to evaluate whether observed differences arise from reasoning or from surface-form sensitivity. The revised manuscript will include the complete set of prompt templates for each language, all paraphrases employed, and the consistency checks applied across models. These materials will be placed in a dedicated appendix with example inputs and outputs. revision: yes
Referee: [Results] Results / population-matching analysis: No statistical tests, confidence intervals, or measures of inter-participant / inter-model variability are reported. The central claim that LLMs 'fail to capture pragmatic enrichments' therefore rests on qualitative descriptions of model behavior whose reliability cannot be assessed from the given information.

Authors: The population-matching design was chosen to enable direct visual comparison of response distributions rather than aggregate accuracy. Nevertheless, we recognize that the absence of formal statistical support limits the strength of the claims. In the revision we will add chi-squared tests comparing response-type distributions between humans and models, report inter-participant and inter-model variability (e.g., standard deviations and response entropy), and include bootstrap confidence intervals for the proportions of each inference type. revision: yes
Referee: [Discussion] Discussion: The assertion that LLM accuracy is 'neither predicted nor boosted by open vs. closed status, training orientation, or architecture type' is presented without the supporting regression or correlation analysis that would make the claim falsifiable.

Authors: The statement reflects the lack of any systematic advantage visible across the sampled models from different categories. To render the claim testable, the revised manuscript will include a supplementary regression analysis in which model-level accuracy on pragmatic enrichment is regressed on binary indicators for openness, training regime, and architecture family, with appropriate controls for model size. revision: yes
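The quantitative additions promised in the second response can be sketched in a few lines. The block below is a minimal standard-library illustration of a Pearson chi-squared statistic for a humans-by-models response table, a response entropy, and a percentile bootstrap interval for a proportion. It is not the authors' analysis: the function names, the resampling settings, and the example counts are all assumptions, and the regression mentioned in the third response is not covered.

```python
import random
from math import log2


def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c count table.

    Assumes every row and column total is positive, so expected counts are nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def response_entropy(counts):
    """Shannon entropy (bits) of a list of response counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the proportion of 1s in a 0/1 list."""
    rng = random.Random(seed)
    n = len(outcomes)
    props = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = props[int((alpha / 2) * n_resamples)]
    upper = props[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical counts (rows: humans, models; columns: response types); NOT from the paper.
example_table = [[20, 5], [6, 19]]
print(chi_squared_statistic(example_table))
print(response_entropy([20, 5]))
print(bootstrap_ci([1] * 20 + [0] * 5))
```

Turning the statistic into a p-value needs the chi-squared survival function, which the standard library does not provide; `scipy.stats.chi2.sf(stat, dof)` is one option if SciPy is available.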

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of LLM and human responses

full rationale

The paper reports results from a population-matching experiment that directly compares LLM outputs on conditional inferences to human responses across four languages. No mathematical derivations, fitted parameters, predictions, or self-referential steps are present in the abstract or described methodology. The central claim that LLMs act as accurate semantic operators but miss pragmatic enrichments rests on observed behavioral patterns rather than any reduction to inputs by construction. This is a standard empirical study with no load-bearing self-citations or ansatzes that could introduce circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the chosen linguistic tasks measure pragmatic reasoning in a comparable way for humans and LLMs, drawn from standard linguistic theory.

axioms (1)

domain assumption: Humans enrich logical reasoning through pragmatic inferences based on context and social norms.
Invoked to interpret human performance as the target benchmark for human-like reasoning.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


A. Singh, et al. OpenAI GPT -5 System Card. arXiv [Preprint] (2025). https://arxiv.org/abs/2601.03267 (accessed 11 March 2026)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[75]

Grok 4 [Large language model]

xAI. Grok 4 [Large language model]. https://x.ai/ (2025)

work page 2025
[76]

Kimi K2: Open Agentic Intelligence

Kimi Team. Kimi K2: Open Agentic Intelligence. arXiv [Preprint] (2026). https://arxiv.org/abs/2507.20534 (accessed 11 March 2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[77]

The Llama 3 Herd of Models

A. Grattafiori, et al. The Llama 3 Herd of Models. arXiv [Preprint] (2024). https://arxiv.org/abs/2407.21783 (accessed 11 March 2026)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[78]

The Llama 4 Herd: Architecture, Training, Evaluation, and Deployment Notes

MetaAI. The Llama 4 Herd: Architecture, Training, Evaluation, and Deployment Notes. arXiv [Preprint] (2026). https://arxiv.org/abs/2601.11659 (accessed 11 March 2026)

work page arXiv 2026
[79]

Mistral 7B

A. Jiang, et al. Mistral 7B. arXiv [Preprint] (2023). https://arxiv.org/abs/2310.06825 (accessed 11 March 2026)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[80]

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

NVIDIA. NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba - Transformer Reasoning Model. arXiv [Preprint] (2025). https://arxiv.org/abs/2508.14444 (accessed 11 March 2026)

work page internal anchor Pith review Pith/arXiv arXiv 2025

Showing first 80 references.

[1] K. Mahowald et al., Dissociating language and thought in large language models. Trends Cogn Sci 28(6), 517–540 (2024)

[2] R. Futrell, K. Mahowald, How linguistics learned to stop worrying and love the language models. Behav Brain Sci, 1–98 (2025). https://doi.org/10.1017/S0140525X2510112X

[3] N. Chomsky. Aspects of the Theory of Syntax. The MIT Press. 1965

[4] N. Chomsky. The Minimalist Program. The MIT Press. 1995

[5] S. T. Piantadosi. “Modern language models refute Chomsky’s approach to language”. In From Fieldwork to Linguistic Theory: A Tribute to Dan Everett, E. Gibson, M. Poliak, Eds. (Language Science Press, 2023), pp. 353–414

[6] A. Moro, M. Greco, S. F. Coppola, Large languages, impossible languages and human brains. Cortex 167, 82–85 (2023)

[7] L. Rizzi, On the complementarity of Generative Grammar and Large Language Models. Italian Journal of Linguistics 37(1), 145–152 (2025)

[8] G. C. Ramchand, Is it the end of Generative linguistics as we know it? Italian Journal of Linguistics 37(1), 131–144 (2025)

[9] R. Katzir, Why large language models are poor theories of human linguistic cognition: A reply to Piantadosi. Biolinguistics 17 (2023). https://doi.org/10.5964/bioling.13153

[10] N. Chomsky, I. Roberts, J. Watumull. Noam Chomsky: The false promise of ChatGPT. The New York Times (2023). Available at https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html (accessed 20 February 2026)

[11] C. Chesi, Is it the end of (generative) linguistics as we know it? Italian Journal of Linguistics 37(1), 3–44 (2025)

[12] M. Baroni. “On the proper role of linguistically oriented deep net analysis in linguistic theorizing”. In Algebraic Structure in Natural Language, L. Shalom, J-P. Bernardy, Eds. (CRC Press, 2022), pp. 1–16

[13] L. Weissweiler et al. “Counting the bugs in ChatGPT’s Wugs: A multilingual investigation into the morphological capabilities of a Large Language Model”. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, B. Kalika, Eds. (ACL, 2023), pp. 6508–6524

[14] T. A. Dang, L. Raviv, L. Galke. “Morphology matters: Probing the cross-linguistic morphological generalization abilities of Large Language Models through a Wug Test”. In Proceedings of the 13th Edition of the Workshop on Cognitive Modeling and Computational Linguistics, T. Kuribavaski et al., Eds. (ACL, 2024), pp. 177–188

[15] N. Pantelidou, E. Leivada, R. Montero, P. Morosi. Community size rather than grammatical complexity better predicts Large Language Model accuracy in a novel Wug Test. PLoS One 21(3): e0343164 (2026). https://doi.org/10.1371/journal.pone.0343164

[16] A. Temerko, M. Garcia, P. Gamallo. “A continuous approach to metaphorically motivated regular polysemy in language models”. In Proceedings of the 29th Conference on Computational Natural Language Learning. (ACL, 2025), pp. 419–436

[17] K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, M. Baroni. “Colorless green recurrent networks dream hierarchically”. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1: Long papers. (ACL, 2018), pp. 1195–1205

[18] K. Mahowald. “A discerning several thousand judgments: GPT-3 rates the article + adjective + numeral + noun construction”. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. (ACL, 2023), pp. 265–273

[19] C. Potts. Characterizing English preposing in PP constructions. Journal of Linguistics, 1–39 (2024)

[20] E. Leivada, E. Murphy, G. Marcus, DALL·E 2 fails to reliably capture common syntactic processes. Social Sciences & Humanities Open 8(1) (2023). https://doi.org/10.1016/j.ssaho.2023.100648

[21] H. Zhou et al. How well do Large Language Models understand syntax? An evaluation by asking natural language questions. arXiv [Preprint] (2023). https://arxiv.org/abs/2311.08287 (accessed 20 February 2026)

[22] X. Yang, T. Aoyama, Y. Yao, E. Wilcox. “Anything goes? A crosslinguistic study of (im)possible language learning in LMs”. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Volume 1: Long papers. (ACL, 2025), pp. 26058–26077

[23] V. Dentella, F. Günther, E. Leivada, Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias. Proc. Natl. Acad. Sci. U.S.A. 120 (51) e2309583120 (2023). https://doi.org/10.1073/pnas.2309583120

[24] E. Leivada, F. Günther, V. Dentella, Reply to Hu et al: Applying different evaluation standards to humans vs. Large Language Models overestimates AI performance. Proc. Natl. Acad. Sci. U.S.A. 121 (36) e2406752121 (2024). https://doi.org/10.1073/pnas.2406752121

[25] E. M. Bender, A. Koller. “Climbing towards NLU: On meaning, form, and understanding in the age of data”. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. (ACL, 2020), pp. 5185–5198

[26] M. Shanahan, Talking about Large Language Models. Commun. ACM 67(2), 68–79 (2023). https://doi.org/10.1145/3624724

[27] V. Dentella, F. Günther, E. Murphy, G. Marcus, E. Leivada, Testing AI on language comprehension tasks reveals insensitivity to underlying meaning. Sci Rep 14, 28083 (2024). https://doi.org/10.1038/s41598-024-79531-8

[28] E. Leivada, G. Marcus, F. Günther, E. Murphy. A sentence is worth a thousand pictures: Can Large Language Models understand hum4n l4ngu4ge and the world behind words? Philosophical Transactions of the Royal Society A

[29] L. Weissweiler, V. Hofmann, A. Köksal, H. Schütze, Explaining pretrained language models’ understanding of linguistic structures using construction grammar. Front Artif Intell 6, 1225791 (2023). https://doi.org/10.3389/frai.2023.1225791

[30] C. Collacciani, G. Rambelli, M. Bolognesi. “Quantifying generalizations: Exploring the divide between human and LLMs’ sensitivity to quantification”. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Volume 1: Long papers. (ACL, 2024), pp. 11811–11822

[31] R. Montero, N. Moskvina, P. Morosi, T. Serrano, E. Pagliarini, E. Leivada, Quantification and object perception in multimodal large language models deviate from human linguistic cognition. arXiv [Preprint] (2026). https://arxiv.org/abs/2511.08126 (accessed 25 March 2026)

[32] Z. Qiu, X. Duan, Z. Cai. “Does ChatGPT resemble humans in processing implicatures?” In Proceedings of the 4th Natural Logic Meets Machine Learning Workshop (ACL, 2023), pp. 25–34

[33] J. Hu, S. Floyd, O. Juravlev, E. Fedorenko, E. Gibson. “A fine-grained comparison of pragmatic language understanding in humans and language models”. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Volume 1: Long papers. (ACL, 2023), pp. 4194–4213

[34] D. Park, J. Lee, H. Jeong, S. Park, S. Lee. “Pragmatic competence evaluation of Large Language Models for the Korean Language”. In Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation (ACL, 2024), pp. 256–266

[35] X. Wu et al. Uncovering the fragility of trustworthy LLMs through Chinese textual ambiguity. arXiv [Preprint] (2025). https://arxiv.org/abs/2507.23121 (accessed 20 February 2026)

[36] D. Kahneman, A. Tversky, On the study of statistical intuitions. Cognition 11(2), 123–141 (1982). https://doi.org/10.1016/0010-0277(82)90022-1

[37] D. J. Hilton, The social context of reasoning: Conversational inference and rational judgment. Psychol Bull 118(2), 248–271 (1995). https://psycnet.apa.org/doi/10.1037/0033-2909.118.2.248

[38] K. E. Stanovich, R. F. West, Individual differences in reasoning: Implications for the rationality debate? Behav Brain Sci 23(5), 645–665 (2000). https://doi.org/10.1017/s0140525x00003435

[39] D. Lassiter, Distinguishing semantics, pragmatics, and reasoning in the theory of conditionals. Inquiry 68(7), 2431–2452 (2025). https://doi.org/10.1080/0020174X.2024.2405186

[40] M. L. Geis, A. M. Zwicky, On invited inferences. Linguist Inq 2(4), 561–566 (1971)

[41] M. E. A. Siegel, Biscuit conditionals: Quantification over potential literal acts. Linguist Philos 29, 167–203 (2006). https://doi.org/10.1007/s10988-006-0003-2

[42] J. L. Austin, “Ifs and Cans”, in Philosophical Papers (Oxford University Press, 1961), pp. 153–180

[43] E. Evcen, D. Barner, Already perfect: Language users access the pragmatic meanings of conditionals first. Open Mind 9, 1098–1120 (2025). https://doi.org/10.1162/opmi.a.17

[44] B. van Tiel, W. Schaeken, Processing conversational implicatures: Alternatives and counterfactual reasoning. Cogn Sci 41(S5), 1119–1154 (2016). https://doi.org/10.1111/cogs.12362

[45] F. Cariani, L. J. Rips. “Experimenting with (conditional) perfection: Tests of the Exhaustivity Theory”. In Palgrave Studies in Pragmatics, Language and Cognition, S. Kaufmann, D. E. Over, G. Sharma, Eds. (Palgrave Macmillan, 2023). https://doi.org/10.1007/978-3-031-05682-6_9

[46] A. de Varda, C. Saponaro, M. Marelli, High variability in LLMs’ analogical reasoning. Nat Hum Behav 9(7), 1339–1341 (2025)

[47] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell. “On the dangers of stochastic parrots: Can language models be too big?”. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT, 2021), pp. 610–623. https://doi.org/10.1145/3442188.3445922

[48] J. Wei et al., Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)

[49] T. Schick, L. Qin, H. Schütze, Tool-Augmented Language Models: Augmenting LLMs with External Reasoning Modules. arXiv [Preprint] (2025). https://arxiv.org/abs/2205.12255 (accessed 20 February 2026)

[50] T. B. Brown et al. “Language models are few-shot learners”. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS, 2020), pp. 1877–1901

[51] A. Chowdhery et al., PaLM: Scaling Language Modeling with Pathways. J Mach Learn Res 24, 1–113 (2023)

[52] N. Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv [Preprint] (2017). https://arxiv.org/abs/1701.06538 (accessed 20 February 2026)

[53] W. Fedus, B. Zoph, N. Shazeer, Switch transformers: Scaling to trillion parameter models with simple and efficient MoE. J Mach Learn Res 23, 1–39 (2022)

[54] M. Artetxe et al., Efficient Large-Scale Language Modeling with Mixture of Experts. arXiv [Preprint] (2022). https://arxiv.org/abs/2112.10684 (accessed 20 February 2026)

[55] R. Stalnaker, “A theory of conditionals”. In IFs. The University of Western Ontario Series in Philosophy of Science 15, W. L. Harper, R. Stalnaker, G. Pearce, Eds. (Springer, 1968). https://doi.org/10.1007/978-94-009-9117-0_2

[56] D. Lewis, Probabilities of conditionals and conditional probabilities. Philos Rev 85, 297–315 (1976)

[57] J. van der Auwera, Pragmatics in the last quarter century: The case of conditional perfection. J Pragmat 27, 261–274 (1997)

[58] K. von Fintel, Conditional Strengthening. Unpublished Manuscript (MIT, 2001). http://mit.edu/fintel/fintel-2001-condstrength.pdf

[59] M. Franke, “Signal to Act: Game Theory in Pragmatics”, PhD thesis. University of Amsterdam (2009); Available from https://hdl.handle.net/11245/1.313416

[60] E. Herburger, “Only if: If only we understood it”. In Proceedings of Sinn und Bedeutung 19 (2015). https://doi.org/10.18148/sub/2015.v19i0.235

[61] M. Franke, “The pragmatics of biscuit conditionals”. In Proceedings of the 16th Amsterdam Colloquium, pp. 91–96 (2007). https://platform.openjournals.nl/PAC/article/view/22859

[62] M. Biezma, A. Goebel, Being pragmatic about biscuits. Linguist Philos 46(3), 567–626 (2023)

[63] J. Zehr, F. Schwarz. PennController for Internet Based Experiments (IBEX) (2018). https://doi.org/10.17605/OSF.IO/MD832

[64] Anthropic. Claude Haiku 4.5 (Oct 15 version) [Large Language Model]. https://claude.ai (2025)

[65] Anthropic. System Card: Claude Opus 4 & Claude Sonnet 4. Anthropic system cards page (2025)

[66] Anthropic. Claude Sonnet 4.5 (Sep 29 version) [Large Language Model]. https://claude.ai (2025)

[67] DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv [Preprint] (2024). https://arxiv.org/abs/2412.19437 (accessed 11 March 2026)

[68] V. Sanh, L. Debut, J. Chaumond, T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv [Preprint] (2020). https://arxiv.org/abs/1910.01108 (accessed 11 March 2026)

[69] E. Almazrouei, et al. The Falcon Series of Open Language Models. arXiv [Preprint] (2023). https://arxiv.org/abs/2311.16867 (accessed 11 March 2026)

[70] Google DeepMind. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv [Preprint] (2025). https://arxiv.org/abs/2507.06261 (accessed 11 March 2026)

[71] Gemma Team. Gemma 3 Technical Report. arXiv [Preprint] (2025). https://arxiv.org/abs/2503.19786 (accessed 11 March 2026)

[72] GLM-V Team. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. arXiv [Preprint] (2026). https://arxiv.org/abs/2507.01006 (accessed 11 March 2026)

[73] OpenAI. GPT-4o System Card. arXiv [Preprint] (2024). https://arxiv.org/abs/2410.21276 (accessed 11 March 2026)

[74] A. Singh, et al. OpenAI GPT-5 System Card. arXiv [Preprint] (2025). https://arxiv.org/abs/2601.03267 (accessed 11 March 2026)

[75] xAI. Grok 4 [Large language model]. https://x.ai/ (2025)

[76] Kimi Team. Kimi K2: Open Agentic Intelligence. arXiv [Preprint] (2026). https://arxiv.org/abs/2507.20534 (accessed 11 March 2026)

[77] A. Grattafiori, et al. The Llama 3 Herd of Models. arXiv [Preprint] (2024). https://arxiv.org/abs/2407.21783 (accessed 11 March 2026)

[78] MetaAI. The Llama 4 Herd: Architecture, Training, Evaluation, and Deployment Notes. arXiv [Preprint] (2026). https://arxiv.org/abs/2601.11659 (accessed 11 March 2026)

[79] A. Jiang, et al. Mistral 7B. arXiv [Preprint] (2023). https://arxiv.org/abs/2310.06825 (accessed 11 March 2026)

[80] NVIDIA. NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model. arXiv [Preprint] (2025). https://arxiv.org/abs/2508.14444 (accessed 11 March 2026)