Tracing the ongoing emergence of human-like reasoning in Large Language Models
Pith reviewed 2026-05-21 05:01 UTC · model grok-4.3
The pith
Large language models accurately process literal meanings of conditionals but fail to make the pragmatic inferences that humans routinely add.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Humans enrich logical reasoning through pragmatic inferences when processing conditional sentences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but ignore pragmatic inferences, while others deviate from the truth-table by adhering to a single interpretation across the board. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. LLM accuracy is neither predicted nor boosted by open versus closed status, training orientation, or architecture type.
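The contrast between following the truth-table and adding pragmatic enrichment can be made concrete with a small sketch. It is an illustration only, not the paper's materials or scoring: the function names are hypothetical, and reducing the availability reading to "the consequent holds" is a simplifying assumption.

```python
from itertools import product

def material(p: bool, q: bool) -> bool:
    """Literal truth-table reading of 'if p then q': false only when p holds and q fails."""
    return (not p) or q

def biconditional(p: bool, q: bool) -> bool:
    """Enriched 'if and only if' reading (pay only if the lawn is mowed)."""
    return p == q

def availability(p: bool, q: bool) -> bool:
    """Simplified availability reading (assumption): the consequent holds whatever p is."""
    return q

def truth_table(reading):
    """Map every (p, q) assignment to the truth value the reading gives it."""
    return {(p, q): reading(p, q) for p, q in product([True, False], repeat=2)}

def divergences(a, b):
    """List the (p, q) assignments on which two readings disagree."""
    ta, tb = truth_table(a), truth_table(b)
    return [row for row in ta if ta[row] != tb[row]]

# The literal and enriched readings differ only when the antecedent is false
# and the consequent is true (lawn not mowed, money paid anyway).
print(divergences(material, biconditional))  # [(False, True)]
print(divergences(material, availability))   # [(False, False)]
```

A respondent who always answers according to `material` is following the truth-table; one who rejects the `(False, True)` case for the lawn-mowing sentence is showing the enriched reading that the summary attributes to humans.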
What carries the argument
Population-matching experiment that compares LLM and human responses to conditional inference questions in four languages.
If this is right
- LLMs can be treated as reliable for literal logical processing of conditionals.
- Pragmatic enrichment is not yet produced by current architectures or training regimes.
- Human-like task performance can coexist with non-human underlying reasoning mechanisms.
- Pragmatic reasoning remains an emerging rather than established capability in LLMs.
Where Pith is reading between the lines
- Dialogue systems built on current LLMs may miss implied meanings that affect user expectations.
- Targeted training on pragmatic inference examples could be tested as a way to close the observed gap.
- The same experiment design could be applied to other linguistic constructions to map additional divergences.
- Scaling model size alone may not automatically produce the missing pragmatic layer.
Load-bearing premise
The chosen conditional sentences and inference questions capture human pragmatic reasoning, and LLM output patterns reflect internal processes rather than surface response strategies.
What would settle it
New LLMs tested on the same conditional sentences and questions producing response distributions that match human pragmatic patterns across languages would challenge the central claim.
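One possible way to operationalize "response distributions that match human pragmatic patterns" is a distance between the two distributions. The sketch below uses total variation distance; this choice of measure, the response labels, and the function names are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

def distribution(responses, labels):
    """Relative frequency of each label in a list of categorical responses."""
    counts = Counter(responses)
    n = len(responses)
    return {label: counts[label] / n for label in labels}

def total_variation(p, q):
    """Total variation distance: 0.0 for identical distributions, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

# Hypothetical labels and responses, not data from the paper.
LABELS = ["yes", "no"]
humans = distribution(["yes", "yes", "no", "no"], LABELS)
models = distribution(["yes", "yes", "yes", "yes"], LABELS)
print(total_variation(humans, models))  # 0.5
```

A value near zero for every sentence type and language would be the kind of match that challenges the central claim; any threshold for "near" would have to be justified separately.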
Original abstract
Humans effortlessly go beyond literal meanings: "If you mow the lawn, I will give you fifty dollars" is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas "If you are hungry, there is pizza in the oven" implies that pizza is available regardless of the hearer's hunger. Large Language Models (LLMs) show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twentyfive LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but they ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a population-matching experiment in which 25 LLMs and an equal number of human participants per language were tested on conditional inference tasks across four languages. Humans are shown to enrich logical conditionals with pragmatic inferences (e.g., biconditional or availability readings), while LLMs display heterogeneous behavior: some models adhere closely to classical truth-table semantics and ignore pragmatic enrichments, whereas others converge on a single interpretation regardless of context. The authors conclude that current LLMs act as accurate semantic operators but have not yet acquired the pragmatic reasoning component characteristic of human inference, and that this limitation is independent of model openness, training regime, or architecture.
Significance. If the empirical patterns are robust, the work supplies concrete cross-linguistic evidence that pragmatic enrichment remains an open challenge for LLMs, thereby sharpening the distinction between semantic competence and human-like reasoning. The finding that open versus closed status, training orientation, and architecture type do not reliably predict pragmatic performance is a useful negative result for model development. The population-matching design itself is a strength, as it directly compares model output distributions to human response distributions rather than relying on aggregate accuracy scores.
major comments (3)
- [Methods] Methods: The manuscript provides no explicit description of the exact prompt templates, paraphrases, or consistency checks used with the LLMs. Without these details it is impossible to determine whether the reported deviations from human pragmatic patterns reflect genuine reasoning differences or sensitivity to surface phrasing, as the skeptic note correctly flags.
- [Results] Results / population-matching analysis: No statistical tests, confidence intervals, or measures of inter-participant / inter-model variability are reported. The central claim that LLMs 'fail to capture pragmatic enrichments' therefore rests on qualitative descriptions of model behavior whose reliability cannot be assessed from the given information.
- [Discussion] Discussion: The assertion that LLM accuracy is 'neither predicted nor boosted by open vs. closed status, training orientation, or architecture type' is presented without the supporting regression or correlation analysis that would make the claim falsifiable.
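The kind of analysis the third major comment asks for can be sketched in miniature. The example below regresses a per-model score on a single binary indicator and attaches a permutation p-value. It is a deliberately reduced illustration: the function names and all numbers are hypothetical, and it omits the multiple predictors and any further controls that a full analysis would need.

```python
import random
from statistics import mean

def ols_slope(x, y):
    """Least-squares slope of y on a single predictor x.

    For a 0/1 predictor this equals the difference between the two group means.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the slope, shuffling y against x."""
    rng = random.Random(seed)
    observed = abs(ols_slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(ols_slope(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical values, not data from the paper:
# 1 = open-weight model, 0 = closed model; scores are per-model accuracies.
is_open = [0, 0, 0, 1, 1, 1]
accuracy = [0.55, 0.60, 0.40, 0.50, 0.45, 0.65]
print(ols_slope(is_open, accuracy), permutation_p_value(is_open, accuracy))
```

A slope near zero with a large p-value would be compatible with "neither predicted nor boosted", although with only 25 models a non-significant result alone cannot establish the absence of an effect.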
minor comments (2)
- [Abstract] The abstract states 'twentyfive LLMs' without a hyphen; correct to 'twenty-five'.
- [Methods] Clarify whether the same set of conditional sentences was used for all languages or whether stimuli were translated and back-translated; this affects the cross-linguistic comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key areas where additional detail and quantitative support will strengthen the manuscript. We address each major point below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: [Methods] Methods: The manuscript provides no explicit description of the exact prompt templates, paraphrases, or consistency checks used with the LLMs. Without these details it is impossible to determine whether the reported deviations from human pragmatic patterns reflect genuine reasoning differences or sensitivity to surface phrasing, as the skeptic note correctly flags.
  Authors: We agree that full transparency regarding prompt construction is necessary to evaluate whether observed differences arise from reasoning or from surface-form sensitivity. The revised manuscript will include the complete set of prompt templates for each language, all paraphrases employed, and the consistency checks applied across models. These materials will be placed in a dedicated appendix with example inputs and outputs.
  Revision: yes
- Referee: [Results] Results / population-matching analysis: No statistical tests, confidence intervals, or measures of inter-participant / inter-model variability are reported. The central claim that LLMs 'fail to capture pragmatic enrichments' therefore rests on qualitative descriptions of model behavior whose reliability cannot be assessed from the given information.
  Authors: The population-matching design was chosen to enable direct visual comparison of response distributions rather than aggregate accuracy. Nevertheless, we recognize that the absence of formal statistical support limits the strength of the claims. In the revision we will add chi-squared tests comparing response-type distributions between humans and models, report inter-participant and inter-model variability (e.g., standard deviations and response entropy), and include bootstrap confidence intervals for the proportions of each inference type.
  Revision: yes
- Referee: [Discussion] Discussion: The assertion that LLM accuracy is 'neither predicted nor boosted by open vs. closed status, training orientation, or architecture type' is presented without the supporting regression or correlation analysis that would make the claim falsifiable.
  Authors: The statement reflects the lack of any systematic advantage visible across the sampled models from different categories. To render the claim testable, the revised manuscript will include a supplementary regression analysis in which model-level accuracy on pragmatic enrichment is regressed on binary indicators for openness, training regime, and architecture family, with appropriate controls for model size.
  Revision: yes
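The statistics named in the second response (a chi-squared comparison, response entropy, and bootstrap confidence intervals) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the authors' analysis: the function names, response categories, and counts are all hypothetical, and the chi-squared function returns only the statistic and degrees of freedom, so a p-value would still have to come from a chi-squared distribution.

```python
import math
import random

def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table.

    `table` is a list of rows of counts; every row and column total must be nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def entropy_bits(counts):
    """Shannon entropy (in bits) of a response distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def bootstrap_ci(responses, label, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the proportion of `label` in `responses`."""
    rng = random.Random(seed)
    n = len(responses)
    props = sorted(
        sum(1 for r in rng.choices(responses, k=n) if r == label) / n
        for _ in range(n_boot)
    )
    lower = props[int((alpha / 2) * n_boot)]
    upper = props[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical counts, not data from the paper.
# Columns: "follows", "does not follow", "cannot tell".
humans = [14, 8, 3]
models = [22, 2, 1]
print(chi_squared([humans, models]))
print(entropy_bits(humans), entropy_bits(models))
print(bootstrap_ci(["follows"] * 14 + ["other"] * 11, "follows"))
```

With small samples, some expected cell counts may be too low for the chi-squared approximation to be reliable; an exact or permutation test is a common alternative in that situation.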
Circularity Check
No circularity: direct empirical comparison of LLM and human responses
The paper reports results from a population-matching experiment that directly compares LLM outputs on conditional inferences to human responses across four languages. No mathematical derivations, fitted parameters, predictions, or self-referential steps are present in the abstract or described methodology. The central claim that LLMs act as accurate semantic operators but miss pragmatic enrichments rests on observed behavioral patterns rather than any reduction to inputs by construction. This is a standard empirical study with no load-bearing self-citations or ansatzes that could introduce circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Humans enrich logical reasoning through pragmatic inferences based on context and social norms.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.