Developmental approach reveals the statistical learning of Neural Language Models: Transformers generalize from the most abstract statistical patterns

Elizabeth Wonnacott; Holly Jenkins; Wang Bojun

arxiv: 2606.27460 · v1 · pith:52VAUEFQnew · submitted 2026-06-25 · 💻 cs.CL

Pith reviewed 2026-06-29 02:11 UTC · model grok-4.3

classification 💻 cs.CL

keywords: neural language models, statistical learning, transformers, synthetic grammar, over-generalizations, developmental approach, language cognition

The pith

Neural language models acquire the most abstract global statistical knowledge first, then the local dependencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains successive generative transformer models on a synthetic grammar and saves internal states at multiple points during training. Tracking how representations change shows that the broadest abstract statistical patterns appear earliest while narrower local dependencies appear later. The early phase includes many over-generalizations that later become more constrained. The authors present this sequence as the basis for a new framework describing statistical learning in neural language models.

Core claim

Through a developmental approach analyzing model states saved during training on a synthetic grammar, the models acquire the most abstract global statistical knowledge at the beginning of learning and later acquire the relatively local statistical dependencies. This learning path contains many over-generalizations from the very beginning and these over-generalizations are gradually constrained in the later stage of learning.
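
One way to make the over-generalization claim measurable is to split a model's next-token probability mass by how specifically each token matches the grammar. The sketch below is illustrative only: the token naming scheme (a category letter followed by class and item indices, such as `P11`), the function `split_mass`, and the two example distributions are assumptions made for this example, not the paper's grammar, code, or results.

```python
from typing import Dict, Set


def category(token: str) -> str:
    """Global level: the category letter (hypothetical naming scheme)."""
    return token[0]


def middle_class(token: str) -> str:
    """Middle level: category letter plus first index (hypothetical)."""
    return token[:2]


def split_mass(dist: Dict[str, float], licensed: Set[str]) -> Dict[str, float]:
    """Split next-token probability mass by specificity of the match.

    dist     -- token -> probability at one checkpoint
    licensed -- tokens the grammar actually allows in this context
    """
    licensed_classes = {middle_class(t) for t in licensed}
    licensed_categories = {category(t) for t in licensed}
    mass = {"local": 0.0, "middle_overgen": 0.0, "global_overgen": 0.0, "other": 0.0}
    for token, p in dist.items():
        if token in licensed:
            mass["local"] += p            # exactly what the grammar licenses
        elif middle_class(token) in licensed_classes:
            mass["middle_overgen"] += p   # right class, wrong item
        elif category(token) in licensed_categories:
            mass["global_overgen"] += p   # right category, wrong class
        else:
            mass["other"] += p            # wrong category altogether
    return mass


# Invented distributions for illustration only (not taken from the paper).
early = {"P11": 0.25, "P12": 0.25, "P21": 0.25, "P22": 0.25}
late = {"P11": 0.75, "P12": 0.125, "P21": 0.125}

early_mass = split_mass(early, licensed={"P11"})
late_mass = split_mass(late, licensed={"P11"})
```

If the reported trajectory holds, the `middle_overgen` and `global_overgen` shares should shrink from earlier to later checkpoints while the `local` share grows.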

What carries the argument

Developmental tracking of internal representations saved at successive stages of training on a synthetic grammar.
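
As a rough sketch of what developmental tracking can look like in practice, the snippet below takes one score per statistical level at each saved checkpoint and reports the first training step at which each level crosses a threshold. The score (for example, probability mass on licensed continuations), the 0.9 threshold, the function names, and the numbers are all assumptions for illustration; the paper's own analysis of internal representations may differ.

```python
from typing import Dict, List, Optional, Tuple

Curve = List[Tuple[int, float]]  # (training step, score at that checkpoint)


def first_step_above(curve: Curve, threshold: float) -> Optional[int]:
    """Return the earliest checkpoint step whose score reaches the threshold."""
    for step, score in sorted(curve):
        if score >= threshold:
            return step
    return None  # never reached within the saved checkpoints


def acquisition_order(
    curves: Dict[str, Curve], threshold: float = 0.9
) -> List[Tuple[str, Optional[int]]]:
    """Order levels by when they first reach the threshold (unreached last)."""
    onsets = {level: first_step_above(curve, threshold) for level, curve in curves.items()}
    return sorted(onsets.items(), key=lambda item: (item[1] is None, item[1] or 0))


# Placeholder scores, invented for this example (not results from the paper).
curves = {
    "local": [(100, 0.10), (200, 0.30), (300, 0.50)],
    "middle": [(100, 0.20), (200, 0.60), (300, 0.92)],
    "global": [(100, 0.50), (200, 0.95), (300, 0.99)],
}

order = acquisition_order(curves)
```

With these placeholder curves the global level crosses the threshold first, the middle level later, and the local level not at all within the saved checkpoints, which is the ordering the paper reports.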

If this is right

Over-generalizations occur from the earliest stages and become constrained later.
Abstract global statistical knowledge is acquired before relatively local dependencies.
The developmental sequence itself constitutes the statistical learning process of the models.
A new framework for statistical learning and language cognition in NLMs follows directly from the observed trajectory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the same early-to-late ordering appears when models are trained on natural-language corpora, the framework would extend beyond synthetic data.
Intervening in training to supply local patterns earlier might alter the rate or final accuracy of acquiring abstract patterns.
Comparing the saved states against human developmental data on similar artificial grammars could test whether the model path matches any observed child learning order.

Load-bearing premise

The synthetic grammar's statistical structure is representative of the patterns that matter for natural language, and the observed changes in model representations reflect genuine statistical learning rather than training artifacts or analysis choices.

What would settle it

Re-training the models on a synthetic grammar whose structure reverses the abstract-to-local order and finding that local dependencies appear first would falsify the claimed learning path.

Figures

Figures reproduced from arXiv: 2606.27460 by Elizabeth Wonnacott, Holly Jenkins, Wang Bojun.

**Figure 1.** Inheritance hierarchy for English transitive schema. schema specifies that the constituent before the word kick is the kicker and the constituent after it is the kicked. After acquiring multiple frequent and structurally similar lexical transitive schema, higher-order generalizations could be formed by tracking the structural alignment among the lexical schema. This relatively abstract grammatical schem…

**Figure 2.** Inheritance hierarchy in artificial language

**Figure 3.** Dependency relation in the inheritance hierarchy. design a synthetic grammar with dependency relations nested in hierarchical manner. The grammar is designed to contain three levels of statistical regularity: global level, middle level, and local level. This allows us to examine the learning path in a nested inheritance hierarchy. The inheritance hierarchy is illustrated in figure 2. The dependency relat…

**Figure 4.** Critical stages from motion chart visualization. The visualization demonstrates a gradience of statistical learning. The dependency schema on the global level is learned at the very early stage of development while the dependency on the local level is learned in the end. That is, the model acquired the most abstract global statistical knowledge at the beginning and later gradually acquire relatively local…

**Figure 6.** Example probability ranking at early stage of learning (15,000 iterations) given prompt M-N1-P11. 3.3 Permuted order of MNPQ language. To demonstrate that the pattern we observed is not related to the linear order in this synthetic grammar, we created six different synthetic grammar datasets with different orders of the MNPQ categories. The Q category remains at the final position, while the positions of the remai…

**Figure 7.** Probability mass analysis for models trained on different synthetic grammars. 4 Discussion. The current investigation provides evidence that the statistical learning of NLMs is not a random process. There are systematic patterns in the path of generalizations. These models generalize from the most global distributional dependency in the input and later acquire the relatively local dependency schema. This i…
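
The Figure 6 excerpt mentions six datasets that differ in the order of the MNPQ categories, with Q kept in final position. A minimal sketch of how such orderings could be enumerated follows. It assumes the six variants are the 3! permutations of M, N, and P, which matches the stated count but is not confirmed by the truncated text; the function name `mnpq_orders` is hypothetical.

```python
from itertools import permutations
from typing import List, Sequence


def mnpq_orders(
    movable: Sequence[str] = ("M", "N", "P"), final: str = "Q"
) -> List[List[str]]:
    """Enumerate category orders with the final category held fixed.

    Assumption: every ordering of the movable categories is used once.
    """
    return [list(order) + [final] for order in permutations(movable)]


orders = mnpq_orders()
for order in orders:
    print("-".join(order))
```

Training one model per ordering and repeating the same checkpoint analysis would then show whether the global-to-local pattern depends on linear position.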

Original abstract

In this study, we use a developmental approach to investigate the statistical learning and mental representation of neural language models (NLM). A series of Generative Transformer models are trained on a synthetic grammar. The model states are saved at multiple stages in the course of training. Through analyzing how the internal representations of these models change in the developmental path, we found that NLMs acquire the most abstract global statistical knowledge at the beginning of learning and later acquire the relatively local statistical dependencies. This learning path contains many over-generalizations from the very beginning and these over-generalizations are gradually constrained in the later stage of learning. Based on this observation, we propose a new framework to explain the statistical learning and language cognition of NLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Transformers trained on this synthetic grammar pick up abstract global statistics early and local ones later, with early over-generalization; how well the grammar matches natural language is the open question.

The main takeaway is that the authors train a series of generative transformers on a synthetic grammar, save checkpoints throughout training, and track how internal representations shift. They report that the most abstract global statistical patterns appear first, local dependencies come later, and early overgeneralizations get gradually constrained.

What stands out is the focus on the training trajectory itself rather than final performance. Saving multiple model states and analyzing representation changes is a direct way to observe order of acquisition, and that angle is worth having in the literature on statistical learning in neural nets.

The soft spot is the single synthetic grammar. Nothing in the abstract shows that its generative rules reproduce the nested long-range dependencies or hierarchical structure typical of natural language, so the observed order could be tied to the grammar's limited inductive bias rather than a general fact about NLMs. The abstract also gives no detail on how abstract versus local patterns are operationalized or measured, which leaves room for analysis choices to shape the result. The proposed framework appears derived from the same observations, so it functions more as a summary than an independently grounded account.

This is for readers working on training dynamics in language models or on links between neural net learning and cognitive models of acquisition. A person already interested in developmental analyses of transformers would get value from the checkpoint approach, provided the methods section clarifies the grammar and the metrics.

I would send it for peer review so the grammar design and representation analysis can be checked directly.

Referee Report

2 major / 2 minor

Summary. The manuscript trains a series of Generative Transformer models on a synthetic grammar, saves internal states at multiple training stages, and analyzes representational changes to argue that NLMs first acquire the most abstract global statistical patterns and only later acquire local dependencies, with early over-generalizations that are gradually constrained; a new explanatory framework for NLM statistical learning is proposed on this basis.

Significance. If the developmental trajectory is shown to be robust and the synthetic grammar's statistics are representative of natural language, the work could contribute to understanding generalization dynamics in transformers and offer parallels to human language acquisition studies. The use of checkpointed model states across training is a methodological strength that supports tracking of representational shifts.

major comments (2)

[Methods / synthetic grammar definition] The central claim that the observed global-to-local trajectory is a general property of NLM statistical learning rests on a single synthetic grammar (described in the methods); without evidence that its generative rules reproduce nested long-range dependencies or hierarchical structure typical of natural language, the trajectory may be an artifact of the grammar's inductive bias rather than a general finding.
[Framework section (post-results)] The proposed framework (outlined after the empirical results) is constructed directly from the training-path observations on this grammar; it is unclear whether the framework supplies independent, falsifiable predictions or simply restates quantities already fitted to the same developmental data, creating a circularity risk for the explanatory claim.

minor comments (2)

[Results / analysis] Add quantitative metrics, error bars, and statistical tests for the reported shifts in global vs. local knowledge and for the over-generalization counts across checkpoints.
[Analysis methods] Clarify the precise operational definitions used to label a statistical pattern as 'abstract global' versus 'local' and how these labels are computed from model representations.
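
The first minor comment asks for error bars on checkpoint-level metrics. One generic way to produce them, sketched here under stated assumptions, is a percentile bootstrap over repeated training runs. The helper `bootstrap_ci`, the choice of 2000 resamples, and the per-seed scores are hypothetical and are not drawn from the manuscript.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_ci(
    values: List[float], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper


# Hypothetical scores from five training seeds at a single checkpoint.
scores = [0.80, 0.85, 0.90, 0.75, 0.95]
mean, lower, upper = bootstrap_ci(scores)
```

Reporting such an interval at every checkpoint, for each statistical level, would show whether the global and local curves are separated by more than run-to-run variation.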

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and note planned revisions to strengthen the manuscript.

Referee: [Methods / synthetic grammar definition] The central claim that the observed global-to-local trajectory is a general property of NLM statistical learning rests on a single synthetic grammar (described in the methods); without evidence that its generative rules reproduce nested long-range dependencies or hierarchical structure typical of natural language, the trajectory may be an artifact of the grammar's inductive bias rather than a general finding.

Authors: We appreciate the referee's point on the scope of generalization. The synthetic grammar was constructed with generative rules intended to produce both abstract global patterns and nested long-range dependencies (detailed in Methods). That said, we agree that evidence from only one grammar leaves open the possibility that the observed trajectory reflects grammar-specific biases. In revision we will expand the Methods with explicit examples and statistics showing how the grammar encodes hierarchical structure and long-range dependencies, and we will add a Limitations subsection in the Discussion that acknowledges the single-grammar design and calls for future multi-grammar validation. These changes will make the evidential basis and its boundaries clearer without overstating generality.

revision: partial

Referee: [Framework section (post-results)] The proposed framework (outlined after the empirical results) is constructed directly from the training-path observations on this grammar; it is unclear whether the framework supplies independent, falsifiable predictions or simply restates quantities already fitted to the same developmental data, creating a circularity risk for the explanatory claim.

Authors: The framework is indeed derived from the developmental observations on this grammar and functions primarily as an organizing explanation of those data. We accept that, as currently presented, it risks appearing post-hoc. In the revised manuscript we will rewrite the Framework section to (a) explicitly state its empirical grounding, (b) separate descriptive summaries of the observed trajectory from the framework's interpretive claims, and (c) articulate concrete, testable predictions (e.g., how altering the ratio of global versus local statistics or changing model depth should shift the timing of over-generalization resolution). These additions will reduce circularity and position the framework as a source of future predictions rather than a restatement of the present results.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical observations on synthetic grammar do not reduce to fitted inputs by construction

The paper trains Generative Transformer models on a synthetic grammar, saves intermediate states, and reports observed changes in internal representations (abstract global statistics acquired first, followed by local dependencies and gradual constraint of over-generalizations). The proposed framework is explicitly based on this empirical observation rather than any mathematical derivation, uniqueness theorem, or parameter fit that is then relabeled as a prediction. No equations, self-citations, or ansatzes are invoked in the provided text that would create a self-definitional loop or force a result by construction. The work is self-contained as a developmental empirical study; external validity concerns (e.g., grammar representativeness) fall outside circularity analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5653 in / 986 out tokens · 21811 ms · 2026-06-29T02:11:48.405763+00:00 · methodology

[1] [1]

A., Goyal, N., & Tsvetkov, Y

Ahuja, K., Balachandran, V., Panwar, M., He, T., Smith, N. A., Goyal, N., & Tsvetkov, Y. (2024).Learning syntax without planting trees: Understanding when and why transformers generalize hierarchically

2024

[2] [2]

Boas, H. C. (2011). Coercion and leaking argument structures in construction grammar.Linguistics, 49(6).https://doi.org/10.1515/ling.2011.036

work page doi:10.1515/ling.2011.036 2011

[3] [3]

Bowerman, M., & Croft, W. (2007). The acquisition of the english causative alternation. InCrosslinguistic Perspectives on Argument Structure. Routledge

2007

[4]

Brown, H., Smith, K., Samara, A., & Wonnacott, E. (2022). Semantic cues in language learning: An artificial language study with adult and child learners. Language, Cognition and Neuroscience, 37(4), 509–531. https://doi.org/10.1080/23273798.2021.1995612

[5]

Chomsky, N. (1995). The minimalist program. MIT Press.

[6]

Croft, W. (2015). Force dynamics and directed change in event lexicalization and argument realization. In R. G. de Almeida & C. Manouilidou (Eds), Cognitive Science Perspectives on Verb Representation and Processing (pp. 103–129). Springer International Publishing. https://doi.org/10.1007/978-3-319-10112-5_5

[7]

Croft, W. A. (2003). Lexical rules vs. constructions: A false dichotomy. In H. Cuyckens, T. Berg, R. Dirven, & K.-U. Panther (Eds), Current Issues in Linguistic Theory (Vol. 243, pp. 49–68). John Benjamins Publishing Company. https://doi.org/10.1075/cilt.243.07cro

[8]

Croft, W., & Cruse, D. A. (2004). Cognitive Linguistics.

[9]

Futrell, R., & Mahowald, K. (2025). How linguistics learned to stop worrying and love the language models. Behavioral and Brain Sciences, 1–98. https://doi.org/10.1017/S0140525X2510112X

[10]

Futrell, R., Wilcox, E., Morita, T., Qian, P., Ballesteros, M., & Levy, R. (2019). Neural language models as psycholinguistic subjects: Representations of syntactic state (No. arXiv:1903.03260). arXiv. https://doi.org/10.48550/arXiv.1903.03260

[11]

Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure. University of Chicago Press. https://press.uchicago.edu/ucp/books/book/chicago/C/bo3683810.html

[12]

Goldberg, A. E. (2019, February 12). Explain me this: Creativity, competition, and the partial productivity of constructions. https://doi.org/10.2307/j.ctvc772nn

[13]

Hardy, M., Sucholutsky, I., Thompson, B., & Griffiths, T. (2023). Large language models meet cognitive science: LLMs as tools, models, and participants. Proceedings of the Annual Meeting of the Cognitive Science Society, 45(45). https://escholarship.org/uc/item/6dp9k2gz

[14]

Haspelmath, M. (2008). Parametric versus functional explanations of syntactic universals. In T. Biberauer (Ed.), Linguistik Aktuell/Linguistics Today (Vol. 132, pp. 75–107). John Benjamins Publishing Company. https://doi.org/10.1075/la.132.04has

[15]

Hewitt, J., & Manning, C. D. (2019). A structural probe for finding syntax in word representations. In J. Burstein, C. Doran, & T. Solorio (Eds), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4129–4138). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1419

[17]

Hilpert, M. (2019). Construction Grammar and its Application to English. Edinburgh University Press. https://doi.org/10.1515/9781474433624

[18]

Hofmann, V., Weissweiler, L., Mortensen, D. R., Schütze, H., & Pierrehumbert, J. B. (2025). Derivational morphology reveals analogical generalization in large language models. Proceedings of the National Academy of Sciences, 122(19), e2423232122. https://doi.org/10.1073/pnas.2423232122

[19]

Isbilen, E. S., & Christiansen, M. H. (2022). Statistical learning of language: A meta-analysis into 25 years of research. Cognitive Science, 46(9), e13198. https://doi.org/10.1111/cogs.13198

[20]

Jackendoff, R. (1977). X̄ syntax: A study of phrase structure. MIT Press.

[21]

Kallens, P., Kristensen-McLachlan, R. D., & Christiansen, M. H. (2023). Large Language Models Demonstrate the Potential of Statistical Learning in Language. Cognitive Science, 47(3), e13256. https://doi.org/10.1111/cogs.13256

[22]

Kim, N., & Smolensky, P. (2021). Testing for grammatical category abstraction in neural language models. In A. Ettinger, E. Pavlick, & B. Prickett (Eds), Proceedings of the Society for Computation in Linguistics 2021 (pp. 467–470). Association for Computational Linguistics. https://aclanthology.org/2021.scil-1.59/

[23]

Kiparsky, P. (1997). Remarks on Denominal Verbs. In Alsina, A., Bresnan, J., & Sells, P. (Eds), Complex Predicates. The University of Chicago Press.

[24]

Baroni, M., & Dehaene, S. (2021). Mechanisms for handling nested dependencies in neural-network language models and humans. Cognition, 213, 104699. https://doi.org/10.1016/j.cognition.2021.104699

[25]

Langacker, R. W. (1987). Foundations of cognitive grammar: Volume I: Theoretical prerequisites. Stanford University Press.

[26]

Langacker, R. W. (2009). Investigations in cognitive grammar. Walter de Gruyter.

[27]

Lany, J., & Saffran, J. R. (2010). From Statistics to Meaning: Infants' Acquisition of Lexical Categories. Psychological Science, 21(2), 284–291. https://doi.org/10.1177/0956797609358570

[28]

Lany, J., & Saffran, J. R. (2011). Interactions between statistical and semantic information in infant language development. Developmental Science, 14(5), 1207–1219. https://doi.org/10.1111/j.1467-7687.2011.01073.x

[29]

Levin, B. (1993). English verb classes and alternations: A preliminary investigation. University of Chicago Press.

[30]

Levin, B., & Hovav, M. R. (1994). Unaccusativity: At the syntax-lexical semantics interface. MIT Press.

[31]

Levin, B., & Rappaport Hovav, M. (1995). Unaccusativity: At the syntax-lexical semantics interface. MIT Press.

[32]

Levin, B., & Rappaport Hovav, M. (2005). Argument realization. Cambridge University Press.

[33]

Levin, B. (2015). Semantics and pragmatics of argument alternations. Annual Review of Linguistics, 1, 63–83. https://doi.org/10.1146/annurev-linguist-030514-125141

[34]

Li, B., Zhu, Z., Thomas, G., Rudzicz, F., & Xu, Y. (2022). Neural reality of argument structure constructions. In S. Muresan, P. Nakov, & A. Villavicencio (Eds), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 7410–7423). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.512

[35]

Li, J., & Liu, Y. (2025). An investigation of comparative correlative constructions in auto-regressive large language models: From construction grammar to computational understanding [Preprint]. Research Square. https://doi.org/10.21203/rs.3.rs-6702743/v1

[36]

Lieven, E. V. M., Pine, J. M., & Baldwin, G. (1997). Lexically-based learning and early grammatical development. Journal of Child Language, 24(1), 187–219. https://doi.org/10.1017/S0305000996002930

[37]

Linzen, T., Dupoux, E., & Goldberg, Y. (2016). Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies. Transactions of the Association for Computational Linguistics, 4, 521–535. https://doi.org/10.1162/tacl_a_00115

[38]

Mintz, T. H. (2002). Category induction from distributional cues in an artificial language. Memory & Cognition, 30(5), 678–686. https://doi.org/10.3758/BF03196424

[39]

Misyak, J. B., Christiansen, M. H., & Tomblin, J. B. (2009). Statistical learning of nonadjacencies predicts on-line processing of long-distance dependencies in natural language. Proceedings of the Cognitive Science Society.

[40]

Morgan, J. L., & Newport, E. L. (1981). The role of constituent structure in the induction of an artificial language. Journal of Verbal Learning and Verbal Behavior, 20(1), 67–85. https://doi.org/10.1016/S0022-5371(81)90312-1

Müller, S. (2017). Head-driven phrase structure grammar, sign-based construction grammar, and fluid construction grammar: Commonaliti...

[41]

Murty, S., Sharma, P., Andreas, J., & Manning, C. D. (2023). Grokking of hierarchical structure in vanilla transformers (No. arXiv:2305.18741). arXiv. https://doi.org/10.48550/arXiv.2305.18741

[42]

Pelucchi, B., Hay, J. F., & Saffran, J. R. (2009a). Learning in reverse: Eight-month-old infants track backward transitional probabilities. Cognition, 113(2), 244–247. https://doi.org/10.1016/j.cognition.2009.07.011

[43]

Perek, F. (2015). Argument Structure in Usage-Based Construction Grammar: Experimental and corpus-based perspectives (Vol. 17). John Benjamins Publishing Company. https://doi.org/10.1075/cal.17

[44]

Perek, F., & Goldberg, A. E. (2015). Generalizing beyond the input: The functions of the constructions matter. Journal of Memory and Language, 84, 108–127. https://doi.org/10.1016/j.jml.2015.04.006

[45]

Perek, F., & Goldberg, A. E. (2017). Linguistic generalization on the basis of function and constraints on the basis of statistical preemption. Cognition, 168, 276–293. https://doi.org/10.1016/j.cognition.2017.06.019

[46]

Pinker, S. (1989). Learnability and cognition: The acquisition of argument structure (pp. xiv, 411). The MIT Press.

Rappaport Hovav, M., & Levin, B. (1998). Building verb meanings. The projection of arguments: Lexical and compositional factors, 97–134.

[47]

Reeder, P. A., Newport, E. L., & Aslin, R. N. (2013). From shared contexts to syntactic categories: The role of distributional information in learning linguistic form-classes. Cognitive Psychology, 66(1), 30–54. https://doi.org/10.1016/j.cogpsych.2012.09.001

[48]

Reeder, P. A., Newport, E. L., & Aslin, R. N. (2017). Distributional learning of subcategories in an artificial grammar: Category generalization and subcategory restrictions. Journal of Memory and Language, 97, 17–29. https://doi.org/10.1016/j.jml.2017.07.006

[49]

Romberg, A. R., & Saffran, J. R. (2010). Statistical learning and language acquisition. WIREs Cognitive Science, 1(6), 906–914. https://doi.org/10.1002/wcs.78

[50]

Saffran, J. R. (2001). The Use of Predictive Dependencies in Language Learning. Journal of Memory and Language, 44(4), 493–515. https://doi.org/10.1006/jmla.2000.2759

[51]

Saffran, J. R. (2020). Statistical Language Learning in Infancy. Child Development Perspectives, 14(1), 49–54. https://doi.org/10.1111/cdep.12355

[52]

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926–1928. https://doi.org/10.1126/science.274.5294.1926

[53]

Fazekas, J., & Ambridge, B. (2025). Learners restrict their linguistic generalizations using preemption but not entrenchment: Evidence from artificial-language-learning studies with adults and children. Psychological Review, 132(1), 1–17. https://doi.org/10.1037/rev0000463

[54]

Smith, K. H. (1969). Learning co-occurrence restrictions: Rule induction or rote learning? Journal of Verbal Learning and Verbal Behavior, 8(2), 319–321. https://doi.org/10.1016/S0022-5371(69)80086-1

[55]

Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition. Harvard University Press. https://doi.org/10.2307/j.ctv26070v8

[56]

Tomasello, M. (2007). Acquiring Linguistic Constructions. In W. Damon & R. M. Lerner (Eds), Handbook of Child Psychology (1st edn). Wiley. https://doi.org/10.1002/9780470147658.chpsy0206

[57]

Thompson, S. P., & Newport, E. L. (2007). Statistical learning of syntax: The role of transitional probability. Language Learning and Development, 3(1), 1–42.

[58]

Wei, J., Garrette, D., Linzen, T., & Pavlick, E. (2021). Frequency effects on syntactic rule learning in transformers (No. arXiv:2109.07020). arXiv. https://doi.org/10.48550/arXiv.2109.07020

[59]

Levin, L., & Schütze, H. (2023a). Construction grammar provides unique insight into neural language models (No. arXiv:2302.02178). arXiv. https://doi.org/10.48550/arXiv.2302.02178

[60]

Weissweiler, L., Hofmann, V., Köksal, A., & Schütze, H. (2023b). Explaining pretrained language models’ understanding of linguistic structures using construction grammar. Frontiers in Artificial Intelligence, 6. https://doi.org/10.3389/frai.2023.1225791

[61]

Wonnacott, E. (2013). Learning: Statistical mechanisms in language acquisition. In P.-M. Binder & K. Smith (Eds), The Language Phenomenon (pp. 65–92). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-36086-2_4

[62]

Wonnacott, E., Brown, H., & Nation, K. (2017). Skewing the evidence: The effect of input structure on child and adult learning of lexically based patterns in an artificial language. Journal of Memory and Language, 95, 36–48. https://doi.org/10.1016/j.jml.2017.01.005