Measuring Form and Function in Language Models

Charles Yang; H\'ector Javier V\'azquez Mart\'inez

arxiv: 2605.28616 · v1 · pith:2LMHCOPVnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI

Measuring Form and Function in Language Models

H\'ector Javier V\'azquez Mart\'inez , Charles Yang This is my paper

Pith reviewed 2026-06-29 12:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords language modelsdeterminerschild language acquisitionsyntactic propertiesdiscourse propertiesprompting methodsevaluation benchmarkscognitive status

0 comments

The pith

No current language model trained on data comparable to children's input meets both the formal syntactic and functional discourse benchmarks for English determiners that children achieve.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop quantitative tests based on child language acquisition research to assess language models' grasp of determiners. They introduce the Contextual Alternative Choice prompting method to separately evaluate syntactic form and discourse function. Direct comparisons reveal that models with training data volumes similar to what children receive do not reach human-like performance on both aspects at once. Some much larger models do meet the benchmarks. The work provides a way to benchmark models against established child acquisition statistics.

Core claim

The paper establishes that while some very large language models can satisfy both formal syntactic and functional discourse properties of English determiners to the level of young children, no model trained on a comparable amount of data does so simultaneously. The Contextual Alternative Choice method enables this evaluation by testing targeted knowledge in context.

What carries the argument

The Contextual Alternative Choice (CAC) prompting method, which supplies contextual alternatives to test both syntactic rules and discourse knowledge of determiners in direct comparison to child acquisition data.

If this is right

Models trained on data volumes typical of child input will fail to meet at least one of the two benchmarks.
Scaling model size allows achievement of both formal and functional knowledge.
The CAC method supports direct numerical comparison between model performance and child language statistics.
Determiners serve as an early-acquired linguistic item suitable for such evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying similar metrics to other early-acquired linguistic items could reveal broader patterns in model vs child acquisition.
Training regimes might need to incorporate more discourse-level signals to close the gap without extreme scaling.
The results imply that formal syntax and discourse function may require different amounts of data or different learning mechanisms in models.

Load-bearing premise

The Contextual Alternative Choice method accurately isolates and measures syntactic and discourse knowledge of determiners in a manner comparable to the statistical benchmarks from child language acquisition research.

What would settle it

A model trained on a data volume similar to children's input that passes both the formal and functional CAC tests at child-like levels, or empirical evidence that CAC scores do not align with child performance patterns.

Figures

Figures reproduced from arXiv: 2605.28616 by Charles Yang, H\'ector Javier V\'azquez Mart\'inez.

**Figure 2.** Figure 2: Formal and functional benchmarks for the representative models in Table [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

We introduce quantitative metrics for child language acquisition to evaluate language models. Our focus is on the formal syntactic and functional discourse properties of determiners in English, which young children acquire early and accurately. We propose Contextual Alternative Choice (CAC), a new prompting method which provides targeted tests for both syntactic and discourse knowledge of language. The method enables direct comparison of language models against children, and more importantly, against statistical benchmarks independently established in empirical research. No current model trained on a comparable amount of data simultaneously meet both formal and functional benchmarks like human children, but some very large models do. We present our results as methodological and technical contributions, with specific emphasis on cognitive status of language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CAC prompting is a fresh angle for testing determiner knowledge but the paper gives no sign that it was checked against the original child data collection methods, so the headline comparison stays unanchored.

read the letter

The paper's core move is to introduce Contextual Alternative Choice prompting as a way to measure both the formal syntactic rules and the functional discourse uses of determiners in language models, then line those scores up against existing statistical benchmarks from child language acquisition studies. The authors report that only some very large models clear both bars the way children do.

The new piece is the CAC technique itself. It tries to create a single prompt format that hits syntax and discourse at once, which is a concrete attempt to make model tests line up with the kinds of distinctions developmental linguists already track.

The paper earns credit for reaching for those independent child benchmarks rather than inventing its own yardstick. That choice keeps the evaluation from being circular and points evaluation work toward real acquisition data.

The soft spot is exactly where the stress-test note flags it. The abstract describes CAC but shows no back-test or equivalence check against the original child corpora or elicitation procedures. If the prompt format, the alternatives presented, or the response demands differ in ways that change what knowledge is being tapped, the numerical comparison to child performance breaks. The main claim rests on that mapping holding, and nothing in the provided text confirms it does.

This work is for readers who want LM evaluations to track psycholinguistic benchmarks more closely. Someone already working on cognitive plausibility or determiner acquisition could pull useful ideas from the setup, but the current version leaves the central equivalence unshown.

I would send it to peer review because the goal is worth referee time and the method could be tightened with the missing validation steps, but only if the full paper supplies those checks. Without them the result stays preliminary.

Referee Report

2 major / 1 minor

Summary. The paper introduces quantitative metrics drawn from child language acquisition research to evaluate language models, focusing on the formal syntactic and functional discourse properties of English determiners. It proposes Contextual Alternative Choice (CAC), a new prompting technique designed to test both syntactic and discourse knowledge, enabling direct numerical comparison of LMs to independently established statistical benchmarks from child studies. The central empirical claim is that no current model trained on a comparable amount of data simultaneously meets both formal and functional benchmarks like human children, although some very large models do.

Significance. If the CAC method is validated as equivalent to child elicitation tasks, the work would supply a concrete, falsifiable protocol for measuring whether LMs exhibit human-like form-function dissociation in determiner use, strengthening the intersection of computational linguistics and developmental psycholinguistics.

major comments (2)

[Abstract] Abstract: The headline claim that CAC 'enables direct comparison' to child benchmarks and that 'no current model ... simultaneously meet both' is load-bearing, yet the abstract supplies no information on test construction, item counts, statistical controls, or any back-testing of CAC against the original child corpora or elicitation procedures used to derive the reference benchmarks. Without such validation, numerical equivalence cannot be assumed.
[Method] Method description (CAC): The assumption that CAC isolates syntactic versus discourse determiner knowledge in a manner directly comparable to child acquisition statistics is not shown to have been verified; mismatches in cue availability, response format, or task demands would render the cross-population comparison invalid and thereby undermine the reported model-versus-child result.

minor comments (1)

[Abstract] Abstract: The sentence 'No current model trained on a comparable amount of data simultaneously meet both formal and functional benchmarks like human children' contains a subject-verb agreement error ('meet' should be 'meets').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We agree that the abstract and method sections would benefit from additional detail on validation and comparability to strengthen the claims. We address each point below and will make revisions accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The headline claim that CAC 'enables direct comparison' to child benchmarks and that 'no current model ... simultaneously meet both' is load-bearing, yet the abstract supplies no information on test construction, item counts, statistical controls, or any back-testing of CAC against the original child corpora or elicitation procedures used to derive the reference benchmarks. Without such validation, numerical equivalence cannot be assumed.

Authors: We agree the abstract is too concise on these points. The full manuscript (Section 3) details CAC construction, item counts drawn from child corpora, and direct numerical comparison to the cited acquisition benchmarks. However, explicit back-testing of the prompting format against original child elicitation procedures is not reported. We will revise the abstract to include a sentence summarizing test parameters, item counts, and the alignment procedure used. revision: yes
Referee: [Method] Method description (CAC): The assumption that CAC isolates syntactic versus discourse determiner knowledge in a manner directly comparable to child acquisition statistics is not shown to have been verified; mismatches in cue availability, response format, or task demands would render the cross-population comparison invalid and thereby undermine the reported model-versus-child result.

Authors: This concern is well-taken. The method section motivates CAC by mapping its syntactic and discourse conditions onto the properties measured in the child studies, but we do not present a dedicated verification experiment or quantitative comparison of cue availability and response formats. We will add a new subsection (or expand the existing method discussion) that explicitly addresses potential mismatches, reports any available controls, and notes limitations in direct equivalence. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent external benchmarks

full rationale

The paper's core comparison uses CAC prompting to test LMs against statistical benchmarks from child language acquisition that are described as independently established in prior empirical research. The abstract and setup emphasize direct comparison to these external benchmarks rather than any internal fitting, self-definition, or self-citation chain that would force the result by construction. No equations, predictions, or uniqueness claims reduce to the paper's own inputs; the derivation chain remains self-contained against the cited child-language data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that CAC tests map cleanly onto established child acquisition statistics.

pith-pipeline@v0.9.1-grok · 5635 in / 1056 out tokens · 32736 ms · 2026-06-29T12:47:45.382385+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

62 extracted references · 21 canonical work pages

[1]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
[3]

Alhama, Ruthe Foushee, Dan Byrne, Allyson Ettinger, Afra Alishahi, and Susan Goldin-Meadow

Raquel G. Alhama, Ruthe Foushee, Dan Byrne, Allyson Ettinger, Afra Alishahi, and Susan Goldin-Meadow. 2024. https://doi.org/10.1073/pnas.2316527121 Using computational modeling to validate the onset of productive determiner–noun combinations in English -learning children . Proceedings of the National Academy of Sciences, 121(50):e2316527121

work page doi:10.1073/pnas.2316527121 2024
[4]

Alhama, Ruthe Foushee, Daniel Byrne, Allyson Ettinger, Susan Goldin-Meadow, and Afra Alishahi

Raquel G. Alhama, Ruthe Foushee, Daniel Byrne, Allyson Ettinger, Susan Goldin-Meadow, and Afra Alishahi. 2023. https://doi.org/10.18653/v1/2023.ijcnlp-main.21 Linguistic Productivity : the Case of Determiners in English . In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia - Pacific C...

work page doi:10.18653/v1/2023.ijcnlp-main.21 2023
[5]

Jean Berko. 1958. The child's learning of English morphology. Word, 14(2--3):150--177

1958
[6]

Lois Bloom. 1970. Language development: Form and function in emerging grammar. MIT Press, Cambridge, MA

1970
[7]

Melissa Bowerman. 1982. Reorganizational process in lexical and syntactic development. In Eric Wanner and Lila R. Gleitman, editors, Language acquisition: The state of the art, pages 319--346. Cambridge University Press, New York

1982
[8]

Roger Brown. 1973. First Language : The Early Stages . Harvard University Press, Cambridge

1973
[9]

Tetsuro Chihara, John Oller, Kelley Weaver, and Mary Anne Chavez-Oller. 1977. Are cloze items sensitive to constraints across sentences? 1. Language learning, 27(1):63--70

1977
[10]

Carol Chomsky. 1969. The acquisition of syntax in children from 5 to 10. MIT Press, Cambridge, MA

1969
[11]

Noam Chomsky. 1955. The logical structure of linguistic theory. Manuscript, Harvard University

1955
[12]

Noam Chomsky. 1957. Syntactic structures. Mouton, The Hague

1957
[13]

Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Jaap Jumelet, Tal Linzen, Aaron Mueller, Suchir Salhan, Raj Sanjay Shah, Alex Warstadt, and Ethan Gotlieb Wilcox. 2026. https://doi.org/10.48550/arXiv.2602.20092 BabyLM Turns 4 and Goes Multilingual : Call for Papers for the 2026 BabyLM Workshop . arXiv preprint. ArXiv:2602.20092 [cs]

work page doi:10.48550/arxiv.2602.20092 2026
[14]

McCarthy, Katharina Kann, Sabrina J

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, G \'e raldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. https://doi.org/10.18653/v1/K18-3001 The C o NLL -- SIGMORPHON 2018 shared task: Universal morphological reinflection ....

work page doi:10.18653/v1/k18-3001 2018
[15]

Mark Davies. 2008. Corpus of Contemporary American English (COCA): 410+ million words, 1990-present. Available at http://www.americancorpus.org/

2008
[16]

Jill G De Villiers and Peter A De Villiers. 1974. Competence and performance in child language: Are children really competent to judge? Journal of Child Language, 1(1):11--22

1974
[17]

Susan M. Ervin. 1964. Imitation and structural change in children's language. In Eric H. Lenneberg, editor, New directions in the study of child language, pages 163--189. MIT Press, Cambridge

1964
[18]

Larry Fenson, Philip S Dale, J Steven Reznick, Elizabeth Bates, Donna J Thal, Stephen J Pethick, Michael Tomasello, Carolyn B Mervis, and Joan Stiles. 1994. Variability in early communicative development. Monographs of the Society for Research in Child Development, pages i--185

1994
[19]

Olivia La Fiandra, Nathalie Fernandez Echeverri, Patrick Shafto, and Naomi H. Feldman. 2025. https://doi.org/10.18653/v1/2025.babylm-main.8 Large Language Models and Children Have Different Learning Trajectories in Determiner Acquisition . In Proceedings of the First BabyLM Workshop , pages 100--108, Suzhou, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.babylm-main.8 2025
[20]

Jill Gilkerson, Jeffrey A Richards, Steven F Warren, Judith K Montgomery, Charles R Greenwood, D Kimbrough Oller, John HL Hansen, and Terrance D Paul. 2017. Mapping the early language environment using all-day recordings and automated analysis. American journal of speech-language pathology, 26(2):248--265

2017
[21]

Talmy Giv \'o n. 1983. Topic continuity in discourse. John Benjamins Publishing Company

1983
[22]

Lila Gleitman and Charles Yang. 2022. Two-year-olds' referential determiners in discourse . In Boston Unversity Conference on Language Development

2022
[24]

Susan Goldin-Meadow and Charles Yang. 2017 b . https://doi.org/10.1016/j.neubiorev.2016.12.016 Statistical evidence that a child can create a combinatorial linguistic system without external linguistic input: Implications for language evolution . Neuroscience & Biobehavioral Reviews, 81:150--157

work page doi:10.1016/j.neubiorev.2016.12.016 2017
[25]

Zellig S. Harris. 1951. Methods in structural linguistics. University of Chicago Press, Chicago

1951
[26]

Matthew Honnibal and Mark Johnson. 2015. https://doi.org/10.18653/v1/D15-1162 An improved non-monotonic transition system for dependency parsing . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373--1378, Lisbon, Portugal. Association for Computational Linguistics

work page doi:10.18653/v1/d15-1162 2015
[27]

Jennifer Hu and Roger Levy. 2023. Prompting is not a substitute for probability measurements in large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5040--5060

2023
[28]

Jacobs, Loïc Grobol, and Alvin Tsang

Cassandra L. Jacobs, Loïc Grobol, and Alvin Tsang. 2024. https://doi.org/10.48550/ARXIV.2410.12057 Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned . arXiv preprint. Version Number: 2

work page doi:10.48550/arxiv.2410.12057 2024
[29]

Carina Kauf, Emmanuele Chersoni, Alessandro Lenci, Evelina Fedorenko, and Anna A Ivanova. 2024. https://doi.org/10.18653/v1/2024.blackboxnlp-1.18 Log probabilities are a reliable estimate of semantic plausibility in base and instruction-tuned language models . In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for N...

work page doi:10.18653/v1/2024.blackboxnlp-1.18 2024
[30]

Erwin Komen, Rosanne Hebing, Ans Van Kemenade, and Bettelou Los. 2014. Quantifying information structure change in english. In Information structure and syntactic change in Germanic and Romance languages , pages 81--110. John Benjamins Publishing Company, Amsterdam

2014
[31]

Nelson Francis

Henry Kučera and W. Nelson Francis. 1967. Computational analysis of present-day American English . Brown University Press, Providence

1967
[32]

Legate and Charles Yang

Julie A. Legate and Charles Yang. 2007. Morphosyntactic learning and the development of tense. Language Acquisition, 14(3):315--344

2007
[33]

Roger Levy. 2008. https://doi.org/10.1016/j.cognition.2007.05.006 Expectation-based syntactic comprehension . Cognition, 106(3):1126--1177

work page doi:10.1016/j.cognition.2007.05.006 2008
[34]

Christopher Lyons. 1999. Definiteness. Cambridge University Press, Cambridge

1999
[35]

Brian MacWhinney. 2000. The CHILDES project: Tools for analyzing talk , 3rd edition. Lawrence Erlbaum, Mahwah, NJ

2000
[36]

Michael Maratsos. 1976. The use of definite and indefinite reference in young children. Cambridge University Press, London

1976
[37]

McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden. 2019. https://doi.org/10.18653/v1/W19-4226 The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflec...

work page doi:10.18653/v1/w19-4226 2019
[38]

Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz

R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz. 2023. https://doi.org/10.1162/tacl_a_00567 How Much Do Language Models Copy From Their Training Data ? Evaluating Linguistic Novelty in Text Generation Using RAVEN . Transactions of the Association for Computational Linguistics, 11:652--670. Place: Cambridge, MA

work page doi:10.1162/tacl_a_00567 2023
[39]

Smith, and Yanai Elazar

William Merrill, Noah A. Smith, and Yanai Elazar. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.800 Evaluating n- Gram Novelty of Language Models Using Rusty - DAWG . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 14459--14473, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.emnlp-main.800 2024
[40]

Meylan, Michael C

Stephan C. Meylan, Michael C. Frank, Brandon C. Roy, and Roger Levy. 2017. https://doi.org/10.1177/0956797616677753 The Emergence of an Abstract Grammatical Category in Children ’s Early Speech . Psychological Science, 28(2):181--192

work page doi:10.1177/0956797616677753 2017
[41]

Denis Paperno, Germ \'a n Kruszewski, Angeliki Lazaridou, Ngoc-Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern \'a ndez. 2016. The lambada dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long paper...

2016
[42]

Julian M Pine, Daniel Freudenthal, Grzegorz Krajewski, and Fernand Gobet. 2013. Do young children have adult-like syntactic categories? Zipf's law and the case of the determiner . Cognition, 127(3):345--360

2013
[43]

Julian M Pine and Helen Martindale. 1996. Syntactic categories in the speech of young children: The case of the determiner. Journal of child language, 23(2):369--395

1996
[44]

Margot Isabella Rozendaal and Anne Edith Baker. 2008. A cross-linguistic investigation of the acquisition of the pragmatics of indefinite and definite reference in two-year-olds. Journal of Child Language , 35(4):773--807

2008
[45]

Rushen Shi. 2023. Functional categories, determiners, prosody and early child grammar. In D. Armoskaite, S. Armoskaite, and M Wiltschko, editors, Oxford handbook of determiners. Oxford University Press

2023
[46]

Dan Slobin. 1985. The crosslinguistic study of language acquisition. Lawrence Erlbaum, Mahwah, NJ

1985
[47]

Jon Sprouse, Beracah Yankama, Sagar Indurkhya, Sandiway Fong, and Robert C Berwick. 2018. Colorless green ideas do sleep furiously: gradient acceptability and the nature of the grammar. The Linguistic Review, 35(3):575--599

2018
[48]

Wilson L Taylor. 1953. ``cloze procedure'': A new tool for measuring readability. Journalism quarterly, 30(4):415--433

1953
[49]

Herbert S Terrace, Laura-Ann Petitto, Richard J Sanders, and Thomas G Bever. 1979. Can an ape create a sentence? Science, 206(4421):891--902

1979
[50]

Anna Theakston, Elena Lieven, Julian Pine, and Caroline Rowland. 2001. https://doi.org/10.1017/S0305000900004608 The role of performance limitations in the acquisition of verb-argument structure: An alternative account . Journal of child language, 28:127--52

work page doi:10.1017/s0305000900004608 2001
[51]

Rosalind Thornton. 2021. Judgments of acceptability, truth, and felicity in child language. The Cambridge handbook of experimental syntax (Cambridge handbooks in language and linguistics), pages 394--420

2021
[52]

Michael Tomasello. 2000. Do young children have adult syntactic competence? Cognition, 74(3):209--253

2000
[53]

Virginia Valian. 1986. Syntactic categories in the speech of young children. Developmental Psychology, 22(4):562

1986
[54]

Virginia Valian. 1991. Syntactic subjects in the early speech of American and Italian children. Cognition, 40(1):21--81

1991
[55]

Virginia Valian, Stephanie Solt, and John Stewart. 2009. Abstract categories or limited-scope formulae? The case of children's determiners . Journal of Child Language, 36(4):743--778

2009
[56]

Héctor Vázquez Martínez, Annika Lea Heuser, Charles Yang, and Jordan Kodner. 2023. https://doi.org/10.18653/v1/2023.genbench-1.4 Evaluating Neural Language Models as Cognitive Models of Language Acquisition . In Proceedings of the 1st GenBench Workshop on ( Benchmarking ) Generalisation in NLP , pages 48--64, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.genbench-1.4 2023
[57]

Qi Wang, Diane Lillo-Martin, Catherine T Best, and Andrea Levitt. 1992. Null subject versus null object: Some evidence from the acquisition of Chinese and English . Language Acquisition, 2(3):221--254

1992
[58]

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R Bowman. 2020. Blimp: The benchmark of linguistic minimal pairs for english. Transactions of the Association for Computational Linguistics, 8:377--392

2020
[59]

Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, and Roger P

Ethan G. Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, and Roger P. Levy. 2023. https://doi.org/10.1162/tacl_a_00612 Testing the predictions of surprisal theory in 11 languages . Transactions of the Association for Computational Linguistics, 11:1451--1470

work page doi:10.1162/tacl_a_00612 2023
[60]

Charles Yang. 2013. https://doi.org/10.1073/pnas.1216803110 Ontogeny and phylogeny of language . Proceedings of the National Academy of Sciences, 110(16):6324--6327

work page doi:10.1073/pnas.1216803110 2013
[61]

Charles Yang. 2016. https://doi.org/10.7551/mitpress/9780262035323.001.0001 The Price of Linguistic Productivity : How Children Learn to Break the Rules of Language . The MIT Press

work page doi:10.7551/mitpress/9780262035323.001.0001 2016
[62]

George Kingsley Zipf. 1949. http://archive.org/details/in.ernet.dli.2015.90211 Human Behavior And The Principle Of Least Effort . Addison-Wesley Press

work page arXiv 1949
[63]

Shuxian Zou, Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2022. https://doi.org/10.18653/v1/2022.findings-acl.54 Cross-modal cloze task: A new task to brain-to-word decoding . In Findings of the Association for Computational Linguistics: ACL 2022, pages 648--657, Dublin, Ireland. Association for Computational Linguistics

work page doi:10.18653/v1/2022.findings-acl.54 2022

[1] [1]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

[3] [3]

Alhama, Ruthe Foushee, Dan Byrne, Allyson Ettinger, Afra Alishahi, and Susan Goldin-Meadow

Raquel G. Alhama, Ruthe Foushee, Dan Byrne, Allyson Ettinger, Afra Alishahi, and Susan Goldin-Meadow. 2024. https://doi.org/10.1073/pnas.2316527121 Using computational modeling to validate the onset of productive determiner–noun combinations in English -learning children . Proceedings of the National Academy of Sciences, 121(50):e2316527121

work page doi:10.1073/pnas.2316527121 2024

[4] [4]

Alhama, Ruthe Foushee, Daniel Byrne, Allyson Ettinger, Susan Goldin-Meadow, and Afra Alishahi

Raquel G. Alhama, Ruthe Foushee, Daniel Byrne, Allyson Ettinger, Susan Goldin-Meadow, and Afra Alishahi. 2023. https://doi.org/10.18653/v1/2023.ijcnlp-main.21 Linguistic Productivity : the Case of Determiners in English . In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia - Pacific C...

work page doi:10.18653/v1/2023.ijcnlp-main.21 2023

[5] [5]

Jean Berko. 1958. The child's learning of English morphology. Word, 14(2--3):150--177

1958

[6] [6]

Lois Bloom. 1970. Language development: Form and function in emerging grammar. MIT Press, Cambridge, MA

1970

[7] [7]

Melissa Bowerman. 1982. Reorganizational process in lexical and syntactic development. In Eric Wanner and Lila R. Gleitman, editors, Language acquisition: The state of the art, pages 319--346. Cambridge University Press, New York

1982

[8] [8]

Roger Brown. 1973. First Language : The Early Stages . Harvard University Press, Cambridge

1973

[9] [9]

Tetsuro Chihara, John Oller, Kelley Weaver, and Mary Anne Chavez-Oller. 1977. Are cloze items sensitive to constraints across sentences? 1. Language learning, 27(1):63--70

1977

[10] [10]

Carol Chomsky. 1969. The acquisition of syntax in children from 5 to 10. MIT Press, Cambridge, MA

1969

[11] [11]

Noam Chomsky. 1955. The logical structure of linguistic theory. Manuscript, Harvard University

1955

[12] [12]

Noam Chomsky. 1957. Syntactic structures. Mouton, The Hague

1957

[13] [13]

Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Jaap Jumelet, Tal Linzen, Aaron Mueller, Suchir Salhan, Raj Sanjay Shah, Alex Warstadt, and Ethan Gotlieb Wilcox. 2026. https://doi.org/10.48550/arXiv.2602.20092 BabyLM Turns 4 and Goes Multilingual : Call for Papers for the 2026 BabyLM Workshop . arXiv preprint. ArXiv:2602.20092 [cs]

work page doi:10.48550/arxiv.2602.20092 2026

[14] [14]

McCarthy, Katharina Kann, Sabrina J

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, G \'e raldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. https://doi.org/10.18653/v1/K18-3001 The C o NLL -- SIGMORPHON 2018 shared task: Universal morphological reinflection ....

work page doi:10.18653/v1/k18-3001 2018

[15] [15]

Mark Davies. 2008. Corpus of Contemporary American English (COCA): 410+ million words, 1990-present. Available at http://www.americancorpus.org/

2008

[16] [16]

Jill G De Villiers and Peter A De Villiers. 1974. Competence and performance in child language: Are children really competent to judge? Journal of Child Language, 1(1):11--22

1974

[17] [17]

Susan M. Ervin. 1964. Imitation and structural change in children's language. In Eric H. Lenneberg, editor, New directions in the study of child language, pages 163--189. MIT Press, Cambridge

1964

[18] [18]

Larry Fenson, Philip S Dale, J Steven Reznick, Elizabeth Bates, Donna J Thal, Stephen J Pethick, Michael Tomasello, Carolyn B Mervis, and Joan Stiles. 1994. Variability in early communicative development. Monographs of the Society for Research in Child Development, pages i--185

1994

[19] [19]

Olivia La Fiandra, Nathalie Fernandez Echeverri, Patrick Shafto, and Naomi H. Feldman. 2025. https://doi.org/10.18653/v1/2025.babylm-main.8 Large Language Models and Children Have Different Learning Trajectories in Determiner Acquisition . In Proceedings of the First BabyLM Workshop , pages 100--108, Suzhou, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.babylm-main.8 2025

[20] [20]

Jill Gilkerson, Jeffrey A Richards, Steven F Warren, Judith K Montgomery, Charles R Greenwood, D Kimbrough Oller, John HL Hansen, and Terrance D Paul. 2017. Mapping the early language environment using all-day recordings and automated analysis. American journal of speech-language pathology, 26(2):248--265

2017

[21] [21]

Talmy Giv \'o n. 1983. Topic continuity in discourse. John Benjamins Publishing Company

1983

[22] [22]

Lila Gleitman and Charles Yang. 2022. Two-year-olds' referential determiners in discourse . In Boston Unversity Conference on Language Development

2022

[23] [24]

Susan Goldin-Meadow and Charles Yang. 2017 b . https://doi.org/10.1016/j.neubiorev.2016.12.016 Statistical evidence that a child can create a combinatorial linguistic system without external linguistic input: Implications for language evolution . Neuroscience & Biobehavioral Reviews, 81:150--157

work page doi:10.1016/j.neubiorev.2016.12.016 2017

[24] [25]

Zellig S. Harris. 1951. Methods in structural linguistics. University of Chicago Press, Chicago

1951

[25] [26]

Matthew Honnibal and Mark Johnson. 2015. https://doi.org/10.18653/v1/D15-1162 An improved non-monotonic transition system for dependency parsing . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373--1378, Lisbon, Portugal. Association for Computational Linguistics

work page doi:10.18653/v1/d15-1162 2015

[26] [27]

Jennifer Hu and Roger Levy. 2023. Prompting is not a substitute for probability measurements in large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5040--5060

2023

[27] [28]

Jacobs, Loïc Grobol, and Alvin Tsang

Cassandra L. Jacobs, Loïc Grobol, and Alvin Tsang. 2024. https://doi.org/10.48550/ARXIV.2410.12057 Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned . arXiv preprint. Version Number: 2

work page doi:10.48550/arxiv.2410.12057 2024

[28] [29]

Carina Kauf, Emmanuele Chersoni, Alessandro Lenci, Evelina Fedorenko, and Anna A Ivanova. 2024. https://doi.org/10.18653/v1/2024.blackboxnlp-1.18 Log probabilities are a reliable estimate of semantic plausibility in base and instruction-tuned language models . In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for N...

work page doi:10.18653/v1/2024.blackboxnlp-1.18 2024

[29] [30]

Erwin Komen, Rosanne Hebing, Ans Van Kemenade, and Bettelou Los. 2014. Quantifying information structure change in english. In Information structure and syntactic change in Germanic and Romance languages , pages 81--110. John Benjamins Publishing Company, Amsterdam

2014

[30] [31]

Nelson Francis

Henry Kučera and W. Nelson Francis. 1967. Computational analysis of present-day American English . Brown University Press, Providence

1967

[31] [32]

Legate and Charles Yang

Julie A. Legate and Charles Yang. 2007. Morphosyntactic learning and the development of tense. Language Acquisition, 14(3):315--344

2007

[32] [33]

Roger Levy. 2008. https://doi.org/10.1016/j.cognition.2007.05.006 Expectation-based syntactic comprehension . Cognition, 106(3):1126--1177

work page doi:10.1016/j.cognition.2007.05.006 2008

[33] [34]

Christopher Lyons. 1999. Definiteness. Cambridge University Press, Cambridge

1999

[34] [35]

Brian MacWhinney. 2000. The CHILDES project: Tools for analyzing talk , 3rd edition. Lawrence Erlbaum, Mahwah, NJ

2000

[35] [36]

Michael Maratsos. 1976. The use of definite and indefinite reference in young children. Cambridge University Press, London

1976

[36] [37]

McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden. 2019. https://doi.org/10.18653/v1/W19-4226 The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflec...

work page doi:10.18653/v1/w19-4226 2019

[37] [38]

Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz

R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz. 2023. https://doi.org/10.1162/tacl_a_00567 How Much Do Language Models Copy From Their Training Data ? Evaluating Linguistic Novelty in Text Generation Using RAVEN . Transactions of the Association for Computational Linguistics, 11:652--670. Place: Cambridge, MA

work page doi:10.1162/tacl_a_00567 2023

[38] [39]

Smith, and Yanai Elazar

William Merrill, Noah A. Smith, and Yanai Elazar. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.800 Evaluating n- Gram Novelty of Language Models Using Rusty - DAWG . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 14459--14473, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.emnlp-main.800 2024

[39] [40]

Meylan, Michael C

Stephan C. Meylan, Michael C. Frank, Brandon C. Roy, and Roger Levy. 2017. https://doi.org/10.1177/0956797616677753 The Emergence of an Abstract Grammatical Category in Children ’s Early Speech . Psychological Science, 28(2):181--192

work page doi:10.1177/0956797616677753 2017

[40] [41]

Denis Paperno, Germ \'a n Kruszewski, Angeliki Lazaridou, Ngoc-Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern \'a ndez. 2016. The lambada dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long paper...

2016

[41] [42]

Julian M Pine, Daniel Freudenthal, Grzegorz Krajewski, and Fernand Gobet. 2013. Do young children have adult-like syntactic categories? Zipf's law and the case of the determiner . Cognition, 127(3):345--360

2013

[42] [43]

Julian M Pine and Helen Martindale. 1996. Syntactic categories in the speech of young children: The case of the determiner. Journal of child language, 23(2):369--395

1996

[43] [44]

Margot Isabella Rozendaal and Anne Edith Baker. 2008. A cross-linguistic investigation of the acquisition of the pragmatics of indefinite and definite reference in two-year-olds. Journal of Child Language , 35(4):773--807

2008

[44] [45]

Rushen Shi. 2023. Functional categories, determiners, prosody and early child grammar. In D. Armoskaite, S. Armoskaite, and M Wiltschko, editors, Oxford handbook of determiners. Oxford University Press

2023

[45] [46]

Dan Slobin. 1985. The crosslinguistic study of language acquisition. Lawrence Erlbaum, Mahwah, NJ

1985

[46] [47]

Jon Sprouse, Beracah Yankama, Sagar Indurkhya, Sandiway Fong, and Robert C Berwick. 2018. Colorless green ideas do sleep furiously: gradient acceptability and the nature of the grammar. The Linguistic Review, 35(3):575--599

2018

[47] [48]

Wilson L Taylor. 1953. ``cloze procedure'': A new tool for measuring readability. Journalism quarterly, 30(4):415--433

1953

[48] [49]

Herbert S Terrace, Laura-Ann Petitto, Richard J Sanders, and Thomas G Bever. 1979. Can an ape create a sentence? Science, 206(4421):891--902

1979

[49] [50]

Anna Theakston, Elena Lieven, Julian Pine, and Caroline Rowland. 2001. https://doi.org/10.1017/S0305000900004608 The role of performance limitations in the acquisition of verb-argument structure: An alternative account . Journal of child language, 28:127--52

work page doi:10.1017/s0305000900004608 2001

[50] [51]

Rosalind Thornton. 2021. Judgments of acceptability, truth, and felicity in child language. The Cambridge handbook of experimental syntax (Cambridge handbooks in language and linguistics), pages 394--420

2021

[51] [52]

Michael Tomasello. 2000. Do young children have adult syntactic competence? Cognition, 74(3):209--253

2000

[52] [53]

Virginia Valian. 1986. Syntactic categories in the speech of young children. Developmental Psychology, 22(4):562

1986

[53] [54]

Virginia Valian. 1991. Syntactic subjects in the early speech of American and Italian children. Cognition, 40(1):21--81

1991

[54] [55]

Virginia Valian, Stephanie Solt, and John Stewart. 2009. Abstract categories or limited-scope formulae? The case of children's determiners . Journal of Child Language, 36(4):743--778

2009

[55] [56]

Héctor Vázquez Martínez, Annika Lea Heuser, Charles Yang, and Jordan Kodner. 2023. https://doi.org/10.18653/v1/2023.genbench-1.4 Evaluating Neural Language Models as Cognitive Models of Language Acquisition . In Proceedings of the 1st GenBench Workshop on ( Benchmarking ) Generalisation in NLP , pages 48--64, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.genbench-1.4 2023

[56] [57]

Qi Wang, Diane Lillo-Martin, Catherine T Best, and Andrea Levitt. 1992. Null subject versus null object: Some evidence from the acquisition of Chinese and English . Language Acquisition, 2(3):221--254

1992

[57] [58]

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R Bowman. 2020. Blimp: The benchmark of linguistic minimal pairs for english. Transactions of the Association for Computational Linguistics, 8:377--392

2020

[58] [59]

Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, and Roger P

Ethan G. Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, and Roger P. Levy. 2023. https://doi.org/10.1162/tacl_a_00612 Testing the predictions of surprisal theory in 11 languages . Transactions of the Association for Computational Linguistics, 11:1451--1470

work page doi:10.1162/tacl_a_00612 2023

[59] [60]

Charles Yang. 2013. https://doi.org/10.1073/pnas.1216803110 Ontogeny and phylogeny of language . Proceedings of the National Academy of Sciences, 110(16):6324--6327

work page doi:10.1073/pnas.1216803110 2013

[60] [61]

Charles Yang. 2016. https://doi.org/10.7551/mitpress/9780262035323.001.0001 The Price of Linguistic Productivity : How Children Learn to Break the Rules of Language . The MIT Press

work page doi:10.7551/mitpress/9780262035323.001.0001 2016

[61] [62]

George Kingsley Zipf. 1949. http://archive.org/details/in.ernet.dli.2015.90211 Human Behavior And The Principle Of Least Effort . Addison-Wesley Press

work page arXiv 1949

[62] [63]

Shuxian Zou, Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2022. https://doi.org/10.18653/v1/2022.findings-acl.54 Cross-modal cloze task: A new task to brain-to-word decoding . In Findings of the Association for Computational Linguistics: ACL 2022, pages 648--657, Dublin, Ireland. Association for Computational Linguistics

work page doi:10.18653/v1/2022.findings-acl.54 2022