Language Acquisition Device in Large Language Models

Masato Mita; Ryo Yoshida; Taiga Someya; Yohei Oseki

arXiv: 2605.16758 · v1 · pith:ERUBIIKCnew · submitted 2026-05-16 · 💻 cs.CL

Pith reviewed 2026-05-19 21:41 UTC · model grok-4.3

classification 💻 cs.CL

keywords language acquisition device · pre-pretraining · synthetic languages · formal language · transformer models · syntactic structure · data efficiency

The pith

Pre-pretraining LLMs on MP-STRUCT achieves token efficiency on par with strong baselines while adding resistance to implausible languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes pre-pretraining (PPT) large language models on MP-STRUCT, a synthetic formal language designed to encode the hierarchical composition, feature-based dependencies, and long-distance displacement that the Language Acquisition Device hypothesis attributes to innate human constraints. This pre-pretraining is meant to narrow the model's hypothesis space toward natural-language-like structures from the beginning, with the aim of reducing the data-efficiency gap between LLMs and humans. A 500-step exposure to MP-STRUCT matches the token efficiency of strong baselines such as k-Shuffle Dyck while also making the model resistant to structurally implausible languages such as REVERSE. Simplified variants show that the core component of MP-STRUCT outperforms k-Shuffle Dyck even though it is not definable in C-RASP, a formal bound on transformer expressivity, with functional landmarks that reduce dependency ambiguity emerging as a key driver of success.

Core claim

A brief 500-step PPT with MP-STRUCT matches strong formal-language baselines in token efficiency while additionally imparting a human-like resistance to structurally implausible languages such as REVERSE. MP-STRUCT CORE outperforms k-Shuffle Dyck despite not being definable in C-RASP, indicating that functional landmarks reducing dependency resolution ambiguity are a key driver and that effective PPT design depends on both expressivity and accessibility of dependency resolution.
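For readers unfamiliar with the baseline, k-Shuffle Dyck is commonly described as the interleaving of k independent balanced-bracket languages: each bracket type must be balanced on its own, but brackets of different types may cross. The sketch below checks membership with one counter per bracket type. The token encoding (`"(0"`, `")0"`, and so on) and the function name are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Sequence


def is_k_shuffle_dyck(tokens: Sequence[str], k: int) -> bool:
    """Check membership in k-Shuffle Dyck under an assumed token encoding.

    Assumption: bracket type i opens with "(i" and closes with ")i",
    for 0 <= i < k. This encoding is hypothetical, chosen for readability.
    """
    depth = [0] * k  # one independent counter per bracket type
    for tok in tokens:
        kind, idx = tok[0], int(tok[1:])
        if kind not in "()" or not 0 <= idx < k:
            return False  # unknown symbol or bracket type
        if kind == "(":
            depth[idx] += 1
        else:
            depth[idx] -= 1
            if depth[idx] < 0:
                return False  # a close with no matching open of this type
    # Every bracket type must end balanced.
    return all(d == 0 for d in depth)


# Crossing brackets of different types are allowed in the shuffle language.
print(is_k_shuffle_dyck(["(0", "(1", ")0", ")1"], k=2))  # True
print(is_k_shuffle_dyck([")0", "(0"], k=1))              # False
```

Because the counters never interact, the check needs no stack, which is what distinguishes the shuffle language from ordinary nested Dyck.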

What carries the argument

MP-STRUCT, the formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement via the MERGE, AGREE, and MOVE operations.

If this is right

Short pre-pretraining on structured synthetic data can close part of the data-efficiency gap between LLMs and humans.
In addition to the efficiency gains, models acquire resistance to structurally implausible languages.
PPT language design must prioritize accessible dependency resolution landmarks in addition to raw hierarchical expressivity.
Effective PPT languages need not be both hierarchically expressive and definable in C-RASP.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Pre-training curricula could incorporate more linguistically motivated synthetic data to encourage broader syntactic generalization.
The same approach might be tested on tasks requiring long-distance dependency resolution to measure transfer beyond the reported metrics.
Future variants could vary the density of functional landmarks to isolate their contribution to resistance against implausible languages.

Load-bearing premise

The structural properties built into MP-STRUCT successfully capture the innate constraints of the Language Acquisition Device hypothesis and transfer to improved natural-language behavior in LLMs.

What would settle it

A controlled test in which models pre-pretrained on MP-STRUCT still acquire the REVERSE language at rates comparable to baselines or show no efficiency gain on downstream natural-language tasks would falsify the central claim.
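As a rough illustration of what a REVERSE-style implausible language might look like, the sketch below reverses the token order of a sentence. This is an assumption: the text here does not define REVERSE, and the paper's construction may differ (for example, it could reverse only part of each sentence). `full_reverse` is a hypothetical helper name.

```python
from typing import List


def full_reverse(tokens: List[str]) -> List[str]:
    """Return the tokens in reverse order (an assumed reading of REVERSE)."""
    return tokens[::-1]


sentence = "the dog chased the cat".split()
print(full_reverse(sentence))  # ['cat', 'the', 'chased', 'dog', 'the']
```

A test of the kind described above would train on text transformed this way and compare how quickly MP-STRUCT-pretrained and baseline models reduce their loss on it.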

Figures

Figures reproduced from arXiv: 2605.16758 by Masato Mita, Ryo Yoshida, Taiga Someya, Yohei Oseki.

**Figure 2.** Robustness against semantic perturbation ($\Delta_{\text{sens}} = L_{\text{JW}} - L_{\text{NL}}$). This metric quantifies the performance gap between semantic-free Jabberwocky inputs and natural language, where lower values indicate less reliance on lexical co-occurrence.
Context from the paper: "… as an alternative to high-expressivity baselines for improving token efficiency, and motivate further analysis of the factors underlying these gains (§6). …"

**Figure 3.** Structural selectivity ($\Delta_{\text{sel}} = L_{\text{Imp}} - L_{\text{NL}}$) across three impossible language conditions. Positive values indicate a human-like preference for natural linguistic constraints over impossible distortions.
Context from the paper: "… and JW→JW). We then compare their losses to quantify sensitivity to semantic information. We define sensitivity as $\Delta_{\text{sens}} = L_{\text{JW}} - L_{\text{NL}}$, where $L_{\text{NL}}$ and $L_{\text{JW}}$ denote the losses obtained under the NL→NL and JW→JW co…"

**Figure 4.** Comparison of C4 loss at 25,000 steps across …

**Figure 5.** The trajectory of the training loss.
Context from the paper: "… both conditions share the same primitive operations as MP-STRUCT: one type of recursive structure and four types of functional dependencies (see Appendix G for details). The only factor varied is how these components are organized, allowing us to attribute any difference in efficiency directly to the organization of dependencies rather than to their number or type. …"
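The captions for Figures 2 and 3 define two loss gaps, and both are simple differences against the natural-language loss. The sketch below shows the arithmetic under stated assumptions: the `Losses` container and its field names are hypothetical, and the numbers are placeholders rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Losses:
    """Evaluation losses for one model (hypothetical container)."""
    natural: float      # L_NL: loss on natural language
    jabberwocky: float  # L_JW: loss on semantic-free Jabberwocky input
    impossible: float   # L_Imp: loss on one impossible-language condition


def delta_sens(losses: Losses) -> float:
    """Delta_sens = L_JW - L_NL; lower means less reliance on lexical co-occurrence."""
    return losses.jabberwocky - losses.natural


def delta_sel(losses: Losses) -> float:
    """Delta_sel = L_Imp - L_NL; positive means a preference for natural structure."""
    return losses.impossible - losses.natural


# Placeholder numbers for illustration only; they are not results from the paper.
example = Losses(natural=3.0, jabberwocky=3.5, impossible=4.25)
print(delta_sens(example))  # 0.5
print(delta_sel(example))   # 1.25
```

Since Figure 3 reports three impossible-language conditions, `delta_sel` would be computed once per condition with the corresponding loss.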

Original abstract

Large Language Models (LLMs) remain substantially less data-efficient than humans. Pre-pretraining (PPT) on synthetic languages has been proposed to close this gap, with prior work emphasizing highly expressive formal languages such as $k$-Shuffle Dyck. Inspired by the Language Acquisition Device (LAD) hypothesis, which posits that innate constraints preemptively restrict the learner's hypothesis space to natural-language-like structure, we propose LAD-inspired PPT: pre-pretraining on MP-STRUCT, a formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement via MERGE, AGREE, and MOVE. A brief 500-step PPT with MP-STRUCT matches strong formal-language baselines in token efficiency while additionally imparting a human-like resistance to structurally implausible languages (e.g., REVERSE). Analyzing simplified variants, we find that MP-STRUCT CORE outperforms $k$-Shuffle Dyck despite not being definable in C-RASP (a formal bound on transformer expressivity), challenging the prior hypothesis that effective PPT languages must be both hierarchically expressive and circuit-theoretically learnable. We show that functional landmarks, which reduce dependency resolution ambiguity, are a key driver, suggesting that effective PPT design depends not only on expressivity but also on the accessibility of dependency resolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MP-STRUCT gives a fresh linguistically grounded synthetic language for pre-pretraining that holds its own on formal tasks and pushes back on C-RASP requirements, but the transfer to real language acquisition remains unshown.

The main point is that this paper constructs MP-STRUCT around the operations MERGE, AGREE, and MOVE, then uses a short 500-step pre-pretraining run to match k-Shuffle Dyck baselines on token efficiency while adding resistance to structurally implausible languages such as REVERSE. It also shows that the core version beats the baseline even though it is not definable in C-RASP, which undercuts the earlier assumption that effective PPT languages must be both expressive and circuit-learnable in that specific sense. The analysis of simplified variants is useful because it isolates functional landmarks as a key driver of better dependency handling.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes pre-pretraining LLMs on MP-STRUCT, a formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement via MERGE, AGREE, and MOVE operations drawn from the Minimalist Program. It reports that a brief 500-step PPT on MP-STRUCT matches the token efficiency of strong baselines such as k-Shuffle Dyck on synthetic language learning while additionally conferring resistance to structurally implausible languages (e.g., REVERSE). Analysis of simplified variants identifies functional landmarks as a key driver of dependency resolution and shows that MP-STRUCT CORE outperforms k-Shuffle Dyck despite not being definable in C-RASP, thereby challenging the hypothesis that effective PPT languages must be both hierarchically expressive and circuit-theoretically learnable.

Significance. If the synthetic results hold and the design principles generalize, the work could meaningfully advance data-efficient LLM training by providing a linguistically motivated inductive bias. The explicit challenge to the necessity of C-RASP definability for PPT effectiveness is a substantive contribution to understanding transformer limitations, and the role of functional landmarks offers a concrete, testable principle for future PPT language design. Credit is due for focusing on resistance metrics in addition to efficiency and for grounding the proposal in specific Minimalist operations rather than generic hierarchy.

major comments (2)

[Abstract and §4 (Experiments)] The central efficiency and resistance claims rest on synthetic learning curves, yet the manuscript provides no quantitative transfer results (e.g., perplexity on natural text or GLUE-style scores) comparing MP-STRUCT-pretrained models against the k-Shuffle Dyck baseline on actual natural language. This leaves the LAD-inspired interpretation (that MP-STRUCT successfully instantiates innate constraints and improves natural-language acquisition) an untested extrapolation.
[§5 (Analysis of variants)] The claim that MP-STRUCT CORE outperforms k-Shuffle Dyck while not being C-RASP definable is load-bearing for the challenge to prior PPT hypotheses. The manuscript must supply the explicit argument or reduction showing why MP-STRUCT CORE lies outside C-RASP, together with controls confirming that the performance difference is not attributable to differences in training hyperparameters or landmark density.

minor comments (2)

[§3 (MP-STRUCT definition)] The generation rules and feature inventory for MP-STRUCT and its CORE variant should be stated more formally (e.g., as a context-free or mildly context-sensitive grammar) to support reproducibility.
[Figures 1–3] Figure captions and axis labels for learning curves should explicitly state the number of random seeds, error-bar computation method, and exact token budgets used for each condition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and insightful report. We provide point-by-point responses to the major comments below.

Referee: [Abstract and §4 (Experiments)] The central efficiency and resistance claims rest on synthetic learning curves, yet the manuscript provides no quantitative transfer results (e.g., perplexity on natural text or GLUE-style scores) comparing MP-STRUCT-pretrained models against the k-Shuffle Dyck baseline on actual natural language. This leaves the LAD-inspired interpretation (that MP-STRUCT successfully instantiates innate constraints and improves natural-language acquisition) an untested extrapolation.

Authors: The experiments in the manuscript are deliberately designed within a synthetic framework to allow for controlled measurement of efficiency and resistance to implausible structures, which directly tests the proposed inductive bias from the Minimalist Program operations. We do not claim direct improvements on natural language benchmarks in this work, as such transfer would require separate evaluation protocols and could be influenced by many factors beyond the PPT language. The LAD-inspired interpretation is supported by the observed resistance metrics, which mimic human preferences for plausible structures. We will revise the manuscript to explicitly state that natural language transfer remains an important direction for future research and to temper the interpretation accordingly.
Revision: partial

Referee: [§5 (Analysis of variants)] The claim that MP-STRUCT CORE outperforms k-Shuffle Dyck while not being C-RASP definable is load-bearing for the challenge to prior PPT hypotheses. The manuscript must supply the explicit argument or reduction showing why MP-STRUCT CORE lies outside C-RASP, together with controls confirming that the performance difference is not attributable to differences in training hyperparameters or landmark density.

Authors: We will add to §5 an explicit reduction or argument establishing that MP-STRUCT CORE is not C-RASP definable, leveraging the presence of feature checking and displacement operations that cannot be captured within the constant-depth circuit constraints of C-RASP. Furthermore, we will include additional experimental controls where training hyperparameters are matched across models and landmark density is normalized to isolate the contribution of the core structural features.
Revision: yes

Circularity Check

0 steps flagged

No circularity in the LAD-inspired PPT derivation

Inspection of the abstract and the described experiments reveals no self-definitional steps, no fitted inputs presented as predictions, and no load-bearing self-citations. MP-STRUCT encodes specific linguistic properties via MERGE, AGREE, and MOVE, and the results on token efficiency and resistance to REVERSE are presented as empirical outcomes of pre-pretraining, benchmarked against external baselines such as k-Shuffle Dyck. The analysis of variants and functional landmarks adds independent content, so the claims do not reduce to their inputs by construction. The derivation chain is self-contained relative to the synthetic-task benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the LAD hypothesis as a domain assumption and introduces MP-STRUCT as a new entity without independent evidence outside the paper's experiments.

axioms (1)

Domain assumption: The Language Acquisition Device hypothesis posits that innate constraints restrict the learner's hypothesis space to natural-language-like structures.
Invoked to motivate the design of MP-STRUCT as encoding hierarchical composition, feature-based dependencies, and long-distance displacement.

invented entities (1)

MP-STRUCT (no independent evidence)
Purpose: Formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement via MERGE, AGREE, and MOVE.
Newly constructed for this work to test LAD-inspired pre-pretraining.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel · unclear
Paper passage: MP-STRUCT generates sequences encoding hierarchical composition, feature-based dependencies, and long-distance displacement via MERGE, AGREE, and MOVE

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction · unclear
Paper passage: MP-STRUCT CORE outperforms k-Shuffle Dyck despite not being definable in C-RASP

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

109 extracted references · 109 canonical work pages

[1]

Noam Chomsky , abstract =. On. Information and Control , volume =. 1959 , issn =. doi:https://doi.org/10.1016/S0019-9958(59)90362-6 , url =

work page doi:10.1016/s0019-9958(59)90362-6 1959
[2]

1995 , publisher=

The Minimalist Program , author=. 1995 , publisher=

work page 1995
[3]

Step by Step: Essays on Minimalist Syntax in Honor of Howard Lasnik , editor =

Noam Chomsky , title =. Step by Step: Essays on Minimalist Syntax in Honor of Howard Lasnik , editor =

work page
[4]

Derivation by Phase , booktitle =

Chomsky, Noam , isbn =. Derivation by Phase , booktitle =. 2001 , month =. doi:10.7551/mitpress/4056.003.0004 , url =

work page doi:10.7551/mitpress/4056.003.0004 2001
[5]

Jardine and Dwight L

Chomsky, Noam , isbn =. Beyond Explanatory Adequacy , booktitle =. 2004 , month =. doi:10.1093/oso/9780195171976.003.0004 , url =

work page doi:10.1093/oso/9780195171976.003.0004 2004
[6]

Colorless Green Recurrent Networks Dream Hierarchically

Gulordava, Kristina and Bojanowski, Piotr and Grave, Edouard and Linzen, Tal and Baroni, Marco. Colorless Green Recurrent Networks Dream Hierarchically. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1108

work page doi:10.18653/v1/n18-1108 2018
[7]

A is B" fail to learn

Berglund, Lukas and Tong, Meg and Kaufmann, Maximilian and Balesni, Mikita and Stickland, Asa and Korbak, Tomek and Evans, Owain , booktitle =. The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" , url =

work page
[8]

and Sanyal, Soumya and Welleck, Sean and Ren, Xiang and Ettinger, Allyson and Harchaoui, Zaid and Choi, Yejin , title =

Dziri, Nouha and Lu, Ximing and Sclar, Melanie and Li, Xiang Lorraine and Jiang, Liwei and Lin, Bill Yuchen and West, Peter and Bhagavatula, Chandra and Le Bras, Ronan and Hwang, Jena D. and Sanyal, Soumya and Welleck, Sean and Ren, Xiang and Ettinger, Allyson and Harchaoui, Zaid and Choi, Yejin , title =. Proceedings of the 37th International Conference ...

work page 2023
[9]

International Conference on Learning Representations , year=

Are Transformers universal approximators of sequence-to-sequence functions? , author=. International Conference on Learning Representations , year=

work page
[10]

2009 , publisher=

Computational Complexity: A Modern Approach , author=. 2009 , publisher=

work page 2009
[11]

, title =

Alur, Rajeev and Madhusudan, P. , title =. Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing , pages =. 2004 , isbn =. doi:10.1145/1007352.1007390 , abstract =

work page doi:10.1145/1007352.1007390 2004
[12]

Language , volume=

Structure dependence in grammar formation , author=. Language , volume=. 1987 , publisher=

work page 1987
[13]

Shieber , doi =

Stuart M. Shieber , doi =. Evidence Against the Context-Freeness of Natural Language , volume =. Linguistics and Philosophy , number =

work page
[14]

2020 , eprint=

Scaling Laws for Neural Language Models , author=. 2020 , eprint=

work page 2020
[15]

Selected Papers from the Third International Conference, on Logical Aspects of Computational Linguistics , pages =

Michaelis, Jens , title =. Selected Papers from the Third International Conference, on Logical Aspects of Computational Linguistics , pages =. 1998 , isbn =

work page 1998
[16]

, editor=

Joshi, Aravind K. , editor=. Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions? , booktitle=. 1985 , pages=

work page 1985
[17]

Lewis Carroll , title =

work page
[18]

2016 , eprint=

Pointer Sentinel Mixture Models , author=. 2016 , eprint=

work page 2016
[19]

Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J

Colin Raffel and Noam M. Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , journal=. Exploring the. 2019 , volume=

work page 2019
[20]

Mission: Impossible Language Models

Kallini, Julie and Papadimitriou, Isabel and Futrell, Richard and Mahowald, Kyle and Potts, Christopher. Mission: Impossible Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.787

work page doi:10.18653/v1/2024.acl-long.787 2024
[21]

L earning M usic H elps Y ou R ead: U sing Transfer to Study Linguistic Structure in Language Models

Papadimitriou, Isabel and Jurafsky, Dan. L earning M usic H elps Y ou R ead: U sing Transfer to Study Linguistic Structure in Language Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.554

work page doi:10.18653/v1/2020.emnlp-main.554 2020
[22]

JBL i MP : J apanese Benchmark of Linguistic Minimal Pairs

Someya, Taiga and Oseki, Yohei. JBL i MP : J apanese Benchmark of Linguistic Minimal Pairs. Findings of the Association for Computational Linguistics: EACL 2023. 2023. doi:10.18653/v1/2023.findings-eacl.117

work page doi:10.18653/v1/2023.findings-eacl.117 2023
[23]

Thirty-seventh Conference on Neural Information Processing Systems , year=

A Logic for Expressing Log-Precision Transformers , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page
[24]

First Conference on Language Modeling , year=

Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers , author=. First Conference on Language Modeling , year=

work page
[25]

Pretraining with Artificial Language: Studying Transferable Knowledge in Language Models

Ri, Ryokan and Tsuruoka, Yoshimasa. Pretraining with Artificial Language: Studying Transferable Knowledge in Language Models. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.504

work page doi:10.18653/v1/2022.acl-long.504 2022
[26]

Injecting structural hints: Using language models to study inductive biases in language learning

Papadimitriou, Isabel and Jurafsky, Dan. Injecting structural hints: Using language models to study inductive biases in language learning. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.563

work page doi:10.18653/v1/2023.findings-emnlp.563 2023
[27]

Developmentally-plausible Working Memory Shapes a Critical Period for Language Acquisition

Mita, Masato and Yoshida, Ryo and Oseki, Yohei. Developmentally-plausible Working Memory Shapes a Critical Period for Language Acquisition. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.462

work page doi:10.18653/v1/2025.acl-long.462 2025
[28]

and Tao, Dacheng , title =

Gou, Jianping and Yu, Baosheng and Maybank, Stephen J. and Tao, Dacheng , title =. International Journal of Computer Vision , volume =. 2021 , doi =

work page 2021
[29]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , year =

Fu, Yao and Peng, Hao and Chen, Hao and Sabharwal, Ashish and Khot, Tushar , title =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , year =

work page
[30]

and Schuurmans, Dale and Chi, Ed H

Hsieh, Cheng-Yu and Wei, Jason and Li, Xuezhi and Zhou, Denny and Le, Quoc V. and Schuurmans, Dale and Chi, Ed H. , title =. Proceedings of the 40th International Conference on Machine Learning (ICML) , year =

work page
[31]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

Shridhara, Shreyas and Wang, Bailin and Wei, Jason and Zhou, Denny and Lin, Xi Victoria , title =. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

work page 2023
[32]

Corder, S. P. , title =. International Review of Applied Linguistics , volume =

work page
[33]

International Review of Applied Linguistics , volume =

Selinker, Larry , title =. International Review of Applied Linguistics , volume =

work page
[34]

, title =

Krashen, Stephen D. , title =

work page
[35]

Input in Second Language Acquisition , editor =

Swain, Merrill , title =. Input in Second Language Acquisition , editor =

work page
[36]

Principle and Practice in Applied Linguistics: Studies in Honor of H

Swain, Merrill , title =. Principle and Practice in Applied Linguistics: Studies in Honor of H. G. Widdowson , editor =

work page
[37]

Sociocultural Theory and Second Language Learning , editor =

Swain, Merrill , title =. Sociocultural Theory and Second Language Learning , editor =

work page
[38]

Cognitive and Affective Aspects of Language Learning , editor =

Schmidt, Richard , title =. Cognitive and Affective Aspects of Language Learning , editor =

work page
[39]

Studies in Second Language Acquisition , volume =

Ellis, Rod and Loewen, Shawn and Erlam, Rosemary , title =. Studies in Second Language Acquisition , volume =

work page
[40]

1996 , url=

The Role of the Linguistic Environment in Second Language Acquisition , author=. 1996 , url=

work page 1996
[41]

Long , title =

Michael H. Long , title =. Applied Linguistics , volume =. 1983 , month =. doi:10.1093/applin/4.2.126 , url =

work page doi:10.1093/applin/4.2.126 1983
[42]

Long , title =

Michael H. Long , title =. Annals of the New York Academy of Sciences , volume =. doi:https://doi.org/10.1111/j.1749-6632.1981.tb42014.x , url =. https://nyaspubs.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1749-6632.1981.tb42014.x , year =

work page doi:10.1111/j.1749-6632.1981.tb42014.x 1981
[43]

International Conference on Machine Learning , pages=

Pythia: A suite for analyzing large language models across training and scaling , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[44]

The Twelfth International Conference on Learning Representations , year=

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes , author=. The Twelfth International Conference on Learning Representations , year=

work page
[45]

Autobiographical Notes,

Stabler, Edward P. , isbn =. Computational Perspectives on Minimalism , booktitle =. 2011 , month =. doi:10.1093/oxfordhb/9780199549368.013.0027 , url =

work page doi:10.1093/oxfordhb/9780199549368.013.0027 2011
[46]

Derivational Minimalism

Stabler, Edward , year =. Derivational Minimalism. , volume =

work page
[47]

and Petty, Jackson and Shi, Chuan and Merrill, William and Linzen, Tal

Hu, Michael Y. and Petty, Jackson and Shi, Chuan and Merrill, William and Linzen, Tal. Between Circuits and C homsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.478

work page doi:10.18653/v1/2025.acl-long.478 2025
[48]

2025 , eprint=

Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models , author=. 2025 , eprint=

work page 2025
[49]

2025 , eprint=

How Linguistics Learned to Stop Worrying and Love the Language Models , author=. 2025 , eprint=

work page 2025
[50]

Annual Review of Linguistics , volume =

Kemp, Charles and Xu, Yang and Regier, Terry , title =. Annual Review of Linguistics , volume =. 2018 , doi =. https://doi.org/10.1146/annurev-linguistics-011817-045406 , abstract =

work page doi:10.1146/annurev-linguistics-011817-045406 2018
[51]

and Gibson, Edward A

Fedorenko, Evelina and Piantadosi, Steven T. and Gibson, Edward A. F. , journal =. Language is primarily a tool for communication rather than thought , volume =. 2024 , doi =

work page 2024
[52]

and Chater, Nick , year =

Christiansen, Morten H. and Chater, Nick , year =. doi:10.1017/S0140525X1500031X , journal =

work page doi:10.1017/s0140525x1500031x
[53]

1949 , publisher =

Human behavior and the principle of least effort , author =. 1949 , publisher =

work page 1949
[54]

Florian and Tily, Harry , title =

Jaeger, T. Florian and Tily, Harry , title =. WIREs Cognitive Science , volume =. doi:https://doi.org/10.1002/wcs.126 , url =. https://wires.onlinelibrary.wiley.com/doi/pdf/10.1002/wcs.126 , abstract =

work page doi:10.1002/wcs.126
[55]

1982 , isbn =

Marr, David , title =. 1982 , isbn =

work page 1982
[56]

Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner , journal =

Emmanuel Dupoux , keywords =. Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner , journal =. 2018 , issn =. doi:https://doi.org/10.1016/j.cognition.2017.11.008 , url =

work page doi:10.1016/j.cognition.2017.11.008 2018
[57]

Algebraic Structures in Natural Language , pages=

What artificial neural networks can tell us about human language acquisition , author=. Algebraic Structures in Natural Language , pages=. 2022 , publisher=

work page 2022
[58]

Miyu Oba and Tatsuki Kuribayashi and Hiroki Ouchi and Taro Watanabe , booktitle =

work page
[59]

and Lambon Ralph, Matthew A

Ellis, Andrew W. and Lambon Ralph, Matthew A. , journal =. 2000 , url=

work page 2000
[60]

and Zevin, Jason D

Seidenberg, Mark S. and Zevin, Jason D. , booktitle =. 2006 , publisher =

work page 2006
[61]

and Tenenbaum, Joshua B

Hartshorne, Joshua K. and Tenenbaum, Joshua B. and Pinker, Steven , journal =

work page
[62]

and Fischer, Susan D

Mayberry, Rachel I. and Fischer, Susan D. , doi =. Memory & Cognition , language =

work page
[63]

Transactions of the Association for Computational Linguistics , volume =

Constantinescu, Ionut and Pimentel, Tiago and Cotterell, Ryan and Warstadt, Alex , title =. Transactions of the Association for Computational Linguistics , volume =. 2025 , month =. doi:10.1162/tacl_a_00725 , url =

work page doi:10.1162/tacl_a_00725 2025
[64]

Brain , number =

Penfield, Wilder , doi =. Brain , number =

work page
[65]

2023 , eprint=

LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=

work page 2023
[66]

Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin L...

work page 2020
[67]

Xiang, Beilei and Yang, Changbing and Li, Yu and Warstadt, Alex and Kann, Katharina. CLiMP: A Benchmark for Chinese Language Model Evaluation. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021. doi:10.18653/v1/2021.eacl-main.242
[68]

RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019.
[69]

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Łukasz and Polosukhin, Illia. Attention is All you Need.
[70]

Laurens van der Maaten and Geoffrey Hinton. Journal of Machine Learning Research.
[71]

Feng, Steven Y. and Goodman, Noah and Frank, Michael. Is Child-Directed Speech Effective Training Data for Language Models? Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.1231
[72]

Diehl Martinez, Richard and Goriely, Zébulon and Caines, Andrew and Buttery, Paula and Beinborn, Lisa. Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.344
[73]

Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.
[74]

GPT-4 Technical Report. 2023.
[75]

Genie: A Psycholinguistic Study of a Modern-day "wild Child". 1977.
[76]

Victoria Fromkin and Stephen Krashen and Susan Curtiss and David Rigler and Marilyn Rigler. The development of language in Genie: a case of language acquisition beyond the “critical period”. 1974. doi:10.1016/0093-934X(74)90027-3
[77]

Haga, Akari and Sugawara, Saku and Fukatsu, Akiyo and Oba, Miyu and Ouchi, Hiroki and Watanabe, Taro and Oseki, Yohei. Modeling Overregularization in Children with Small Language Models. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.865
[78]

Macwhinney, Brian. The CHILDES project: tools for analyzing talk. Child Language Teaching and Therapy.
[79]

Patkowski, Mark S. Language Learning. 1980. doi:10.1111/j.1467-1770.1980.tb00328.x
[80]

Sonia Tahta and Margaret Wood and Kate Loewenthal. Language and Speech. 1981. doi:10.1177/002383098102400306
