Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways

Juan Diego Rodriguez; Kaj Bostrom; Kyle Mahowald; Venkata S Govindarajan

arxiv: 2310.17591 · v1 · submitted 2023-10-26 · 💻 cs.CL

Pith reviewed 2026-05-24 06:33 UTC · model grok-4.3

classification 💻 cs.CL

keywords masked language models · sequence length curriculum · targeted masking · small data pretraining · linguistic benchmarks · music pretraining · model training strategies

The pith

Training masked language models on shorter sequences first yields better performance on linguistic benchmarks than training on longer sequences from the start.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three strategies for pretraining masked language models on limited data to achieve more humanlike linguistic abilities. These include starting with music data, using a sequence length curriculum from short to long, and applying targeted token masking for specific phenomena. Results indicate the short-to-long sequence approach yields better overall performance, music pretraining has small effects, and targeted masking aids only select tasks. This matters for developing efficient training methods when data is scarce, as current models require far more input than humans do.

Core claim

Experiments demonstrate that pretraining masked language models first on shorter sequences and then on longer sequences produces higher performance on the benchmark than training on longer sequences from the start. Initial pretraining on music data yields at most marginal gains. Targeted masking of tokens to address particular benchmark subtasks improves results on those subtasks but not across the board.
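The targeted-masking idea can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that "targeting" means raising the masking probability for a chosen word list. The function name `targeted_mask`, the word list, and both probabilities are hypothetical and are not taken from the paper or its released code.

```python
import random
from typing import List, Optional, Sequence, Set, Tuple

MASK = "[MASK]"


def targeted_mask(
    tokens: Sequence[str],
    target_words: Set[str],
    base_prob: float = 0.15,
    target_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[int]]:
    """Mask tokens for MLM training, masking target words more often.

    Returns the masked token list and the indices that were masked.
    Both probabilities are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random(0)
    masked: List[str] = []
    positions: List[int] = []
    for i, tok in enumerate(tokens):
        # Target words get their own (typically higher) masking probability.
        p = target_prob if tok.lower() in target_words else base_prob
        if rng.random() < p:
            masked.append(MASK)
            positions.append(i)
        else:
            masked.append(tok)
    return masked, positions


# Hypothetical word list standing in for negative polarity items.
NPI_WORDS = {"any", "ever"}
sentence = "nobody has ever seen any of them".split()

# Extreme probabilities are used here only to make the demo deterministic.
masked, positions = targeted_mask(sentence, NPI_WORDS, base_prob=0.0, target_prob=1.0)
```

In a more realistic setting one would keep `base_prob` near the usual masking rate and raise only `target_prob`, so that ordinary tokens are still masked some of the time.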

What carries the argument

A curriculum that begins masked language model training on short sequences before moving to longer sequences.
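A short-to-long curriculum of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `chunk` and `length_curriculum`, the stage lengths, and the epoch counts are all hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def chunk(tokens: Sequence[int], seq_len: int) -> List[List[int]]:
    """Split a flat token stream into consecutive blocks of at most seq_len tokens."""
    return [list(tokens[i:i + seq_len]) for i in range(0, len(tokens), seq_len)]


def length_curriculum(
    tokens: Sequence[int],
    stages: Sequence[Tuple[int, int]],
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (seq_len, block) pairs stage by stage, shortest sequences first.

    `stages` holds (seq_len, epochs) pairs. The values used below are
    hypothetical and are not taken from the paper.
    """
    for seq_len, epochs in sorted(stages):
        for _ in range(epochs):
            for block in chunk(tokens, seq_len):
                yield seq_len, block


# Toy token ids standing in for a tokenized corpus.
toy_tokens = list(range(10))

# Stages are given out of order on purpose; sorting puts the short stage first.
schedule = list(length_curriculum(toy_tokens, stages=[(8, 1), (4, 1)]))
```

Each yielded block would then be passed to the usual masking and training step. The only thing the curriculum changes is the order in which block lengths are seen.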

Load-bearing premise

The benchmark's subtasks reliably capture the humanlike linguistic capabilities targeted by the training strategies.

What would settle it

A replication showing no performance difference or worse results from the short-sequence curriculum on the same benchmark would undermine the central finding.

Figures

Figures reproduced from arXiv: 2310.17591 by Juan Diego Rodriguez, Kaj Bostrom, Kyle Mahowald, Venkata S Govindarajan.

Original abstract

We present Lil-Bevo, our submission to the BabyLM Challenge. We pretrained our masked language models with three ingredients: an initial pretraining with music data, training on shorter sequences before training on longer ones, and masking specific tokens to target some of the BLiMP subtasks. Overall, our baseline models performed above chance, but far below the performance levels of larger LLMs trained on more data. We found that training on short sequences performed better than training on longer sequences. Pretraining on music may help performance marginally, but, if so, the effect seems small. Our targeted Masked Language Modeling augmentation did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks that we were targeting (e.g., Negative Polarity Items). Training performant LLMs on small amounts of data is a difficult but potentially informative task. While some of our techniques showed some promise, more work is needed to explore whether they can improve performance more than the modest gains here. Our code is available at https://github.com/venkatasg/Lil-Bevo and out models at https://huggingface.co/collections/venkatasg/babylm-653591cdb66f4bf68922873a

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Short-sequence training gave the clearest edge here, but the gains stay small and BLiMP may not be the right lens for the humanlike claims.

Short-sequence training came out ahead in this BabyLM submission, but the overall gains from the three strategies stay modest and the evaluation setup has some gaps.

The new part is the specific experimental outcomes on the BabyLM challenge data. The authors apply music pretraining, curriculum from short to long sequences, and targeted masking, then compare against a baseline. They find short sequences better than long, music maybe a small help, and targeted masking helping some BLiMP tasks but not the aggregate. Releasing code and models on Hugging Face is a plus for reproducibility.

The paper does a solid job of running controlled comparisons and being upfront about the limited improvements. It's honest about the difficulty of training performant models on small data.

Where it is softer is the evidence base. The abstract gives directional findings but skips statistical details, hyperparameter values, or error bars, so the claims rest on point estimates that could shift. More importantly, the choice of BLiMP as the main measure is questionable for testing humanlike training. BLiMP focuses on specific syntactic minimal pairs, which may not be sensitive enough to the kinds of input differences the strategies introduce, like music or curriculum order. The paper's own wording already hints at weak effects. If the goal is to explore more humanlike ways, a broader set of evaluations might have strengthened the case.

This paper is mainly for people already following the BabyLM challenge or working on low-resource language modeling. A reader interested in incremental experiments on established benchmarks will find it informative. It is not a big theoretical advance or a new method.

I would send it to peer review. The work is grounded enough with released artifacts to warrant referee time, though it would benefit from more rigorous stats and perhaps discussion of benchmark limitations.
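For readers unfamiliar with minimal-pair evaluation, the scoring rule behind a BLiMP-style accuracy can be sketched in a few lines of Python. A pair counts as correct when the model scores the acceptable sentence above the unacceptable one. The `score` callable is a stand-in: for a masked language model it would typically be something like a pseudo-log-likelihood, but the lookup table and its values below are made up for illustration and are not model outputs.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the acceptable
    sentence receives the strictly higher score."""
    pair_list = list(pairs)
    if not pair_list:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pair_list if score(good) > score(bad))
    return correct / len(pair_list)


# Made-up sentence scores standing in for a real model's output.
toy_scores = {
    "No student has ever left.": -12.0,
    "Every student has ever left.": -15.0,
    "The cats sleep.": -9.0,
    "The cats sleeps.": -8.5,
}
toy_pairs = [
    ("No student has ever left.", "Every student has ever left."),
    ("The cats sleep.", "The cats sleeps."),
]

accuracy = minimal_pair_accuracy(toy_pairs, toy_scores.__getitem__)
```

Because each item is a binary choice, chance performance under this rule is 0.5, which is the reference point for "above chance" in the abstract.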

Referee Report

2 major / 1 minor

Summary. The manuscript presents Lil-Bevo, a BabyLM Challenge submission that explores three strategies for pretraining masked language models on limited data in more humanlike ways: an initial phase on music data, a curriculum progressing from short to longer sequences, and targeted masking during MLM to address specific BLiMP phenomena. The authors report that short-sequence training outperformed longer sequences, music pretraining yielded only marginal gains if any, and targeted masking produced no general improvement but appeared beneficial on certain targeted BLiMP subtasks such as Negative Polarity Items. Overall performance exceeded chance levels but remained well below that of larger LLMs trained on more data; code and models are released.

Significance. If the directional findings hold under more rigorous statistical scrutiny, the work supplies concrete empirical comparisons of human-inspired training interventions within the BabyLM setting and demonstrates the difficulty of achieving strong performance on small data. The public release of code at https://github.com/venkatasg/Lil-Bevo and models on Hugging Face is a clear strength that supports reproducibility and follow-up experiments by the community.

major comments (2)

Abstract (results paragraph): the directional claims that 'training on short sequences performed better than training on longer sequences,' that music pretraining 'may help performance marginally,' and that targeted masking 'did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks' are presented without error bars, standard deviations across runs, p-values, or the number of independent training runs. These omissions are load-bearing for interpreting whether the reported modest or null effects reliably support or refute the three strategies.
Abstract: no hyperparameter values, learning-rate schedules, batch sizes, or exact data-mixture ratios are supplied for the three conditions being compared. Without these details the central empirical comparisons cannot be reproduced or assessed for sensitivity to implementation choices.
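The kind of reporting requested in the first major comment could look like the sketch below, which summarizes each condition by mean, sample standard deviation, and run count. The function name `summarize_runs`, the condition labels, and the scores are hypothetical placeholders. None of the numbers are results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize_runs(
    scores_by_condition: Dict[str, List[float]],
) -> Dict[str, Tuple[float, float, int]]:
    """Return (mean, sample standard deviation, number of runs) per condition.

    With a single run the spread is undefined, so NaN is reported
    rather than a misleading 0.0.
    """
    summary: Dict[str, Tuple[float, float, int]] = {}
    for condition, scores in scores_by_condition.items():
        spread = stdev(scores) if len(scores) > 1 else float("nan")
        summary[condition] = (mean(scores), spread, len(scores))
    return summary


# Placeholder scores, one per hypothetical training seed; not real results.
toy_results = {
    "condition_a": [1.0, 2.0, 3.0],
    "condition_b": [2.5],
}
summary = summarize_runs(toy_results)
```

Reporting the run count next to the spread makes single-run conditions visible at a glance, which is the situation the authors' rebuttal describes.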

minor comments (1)

Abstract: 'out models' should read 'our models'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and the emphasis on statistical transparency and reproducibility. We respond to each major comment below.

Referee: [—] Abstract (results paragraph): the directional claims that 'training on short sequences performed better than training on longer sequences,' that music pretraining 'may help performance marginally,' and that targeted masking 'did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks' are presented without error bars, standard deviations across runs, p-values, or the number of independent training runs. These omissions are load-bearing for interpreting whether the reported modest or null effects reliably support or refute the three strategies.

Authors: We agree that the lack of variability measures or run counts limits interpretation of the modest effects. Each configuration was trained only once owing to the computational budget of the BabyLM challenge. We will revise the abstract to state explicitly that results derive from single runs and to frame the outcomes as directional observations rather than statistically tested effects. revision: yes

Referee: [—] Abstract: no hyperparameter values, learning-rate schedules, batch sizes, or exact data-mixture ratios are supplied for the three conditions being compared. Without these details the central empirical comparisons cannot be reproduced or assessed for sensitivity to implementation choices.

Authors: We acknowledge that the abstract currently omits these details. We will revise it to include the principal hyperparameter settings and data-mixture ratios used across conditions (as specified in the methods section), thereby improving reproducibility within the length constraints of the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical report of training runs and evaluations

full rationale

The paper describes three training strategies (music pretraining, short-to-long sequence curriculum, targeted masking) and reports their effects on BLiMP scores via direct experimental runs. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear; results are presented as observed outcomes of the described procedures without reduction to inputs by construction. The central claims rest on benchmark measurements rather than any self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests entirely on the outcomes of the described training runs and evaluations; no free parameters, ad-hoc axioms, or invented entities are introduced beyond the standard assumptions of masked language modeling and the BabyLM challenge setup.

axioms (1)

domain assumption: The standard masked language modeling objective is a suitable proxy for learning linguistic structure.
Implicit in the choice of pretraining method and evaluation on BLiMP.
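
For reference, the objective this assumption refers to is usually written as below. The notation is an assumption made here for illustration and does not come from the paper: $x$ is a token sequence drawn from the corpus $\mathcal{D}$, $M$ is the randomly chosen set of masked positions, and $\tilde{x}_{M}$ is the sequence with those positions corrupted.

```latex
\mathcal{L}_{\mathrm{MLM}}(\theta)
  = -\,\mathbb{E}_{x \sim \mathcal{D}}\;\mathbb{E}_{M}
    \left[ \sum_{i \in M} \log p_{\theta}\!\left(x_i \mid \tilde{x}_{M}\right) \right]
```

Under this formulation, targeted masking changes only how $M$ is sampled, and a sequence-length curriculum changes only the length of $x$ over the course of training. The loss itself stays the same.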

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We pretrained our masked language models with three ingredients: an initial pretraining with music data, training on shorter sequences before training on longer ones, and masking specific tokens to target some of the BLiMP subtasks."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We found that training on short sequences performed better than training on longer sequences."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Payal Bajaj, Chenyan Xiong, Guolin Ke, Xiaodong Liu, Di He, Saurabh Tiwary, Tie - Yan Liu, Paul Bennett, Xia Song, and Jianfeng Gao. 2022. https://doi.org/10.48550/arXiv.2204.06644 METRO: efficient denoising pretraining of large scale autoencoding language models with model generated signals . CoRR, abs/2204.06644

work page doi:10.48550/arxiv.2204.06644 2022

[4] [4]

Marco Baroni. 2022. https://lingbuzz.net/lingbuzz/006031 On the proper role of linguistically-oriented deep net analysis in linguistic theorizing . In Shalom Lappin, editor, Algebraic systems and the representation of linguistic knowledge, chapter 1, pages 5--22. Taylor and Francis, Abingdon-on-Thames

work page 2022

[5] [5]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. https://doi.org/10.1145/3442188.3445922 On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency , New York. Association for Computer Machinery – ACM

work page doi:10.1145/3442188.3445922 2021

[6] [6]

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. http://arxiv.org/abs/2303.12712 Sparks of artificial general intelligence: Early experiments with gpt-4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

Rosa Cao and Daniel Yamins. 2021. https://arxiv.org/abs/2104.01490 Explanatory models in neuroscience: Part 1--taking mechanistic abstraction seriously . arXiv preprint arXiv:2104.01490

work page arXiv 2021

[8] [8]

Le, and Christopher D

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. https://openreview.net/forum?id=r1xMH1BtvB Electra: Pre-training text encoders as discriminators rather than generators . In International Conference on Learning Representations

work page 2020

[9] [9]

Frank et

The ManyBabies Consortium and Michael C. Frank et. al. 2020. https://doi.org/10.1177/2515245919900809 Quantifying sources of variability in infancy research using the infant-directed-speech preference . Advances in Methods and Practices in Psychological Science, 3(1):24--52

work page doi:10.1177/2515245919900809 2020

[10] [10]

Michael C Frank. 2023. https://doi.org/10.31234/osf.io/wxt69 Large language models as models of human cognition

work page doi:10.31234/osf.io/wxt69 2023

[11] [11]

Richard Futrell and Roger P Levy. 2019. https://aclanthology.org/W19-0106/ Do RNNs learn human-like abstract word order preferences? Proceedings of the Society for Computation in Linguistics, 2(1):50--59

work page 2019

[12] [12]

Alison Gopnik, Andrew N Meltzoff, and Patricia K Kuhl. 1999. https://psycnet.apa.org/record/2000-07101-000 The scientist in the crib: Minds, brains, and how children learn. William Morrow & Co

work page 1999

[13] [13]

Yuxian Gu, Zhengyan Zhang, Xiaozhi Wang, Zhiyuan Liu, and Maosong Sun. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.566 Train no evil: Selective masking for task-guided pre-training . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6966--6974, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.emnlp-main.566 2020

[14] [14]

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. 2019. https://openreview.net/forum?id=r1lYRjC9F7 Enabling factorized piano music modeling and generation with the MAESTRO dataset . In International Conference on Learning Representations

work page 2019

[15] [15]

Taku Kudo and John Richardson. 2018. https://doi.org/10.18653/v1/D18-2012 S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66--71, Brussels, Belgium. Association for Compu...

work page internal anchor Pith review doi:10.18653/v1/d18-2012 2018

[16] [16]

Richard Kunert, Raquel Fern\'andez, and Willem Zuidema. 2011. https://staff.fnwi.uva.nl/r.fernandezrovira/papers/2011/CDS-semdial2011.pdf Adaptation in child directed speech: Evidence from corpora . In Proceedings of the 15th SemDial Workshop on the Semantics and Pragmatics of Dialogue (Los Angelogue), pages 112--119, Los Angeles, California, USA

work page 2011

[17] [17]

Fred Lerdahl. 1996. http://www.jstor.org/stable/40286174 Calculating tonal tension . Music Perception: An Interdisciplinary Journal, 13(3):319--363

work page arXiv 1996

[18] [18]

Tal Linzen and Marco Baroni. 2021. https://doi.org/10.1146/annurev-linguistics-032020-051035 Syntactic S tructure from D eep L earning . Annual Review of Linguistics, 7(1):195--212

work page doi:10.1146/annurev-linguistics-032020-051035 2021

[19] [19]

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. https://doi.org/10.1162/tacl_a_00115 Assessing the ability of LSTM s to learn syntax-sensitive dependencies . Transactions of the Association for Computational Linguistics, 4:521--535

work page doi:10.1162/tacl_a_00115 2016

[20] [20]

Ivanova and Idan Asher Blank and Nancy Kanwisher and Joshua B

Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. 2023. http://arxiv.org/abs/2301.06627 Dissociating language and thought in large language models: a cognitive perspective

[21]

Gary F. Marcus. 1993. https://api.semanticscholar.org/CorpusID:23458757 Negative evidence in language acquisition. Cognition, 46:53--85.

[22]

Aaron Mueller and Tal Linzen. 2023. https://doi.org/10.18653/v1/2023.acl-long.629 How to plant trees in language models: Data and architectural effects on the emergence of syntactic inductive biases. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11237--11252, Toronto, Canada. Ass...

[23]

Howard Nicholas, Patsy M Lightbown, and Nina Spada. 2001. https://onlinelibrary.wiley.com/doi/abs/10.1111/0023-8333.00172 Recasts as feedback to language learners. Language Learning, 51(4):719--758.

[24]

Isabel Papadimitriou and Dan Jurafsky. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.554 Learning Music Helps You Read: Using transfer to study linguistic structure in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6829--6839, Online. Association for Computational Linguistics.

[25]

Steven T Piantadosi. 2023. https://lingbuzz.net/lingbuzz/007180 Modern language models refute Chomsky’s approach to language. Lingbuzz Preprint, lingbuzz/007180.

[26]

Ofir Press, Noah A. Smith, and Mike Lewis. 2021. https://doi.org/10.18653/v1/2021.acl-long.427 Shortformer: Better language modeling using shorter inputs. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5493...

[27]

Nafis Sadeq, Canwen Xu, and Julian McAuley. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.395 InforMask: Unsupervised informative masking for language model pretraining. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5866--5878, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

[28]

Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. https://doi.org/10.18653/v1/2020.acl-main.240 Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699--2712, Online. Association for Computational Linguistics.

[29]

Jessica F Schwab and Casey Lew-Williams. 2016. https://doi.org/10.1002/wcs.1393 Language learning, socioeconomic status, and child-directed speech. Wiley Interdisciplinary Reviews: Cognitive Science, 7(4):264--275.

[30]

Michael Tomasello. 1992. https://api.semanticscholar.org/CorpusID:145799530 The social bases of language acquisition. Social Development, 1:67--87.

[31]

Marten van Schijndel, Aaron Mueller, and Tal Linzen. 2019. https://doi.org/10.18653/v1/D19-1592 Quantity doesn't buy quality syntax with neural language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5831--5...

[32]

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. https://dl.acm.org/doi/abs/10.5555/3454287.3454581 SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Proceedings of the 33rd International Conference on Neural Information Processi...

[33]

Alex Warstadt and Samuel R. Bowman. 2022. http://arxiv.org/abs/2208.07998 What artificial neural networks can tell us about human language acquisition

[34]

Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Gotlieb Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Adina Williams, Bhargavi Paranjabe, Tal Linzen, and Ryan Cotterell. 2023. https://babylm.github.io Findings of the 2023 BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the 2023 BabyLM...

[35]

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2020a. https://doi.org/10.1162/tacl_a_00321 BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377--392.

[36]

Alex Warstadt, Yian Zhang, Xiaocheng Li, Haokun Liu, and Samuel R. Bowman. 2020b. https://doi.org/10.18653/v1/2020.emnlp-main.16 Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 217--235, ...

[37]

Adriana Weisleder and Anne Fernald. 2013. https://www.jstor.org/stable/24539354 Talking to children matters: Early language experience strengthens processing and builds vocabulary. Psychological Science, 24(11):2143--2152.

[38]

Ethan Gotlieb Wilcox, Richard Futrell, and Roger Levy. 2022. https://doi.org/10.1162/ling_a_00491 Using computational models to test syntactic learnability. Linguistic Inquiry, pages 1--88.

[39]

Yian Zhang, Alex Warstadt, Xiaocheng Li, and Samuel R. Bowman. 2021. https://doi.org/10.18653/v1/2021.acl-long.90 When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Paper...
