Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways
Pith reviewed 2026-05-24 06:33 UTC · model grok-4.3
The pith
Training on shorter sequences first improves masked language model performance on linguistic benchmarks compared to longer sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments demonstrate that pretraining masked language models first on shorter sequences and then on longer sequences produces higher performance on the benchmark than training on longer sequences from the start. Initial pretraining on music data yields at most marginal gains. Targeted masking of tokens to address particular benchmark subtasks improves results on those subtasks but not across the board.
What carries the argument
A curriculum that begins masked language model training on short sequences before moving to longer sequences.
Load-bearing premise
The benchmark's subtasks reliably capture the humanlike linguistic capabilities targeted by the training strategies.
What would settle it
A replication showing no performance difference or worse results from the short-sequence curriculum on the same benchmark would undermine the central finding.
Figures
read the original abstract
We present Lil-Bevo, our submission to the BabyLM Challenge. We pretrained our masked language models with three ingredients: an initial pretraining with music data, training on shorter sequences before training on longer ones, and masking specific tokens to target some of the BLiMP subtasks. Overall, our baseline models performed above chance, but far below the performance levels of larger LLMs trained on more data. We found that training on short sequences performed better than training on longer sequences.Pretraining on music may help performance marginally, but, if so, the effect seems small. Our targeted Masked Language Modeling augmentation did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks that we were targeting (e.g., Negative Polarity Items). Training performant LLMs on small amounts of data is a difficult but potentially informative task. While some of our techniques showed some promise, more work is needed to explore whether they can improve performance more than the modest gains here. Our code is available at https://github.com/venkatasg/Lil-Bevo and out models at https://huggingface.co/collections/venkatasg/babylm-653591cdb66f4bf68922873a
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Lil-Bevo, a BabyLM Challenge submission that explores three strategies for pretraining masked language models on limited data in more humanlike ways: an initial phase on music data, a curriculum progressing from short to longer sequences, and targeted masking during MLM to address specific BLiMP phenomena. The authors report that short-sequence training outperformed longer sequences, music pretraining yielded only marginal gains if any, and targeted masking produced no general improvement but appeared beneficial on certain targeted BLiMP subtasks such as Negative Polarity Items. Overall performance exceeded chance levels but remained well below that of larger LLMs trained on more data; code and models are released.
Significance. If the directional findings hold under more rigorous statistical scrutiny, the work supplies concrete empirical comparisons of human-inspired training interventions within the BabyLM setting and demonstrates the difficulty of achieving strong performance on small data. The public release of code at https://github.com/venkatasg/Lil-Bevo and models on Hugging Face is a clear strength that supports reproducibility and follow-up experiments by the community.
major comments (2)
- Abstract (results paragraph): the directional claims that 'training on short sequences performed better than training on longer sequences,' that music pretraining 'may help performance marginally,' and that targeted masking 'did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks' are presented without error bars, standard deviations across runs, p-values, or the number of independent training runs. These omissions are load-bearing for interpreting whether the reported modest or null effects reliably support or refute the three strategies.
- Abstract: no hyperparameter values, learning-rate schedules, batch sizes, or exact data-mixture ratios are supplied for the three conditions being compared. Without these details the central empirical comparisons cannot be reproduced or assessed for sensitivity to implementation choices.
minor comments (1)
- Abstract: 'out models' should read 'our models'.
Simulated Author's Rebuttal
We thank the referee for the careful review and the emphasis on statistical transparency and reproducibility. We respond to each major comment below.
read point-by-point responses
-
Referee: [—] Abstract (results paragraph): the directional claims that 'training on short sequences performed better than training on longer sequences,' that music pretraining 'may help performance marginally,' and that targeted masking 'did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks' are presented without error bars, standard deviations across runs, p-values, or the number of independent training runs. These omissions are load-bearing for interpreting whether the reported modest or null effects reliably support or refute the three strategies.
Authors: We agree that the lack of variability measures or run counts limits interpretation of the modest effects. Each configuration was trained only once owing to the computational budget of the BabyLM challenge. We will revise the abstract to state explicitly that results derive from single runs and to frame the outcomes as directional observations rather than statistically tested effects. revision: yes
-
Referee: [—] Abstract: no hyperparameter values, learning-rate schedules, batch sizes, or exact data-mixture ratios are supplied for the three conditions being compared. Without these details the central empirical comparisons cannot be reproduced or assessed for sensitivity to implementation choices.
Authors: We acknowledge that the abstract currently omits these details. We will revise it to include the principal hyperparameter settings and data-mixture ratios used across conditions (as specified in the methods section), thereby improving reproducibility within the length constraints of the abstract. revision: yes
Circularity Check
No circularity: purely empirical report of training runs and evaluations
full rationale
The paper describes three training strategies (music pretraining, short-to-long sequence curriculum, targeted masking) and reports their effects on BLiMP scores via direct experimental runs. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear; results are presented as observed outcomes of the described procedures without reduction to inputs by construction. The central claims rest on benchmark measurements rather than any self-referential logic.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard masked language modeling objective is a suitable proxy for learning linguistic structure
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We pretrained our masked language models with three ingredients: an initial pretraining with music data, training on shorter sequences before training on longer ones, and masking specific tokens to target some of the BLiMP subtasks.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We found that training on short sequences performed better than training on longer sequences.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...
-
[2]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
-
[3]
Payal Bajaj, Chenyan Xiong, Guolin Ke, Xiaodong Liu, Di He, Saurabh Tiwary, Tie - Yan Liu, Paul Bennett, Xia Song, and Jianfeng Gao. 2022. https://doi.org/10.48550/arXiv.2204.06644 METRO: efficient denoising pretraining of large scale autoencoding language models with model generated signals . CoRR, abs/2204.06644
-
[4]
Marco Baroni. 2022. https://lingbuzz.net/lingbuzz/006031 On the proper role of linguistically-oriented deep net analysis in linguistic theorizing . In Shalom Lappin, editor, Algebraic systems and the representation of linguistic knowledge, chapter 1, pages 5--22. Taylor and Francis, Abingdon-on-Thames
work page 2022
-
[5]
Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. https://doi.org/10.1145/3442188.3445922 On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency , New York. Association for Computer Machinery – ACM
-
[6]
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. http://arxiv.org/abs/2303.12712 Sparks of artificial general intelligence: Early experiments with gpt-4
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [7]
-
[8]
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. https://openreview.net/forum?id=r1xMH1BtvB Electra: Pre-training text encoders as discriminators rather than generators . In International Conference on Learning Representations
work page 2020
-
[9]
The ManyBabies Consortium and Michael C. Frank et. al. 2020. https://doi.org/10.1177/2515245919900809 Quantifying sources of variability in infancy research using the infant-directed-speech preference . Advances in Methods and Practices in Psychological Science, 3(1):24--52
-
[10]
Michael C Frank. 2023. https://doi.org/10.31234/osf.io/wxt69 Large language models as models of human cognition
-
[11]
Richard Futrell and Roger P Levy. 2019. https://aclanthology.org/W19-0106/ Do RNNs learn human-like abstract word order preferences? Proceedings of the Society for Computation in Linguistics, 2(1):50--59
work page 2019
-
[12]
Alison Gopnik, Andrew N Meltzoff, and Patricia K Kuhl. 1999. https://psycnet.apa.org/record/2000-07101-000 The scientist in the crib: Minds, brains, and how children learn. William Morrow & Co
work page 1999
-
[13]
Yuxian Gu, Zhengyan Zhang, Xiaozhi Wang, Zhiyuan Liu, and Maosong Sun. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.566 Train no evil: Selective masking for task-guided pre-training . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6966--6974, Online. Association for Computational Linguistics
-
[14]
Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. 2019. https://openreview.net/forum?id=r1lYRjC9F7 Enabling factorized piano music modeling and generation with the MAESTRO dataset . In International Conference on Learning Representations
work page 2019
-
[15]
Taku Kudo and John Richardson. 2018. https://doi.org/10.18653/v1/D18-2012 S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66--71, Brussels, Belgium. Association for Compu...
work page internal anchor Pith review doi:10.18653/v1/d18-2012 2018
-
[16]
Richard Kunert, Raquel Fern\'andez, and Willem Zuidema. 2011. https://staff.fnwi.uva.nl/r.fernandezrovira/papers/2011/CDS-semdial2011.pdf Adaptation in child directed speech: Evidence from corpora . In Proceedings of the 15th SemDial Workshop on the Semantics and Pragmatics of Dialogue (Los Angelogue), pages 112--119, Los Angeles, California, USA
work page 2011
- [17]
-
[18]
Tal Linzen and Marco Baroni. 2021. https://doi.org/10.1146/annurev-linguistics-032020-051035 Syntactic S tructure from D eep L earning . Annual Review of Linguistics, 7(1):195--212
work page doi:10.1146/annurev-linguistics-032020-051035 2021
-
[19]
Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. https://doi.org/10.1162/tacl_a_00115 Assessing the ability of LSTM s to learn syntax-sensitive dependencies . Transactions of the Association for Computational Linguistics, 4:521--535
-
[20]
Ivanova and Idan Asher Blank and Nancy Kanwisher and Joshua B
Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. 2023. http://arxiv.org/abs/2301.06627 Dissociating language and thought in large language models: a cognitive perspective
-
[21]
Gary F. Marcus. 1993. https://api.semanticscholar.org/CorpusID:23458757 Negative evidence in language acquisition . Cognition, 46:53--85
work page 1993
-
[22]
Aaron Mueller and Tal Linzen. 2023. https://doi.org/10.18653/v1/2023.acl-long.629 How to plant trees in language models: Data and architectural effects on the emergence of syntactic inductive biases . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11237--11252, Toronto, Canada. Ass...
-
[23]
Howard Nicholas, Patsy M Lightbown, and Nina Spada. 2001. https://onlinelibrary.wiley.com/doi/abs/10.1111/0023-8333.00172 Recasts as feedback to language learners . Language learning, 51(4):719--758
-
[24]
Isabel Papadimitriou and Dan Jurafsky. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.554 L earning M usic H elps Y ou R ead: U sing transfer to study linguistic structure in language models . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6829--6839, Online. Association for Computational Linguistics
-
[25]
Steven T Piantadosi. 2023. https://lingbuzz.net/lingbuzz/007180 Modern language models refute chomsky’s approach to language . Lingbuzz Preprint, lingbuzz/007180
work page 2023
-
[26]
Ofir Press, Noah A. Smith, and Mike Lewis. 2021. https://doi.org/10.18653/v1/2021.acl-long.427 Shortformer: Better language modeling using shorter inputs . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5493...
-
[27]
Nafis Sadeq, Canwen Xu, and Julian McAuley. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.395 I nfor M ask: Unsupervised informative masking for language model pretraining . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5866--5878, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics
-
[28]
Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. https://doi.org/10.18653/v1/2020.acl-main.240 Masked language model scoring . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699--2712, Online. Association for Computational Linguistics
-
[29]
Jessica F Schwab and Casey Lew-Williams. 2016. https://doi.org/10.1002/wcs.1393 Language learning, socioeconomic status, and child-directed speech . Wiley Interdisciplinary Reviews: Cognitive Science, 7(4):264--275
-
[30]
Michael Tomasello. 1992. https://api.semanticscholar.org/CorpusID:145799530 The social bases of language acquisition . Social Development, 1:67--87
work page 1992
-
[31]
Marten van Schijndel, Aaron Mueller, and Tal Linzen. 2019. https://doi.org/10.18653/v1/D19-1592 Quantity doesn ' t buy quality syntax with neural language models . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5831--5...
-
[32]
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. https://dl.acm.org/doi/abs/10.5555/3454287.3454581 SuperGLUE : A S tickier B enchmark for G eneral- P urpose L anguage U nderstanding S ystems . In Proceedings of the 33rd International Conference on Neural Information Processi...
- [33]
-
[34]
Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Gotlieb Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Adina Williams, Bhargavi Paranjabe, Tal Linzen, and Ryan Cotterell. 2023. https://babylm.github.io Findings of the 2023 B aby LM C hallenge: S ample-efficient pretraining on developmentally plausible corpora . In Proceedings of the 2023 B aby LM...
work page 2023
-
[35]
Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2020 a . https://doi.org/10.1162/tacl_a_00321 BL i MP : The benchmark of linguistic minimal pairs for E nglish . Transactions of the Association for Computational Linguistics, 8:377--392
-
[36]
Alex Warstadt, Yian Zhang, Xiaocheng Li, Haokun Liu, and Samuel R. Bowman. 2020 b . https://doi.org/10.18653/v1/2020.emnlp-main.16 Learning which features matter: R o BERT a acquires a preference for linguistic generalizations (eventually) . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 217--235, ...
- [37]
-
[38]
Ethan Gotlieb Wilcox, Richard Futrell, and Roger Levy. 2022. https://doi.org/10.1162/ling_a_00491 Using computational models to test syntactic learnability . Linguistic Inquiry, pages 1--88
-
[39]
Yian Zhang, Alex Warstadt, Xiaocheng Li, and Samuel R. Bowman. 2021. https://doi.org/10.18653/v1/2021.acl-long.90 When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Paper...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.