Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

Linda Zeng; Michael C. Frank; Steven Y. Feng

arxiv: 2603.29552 · v2 · submitted 2026-03-31 · 💻 cs.CL · cs.AI· cs.LG

Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

Linda Zeng , Steven Y. Feng , Michael C. Frank This is my paper

Pith reviewed 2026-05-13 23:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords bilingual language acquisitionsmall language modelsGPT-2multilingual learningstatistical learningsynthetic datasetsperplexitygrammaticality

0 comments

The pith

Small GPT-2 models trained on matched bilingual data learn both languages without measurable cost to the first.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains small GPT-2 models on carefully matched 100-million-word datasets that are either monolingual or bilingual, using synthetic text and machine translation to control exposure. It tests whether bilingual training creates delays or trade-offs compared with monolingual training across perplexity, grammaticality judgments, and semantic tasks. The models show equivalent performance in the primary language while also reaching strong levels in the second language, and different bilingual exposure patterns produce nearly identical outcomes. This setup lets the authors isolate the effects of mixed input without the confounds present in child language data.

Core claim

Across model scales and evaluation measures, bilingual models perform similarly to monolingual models in one language while also demonstrating strong performance in the second language. No substantial differences appear between different bilingual exposure regimes, indicating that bilingual input creates no in-principle obstacles for agnostic statistical learners.

What carries the argument

Small-scale GPT-2 models trained on matched 100M-word mono- and bilingual synthetic datasets, used to simulate controlled exposure regimes and evaluated on perplexity, grammaticality, and semantic knowledge.

If this is right

Bilingual exposure does not inherently slow learning of either language for statistical models.
Varying the proportion or ordering of the two languages produces little difference in final performance.
Statistical learners can acquire two languages from mixed input without requiring language-specific biases or dedicated mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pattern holds for larger models, it would suggest that bilingual acquisition scales without additional architectural changes.
The results point toward testing whether human bilingual children show similar lack of trade-offs when input volume is strictly equated across languages.
Similar experiments could be run with different model families to check whether the outcome depends on the Transformer architecture.

Load-bearing premise

That training small GPT-2 models on 100 million synthetic words reproduces the core mechanisms and results of how children acquire one or two languages.

What would settle it

A direct comparison in which bilingual models trained on the same total word count as monolingual models show reliably higher perplexity or lower grammatical accuracy in the first language.

read the original abstract

Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in-principle challenges for agnostic statistical learners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Matched synthetic datasets let small LMs handle bilingual input without clear costs, though links to child learning are tentative.

read the letter

The main takeaway is that by creating matched 100M-word synthetic mono- and bilingual datasets and training GPT-2 models on them, the authors find bilingual models perform similarly to monolinguals on the primary language while also acquiring the second language effectively, with no strong differences across exposure regimes. This supports the view that bilingual input poses no fundamental challenges for these statistical learners. The work stands out for its controlled simulation setup. Using synthetic data and machine translation to match input conditions allows isolation of variables like language balance that are difficult in observational studies. Consistent results across model scales and evaluation metrics (perplexity, grammaticality, semantics) provide a solid computational testbed for these questions. Soft spots include the distance from human data. Small models on generated text may not fully capture the complexities of child multilingual acquisition, so the no-delay conclusion is best seen as specific to this artificial regime. The paper could benefit from more explicit details on dataset matching and any controls for data quality, as those are central to the claims. This is for colleagues in BabyLM or computational psycholinguistics who want to explore multilingual acquisition in controlled settings. It offers a replicable framework that readers can build on or critique. The setup is clean enough to warrant peer review, with potential for revisions on the interpretation side. I would recommend sending it for review.

Referee Report

2 major / 2 minor

Summary. The paper claims that training small GPT-2 models on precisely matched 100M-word monolingual and bilingual synthetic datasets (constructed via synthetic data and machine translation to reflect varied exposure regimes) yields bilingual models that perform comparably to monolingual baselines on one language while also acquiring the second language, as measured by perplexity, grammaticality, and semantic tasks; this implies no strong differences across bilingual regimes and no in-principle challenges for agnostic statistical learners.

Significance. If the results hold, the work supplies controlled computational evidence that bilingual input does not inherently impair statistical language acquisition, offering a useful complement to correlational child-language studies. The direct scale comparisons and use of synthetic data for matching constitute a clear methodological strength.

major comments (2)

§3 (Dataset construction): The manuscript provides insufficient detail on the precise data-matching procedures, machine-translation quality controls, and verification steps used to ensure equivalence between the 100M-word mono- and bilingual corpora; this information is load-bearing for the central claim that performance differences are attributable to exposure regime rather than corpus artifacts.
Evaluation and Results sections: Exact implementations of the grammaticality and semantic-knowledge probes are not fully specified, nor are statistical controls (e.g., run-to-run variance, multiple-comparison corrections, or equivalence tests) reported; without these, the assertion of 'no strong differences' across regimes rests on qualitative pattern descriptions rather than quantified robustness.

minor comments (2)

Abstract: The number of exposure regimes and the exact GPT-2 model scales (parameter counts) should be stated explicitly to allow readers to assess the scope of the 'across scales' claim.
Figure captions: Several figures comparing perplexity curves would benefit from error bars or shaded regions indicating variability across random seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We agree that providing more detailed information on dataset construction and evaluation methods will improve the clarity and replicability of our work, and we will revise the manuscript accordingly.

read point-by-point responses

Referee: §3 (Dataset construction): The manuscript provides insufficient detail on the precise data-matching procedures, machine-translation quality controls, and verification steps used to ensure equivalence between the 100M-word mono- and bilingual corpora; this information is load-bearing for the central claim that performance differences are attributable to exposure regime rather than corpus artifacts.

Authors: We appreciate the referee's emphasis on this critical aspect. The original manuscript describes constructing matched corpora using synthetic data generation and machine translation to simulate varied bilingual exposure regimes while keeping total word counts at 100M. However, we concur that additional specifics are necessary to fully substantiate the equivalence. In the revised version, we will augment §3 with precise descriptions of the data-matching procedures (including how sentence lengths, vocabulary distributions, and topic coverage were aligned), the machine translation pipeline (specifying the model, any fine-tuning, and quality assurance via automatic metrics like BLEU and COMET scores on test sets), and verification steps (such as statistical tests for distributional similarity and qualitative reviews). This will strengthen the argument that observed similarities in model performance are due to the exposure regimes rather than unintended corpus differences. revision: yes
Referee: Evaluation and Results sections: Exact implementations of the grammaticality and semantic-knowledge probes are not fully specified, nor are statistical controls (e.g., run-to-run variance, multiple-comparison corrections, or equivalence tests) reported; without these, the assertion of 'no strong differences' across regimes rests on qualitative pattern descriptions rather than quantified robustness.

Authors: We thank the referee for this observation, which will help make our claims more rigorous. The grammaticality probes consist of targeted tasks such as subject-verb agreement and word order judgments using cloze-style completions, while semantic-knowledge probes involve analogy and similarity judgments based on vector representations. We will fully specify these in the revision by including the exact task formulations, example items, and evaluation metrics. Furthermore, we will incorporate statistical controls by reporting means and standard deviations across 3-5 independent training runs with different random seeds, applying Bonferroni corrections for multiple comparisons where relevant, and conducting equivalence tests (e.g., two one-sided tests) to formally support the 'no strong differences' conclusion. These details will be added to the Evaluation and Results sections to provide a more robust quantitative foundation. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports empirical results obtained by constructing matched 100M-word mono- and bilingual synthetic datasets, training GPT-2 models of varying scales on them, and measuring performance on held-out perplexity, grammaticality, and semantic tasks. All central claims follow directly from these training runs and evaluations; no derivation chain, fitted parameter renamed as prediction, or self-citation load-bearing step reduces the outcomes to the inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LM training serves as a valid proxy for human acquisition and on modeling choices for dataset size and synthetic generation; no new entities are postulated.

free parameters (2)

Dataset size (100M words)
Selected to approximate cumulative child exposure but constitutes a specific modeling decision rather than a derived quantity.
GPT-2 model scales
Multiple scales tested; exact architecture and training hyperparameters are chosen parameters.

axioms (1)

domain assumption Small-scale transformer language models can serve as proxies for key aspects of human multilingual language acquisition
Invoked to justify using LM training outcomes as evidence about child learning processes.

pith-pipeline@v0.9.0 · 5497 in / 1237 out tokens · 61375 ms · 2026-05-13T23:49:31.686099+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We use language model training as a method for simulating a variety of highly controlled exposure conditions... bilingual input poses no in-principle challenges for agnostic statistical learners.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Across model scales and measures, bilingual models perform similarly to monolingual models in one language...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models
cs.CL 2026-05 conditional novelty 8.0

HalluWorld is a controlled benchmark using explicit reference world models to automatically label and disentangle hallucinations in LLMs across synthetic environments with varying complexity and observability.