What Kind of Language is Easy to Language-Model Under Curriculum Learning?

Nadine El-Naggar; Tatsuki Kuribayashi; Ted Briscoe

arxiv: 2604.26844 · v1 · submitted 2026-04-29 · 💻 cs.CL

Pith reviewed 2026-05-07 11:46 UTC · model grok-4.3

classification 💻 cs.CL

keywords curriculum learning · language models · typology · inductive bias · word order · learning order · typological universals

The pith

Starting with simpler sentences substantially alters language models' apparent inductive bias for typological features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how curriculum learning, in which language models train on simpler sentences before moving to more complex ones, interacts with their inductive bias for different language structures. It compares training on input ordered from simple to complex against training on randomly ordered input, and measures which typological feature combinations the models find easier under each regime. The authors find that the curriculum substantially alters the models' apparent learning preferences for common versus rare language configurations. A reader would care because this shows that how a model is trained, and not only its architecture, can shift what counts as 'natural' for it.

Core claim

We expand existing LM-based exploration with a simple CL variant and find that CL substantially impacts the apparent inductive bias of LMs.

What carries the argument

Curriculum learning as a developmentally motivated scenario that orders input from simpler to more complex sentences rather than using random order.
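
As a rough illustration of the contrast described above, the following Python sketch orders a toy corpus shortest-first versus shuffled. Using token count as the simplicity measure is an assumption made here for illustration only; the abstract does not state how the paper ranks sentences, and `order_corpus` is a hypothetical helper, not the authors' code.

```python
import random


def order_corpus(sentences, curriculum=True, seed=0):
    """Return a training order: shortest-first if curriculum, else shuffled."""
    if curriculum:
        # Assumption: token count stands in for "simplicity".
        # sorted() is stable, so equal-length sentences keep their input order.
        return sorted(sentences, key=lambda s: len(s.split()))
    shuffled = list(sentences)
    # A seeded generator keeps the random baseline reproducible.
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy corpus, invented for this sketch.
corpus = [
    "the dog that the cat saw ran",
    "dogs run",
    "the cat saw the dog",
]
cl_order = order_corpus(corpus, curriculum=True)
random_order = order_corpus(corpus, curriculum=False)
```

Both orderings contain exactly the same sentences, so any difference between the two training runs would come from presentation order and not from the data itself.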

Load-bearing premise

The simple curriculum learning variant tested here is a valid proxy for developmentally motivated learning scenarios, and the chosen typological features and language models are representative enough for the findings to generalize.

What would settle it

Training language models on the same data with and without the curriculum ordering and observing no difference in their performance or preferences on rare versus common language types would falsify the substantial impact claim.
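
One way such a comparison could be quantified is a rank correlation between per-language perplexities under the two regimes. The sketch below is an assumption about how one might run the check, not the paper's analysis; the language labels and perplexity values are placeholders invented for illustration, and `ranks` and `spearman` are hypothetical helpers.

```python
def ranks(values):
    """Rank values from 1 (smallest); ties get the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder perplexities per artificial language (NOT results from the paper).
ppl_random = {"lang_A": 10.0, "lang_B": 12.0, "lang_C": 15.0, "lang_D": 20.0}
ppl_curriculum = {"lang_A": 14.0, "lang_B": 11.0, "lang_C": 18.0, "lang_D": 13.0}

languages = sorted(ppl_random)
rho = spearman(
    [ppl_random[name] for name in languages],
    [ppl_curriculum[name] for name in languages],
)
```

A correlation near 1 would mean the ordering of languages by difficulty is essentially unchanged by the curriculum, which is the outcome that would count against the claim. A real test would also need several random seeds per condition to separate a curriculum effect from run-to-run noise.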

Figures

Figures reproduced from arXiv: 2604.26844 by Nadine El-Naggar, Tatsuki Kuribayashi, Ted Briscoe.

**Figure 1.** Examples of sentences and their GCG derivation (somewhat simplified for space limitations).

**Figure 2.** Distributions of perplexities and typological plausibility across languages. The error bars indicate ...

**Figure 3.** Correlation of word order preference between different models.

**Figure 4.** Examples of short templates (lengths 3-10) being combined to create longer templates (lengths ...

**Figure 5.** The number of occurrences of the different categories for all templates lengths 3-10 (a), the ...

**Figure 6.** The average number of combinatory operations in the GCG derivation for the templates of ...

Original abstract

Many of the thousands of attested languages share common configurations of features, creating a spectrum from typologically very rare (e.g., object-verb-subject word order) or impossible languages to very common combinations of features (e.g., subject-object-verb word order). One central question is under what conditions such typological tendencies can be predicted, and specifically whether the learning bias of language models (LMs) is sufficient to reproduce such patterns. In this study, we add one dimensionality to such analysis -- the learning scenario for LMs -- to explore its interaction with the inductive bias of LMs. Specifically, as a first study, we examine the effect of curriculum learning (CL), as a developmentally motivated learning scenario, i.e., starting with simpler sentences rather than randomly-ordered input. We expand existing LM-based exploration (El-Naggar et al., 2025a,b) with a simple CL variant and find that CL substantially impacts the apparent inductive bias of LMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that curriculum learning (CL), implemented as a simple variant starting with simpler sentences rather than random ordering, substantially impacts the apparent inductive bias of language models (LMs) with respect to typological features such as word order and feature combinations, extending prior LM-based analyses by El-Naggar et al.

Significance. If the central result holds after addressing potential confounds, the work demonstrates that the learning scenario interacts with LM inductive biases in reproducing typological patterns, adding a developmentally motivated dimension to computational studies of language universals and acquisition.

major comments (1)

Abstract: The claim that CL 'substantially impacts the apparent inductive bias' is load-bearing on the assumption that the curriculum is neutral with respect to the typological dimensions under test. No operationalization is provided for how 'simpler sentences' are selected (e.g., by length, parse depth, lexical frequency, or explicit feature filtering). If the simplicity metric preferentially selects common configurations such as SOV over OSV, any measured shift could be an artifact of differential data exposure rather than altered learning dynamics.
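
The exposure confound raised here could be audited along the following lines. This is a minimal sketch under stated assumptions: each sentence is assumed to carry a configuration label (a hypothetical annotation, not something the abstract describes), token count is assumed as the simplicity metric, and `config_shares` and `exposure_gap` are illustrative helper names.

```python
from collections import Counter


def config_shares(items):
    """Relative frequency of each configuration label in (sentence, label) pairs."""
    counts = Counter(label for _, label in items)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def exposure_gap(corpus, early_fraction=0.25):
    """Share of each label in the earliest curriculum slice minus its overall share."""
    # Assumption: the curriculum sorts by token count, shortest first.
    ordered = sorted(corpus, key=lambda item: len(item[0].split()))
    cutoff = max(1, int(len(ordered) * early_fraction))
    early = config_shares(ordered[:cutoff])
    overall = config_shares(ordered)
    return {label: early.get(label, 0.0) - share for label, share in overall.items()}


# Toy labelled corpus, invented for this sketch; "X" and "Y" are placeholder labels.
toy_corpus = [
    ("a b", "X"),
    ("a b c", "X"),
    ("a b c d e", "Y"),
    ("a b c d e f", "Y"),
]
gap = exposure_gap(toy_corpus, early_fraction=0.5)
```

Gaps close to zero for every label would suggest the early curriculum stage mirrors the full corpus. In the toy corpus the shortest sentences are all labelled "X", so the gap is positive for "X" and negative for "Y", which is the kind of skew the referee warns about.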

minor comments (1)

Abstract: The citations to El-Naggar et al. (2025a,b) appear to reference forthcoming or preprint work; ensure full bibliographic details and confirmation that the current experiments are independent extensions.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the detailed review. We address the referee's concern about the curriculum operationalization below and will make revisions to clarify this aspect of the study.

Referee: Abstract: The claim that CL 'substantially impacts the apparent inductive bias' is load-bearing on the assumption that the curriculum is neutral with respect to the typological dimensions under test. No operationalization is provided for how 'simpler sentences' are selected (e.g., by length, parse depth, lexical frequency, or explicit feature filtering). If the simplicity metric preferentially selects common configurations such as SOV over OSV, any measured shift could be an artifact of differential data exposure rather than altered learning dynamics.

Authors: We agree that the abstract does not provide sufficient operationalization of how 'simpler sentences' are selected. We will revise the manuscript to include a clear description of the curriculum selection criterion in both the abstract and the methods section. Additionally, we will include an analysis demonstrating that the selected simpler sentences do not preferentially expose the model to common typological configurations, thereby ruling out the potential confound of differential data exposure. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison of CL vs random ordering rests on independent experimental runs rather than self-referential definitions or fits.

The paper reports an experimental finding that a simple curriculum learning variant alters apparent inductive bias relative to random-order baselines from prior work. No equations, fitted parameters, or predictions-by-construction are present. The self-citation to El-Naggar et al. (2025a,b) provides the baseline setup but does not justify the central claim by definition; the new CL runs constitute independent evidence. No ansatz, uniqueness theorem, or renaming of known results is invoked. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical study on language model training regimes. No mathematical derivations, free parameters, axioms, or invented entities are mentioned in the abstract.

pith-pipeline@v0.9.0 · 5471 in / 1079 out tokens · 45516 ms · 2026-05-07T11:46:47.938925+00:00 · methodology

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1] ... Noam Chomsky: The false promise of ChatGPT. The New York Times, 8. Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), ...

[2] ... Developmentally-plausible working memory shapes a critical period for language acquisition. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9386–9399, Vienna, Austria. Association for Computational Linguistics. Edith Moravcsik. 1978. Language contact. Universals of human languag...

[3] ... Targeted syntactic evaluation on the Chomsky hierarchy. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, pages 15595–15605. ELRA and ICCL. Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2009. Baby steps: How “less is more” in ...

[4] Concatenated with a conjunction (Fig. 4a),

[5] Embedded with a conjunction (Fig. 4b). The resulting longer templates are parsed to filter out ungrammatical ones. Because there are millions of valid templates of length 11-20, 20K templates are randomly sampled, and for each one, the lexicon is sampled. It is worth ...
Fairseq model: share-decoder-input-output-embed True · embed_dim 128 · ffn_embed_dim 512 · layer...
