pith. machine review for the scientific record.

arxiv: 2604.17930 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords language models · linguistic competence · BLiMP · data composition · synthetic data · pre-training · grammar acquisition

The pith

Small language models fix most grammar weaknesses when training data includes targeted examples of them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks why language models master some grammatical patterns near perfectly yet fail badly on others, even after training on enormous text collections. It tests whether these gaps arise because the missing patterns simply appear too rarely in ordinary web data. The authors pre-train a 124-million-parameter GPT-2 Small model on a 100-million-token FineWeb sample and then add just one percent synthetic sentences built around the weakest patterns. Accuracy rises on eight of the nine hardest test cases, including one that jumps from 20.9 percent to 69.4 percent correct, while overall scores hold steady or improve. The outcome is an existence proof that data exposure, rather than model size or architecture, can account for much of the uneven linguistic performance.
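
To make the intervention concrete, here is a minimal sketch of the corpus-mixing step in Python. It assumes crude whitespace token counting and hypothetical inputs (fineweb_docs, synthetic_sentences); the paper's open-sourced pipeline will differ in detail.

```python
import random

def build_training_corpus(fineweb_docs, synthetic_sentences,
                          total_tokens=100_000_000, synthetic_fraction=0.01,
                          seed=0):
    """Illustrative mixer: fill ~1% of the token budget with targeted
    synthetic sentences and the rest with web text, then shuffle so the
    synthetic material is spread uniformly through training.
    Hypothetical helper, not the paper's actual pipeline."""
    rng = random.Random(seed)
    synthetic_budget = int(total_tokens * synthetic_fraction)

    corpus, used = [], 0
    for sent in synthetic_sentences:
        n = len(sent.split())  # crude whitespace token count
        if used + n > synthetic_budget:
            break
        corpus.append(sent)
        used += n

    web_budget = total_tokens - used
    for doc in fineweb_docs:
        n = len(doc.split())
        if n > web_budget:
            break
        corpus.append(doc)
        web_budget -= n

    rng.shuffle(corpus)  # interleave synthetic sentences with web documents
    return corpus
```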

Core claim

Injecting one percent of synthetic data that targets specific linguistic constructions into the pre-training corpus of GPT-2 Small models produces large gains on eight of the nine BLiMP paradigms that had previously shown the weakest results. The paradigm only_npi_scope improves from 20.9 percent to 69.4 percent accuracy. Overall performance across all paradigms is preserved or slightly enhanced. One construction, principle_A_c_command, remains below chance even after the intervention.
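
For scale, each BLiMP paradigm is scored as the fraction of minimal pairs in which the model assigns higher probability to the acceptable sentence. Below is a minimal sketch of that scoring rule, assuming the Hugging Face transformers and torch packages; the paper's exact evaluation harness lives in its released code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_logprob(sentence: str) -> float:
    """Summed log-probability of a sentence under the model. The returned
    loss is the mean cross-entropy over the n-1 predicted tokens, so we
    multiply back by that count to recover a total log-probability."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

def paradigm_accuracy(pairs):
    """pairs: list of (grammatical, ungrammatical) minimal-pair strings.
    BLiMP accuracy is the fraction where the grammatical sentence wins."""
    hits = sum(sentence_logprob(good) > sentence_logprob(bad)
               for good, bad in pairs)
    return hits / len(pairs)
```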

What carries the argument

Targeted synthetic data injection that increases exposure to specific grammatical constructions during pre-training.
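
As a sketch of what targeted injection could look like for only_npi_scope, where the polarity item "ever" must fall inside the scope of "only": template-filled sentences over small word lists. The template and lexicon below are illustrative stand-ins, not the paper's actual generation grammar.

```python
import itertools
import random

# Hypothetical slots for the only_npi_scope paradigm; the paper's
# templates are not reproduced here.
SUBJECTS = ["the students", "the reviewers", "the children"]
VERBS = ["criticized", "praised", "questioned"]
OBJECTS = ["the proposal", "the findings", "the report"]
TEMPLATE = "Only {subj} have ever {verb} {obj}."

def generate_targeted_sentences(n, seed=0):
    """Enumerate slot combinations and sample n distinct realizations,
    yielding varied surface forms of one grammatical construction."""
    rng = random.Random(seed)
    combos = list(itertools.product(SUBJECTS, VERBS, OBJECTS))
    rng.shuffle(combos)
    return [TEMPLATE.format(subj=s, verb=v, obj=o) for s, v, o in combos[:n]]
```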

Load-bearing premise

The added synthetic sentences teach the model the actual grammatical rules rather than letting it memorize the new examples or change the training dynamics in ways that artificially raise the test scores.
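
One inexpensive probe of the memorization half of this premise is an n-gram overlap audit between the injected synthetic sentences and the BLiMP test items: near-zero overlap would rule out verbatim leakage, though not subtler distributional effects. A hypothetical sketch (the paper does not report such an audit):

```python
def ngram_set(text, n=8):
    """All n-grams of whitespace tokens in a lowercased sentence."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_report(synthetic_sentences, blimp_sentences, n=8):
    """Flag BLiMP test sentences sharing any n-gram with the synthetic
    training data, a crude proxy for verbatim memorization."""
    train_grams = set().union(*(ngram_set(s, n) for s in synthetic_sentences))
    flagged = [t for t in blimp_sentences if ngram_set(t, n) & train_grams]
    return len(flagged), flagged
```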

What would settle it

Running the same pre-training with an equal amount of random or non-targeted data added and finding that the targeted BLiMP paradigms show no improvement.
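
Stated as a decision rule over per-paradigm accuracies, assuming hypothetical result dictionaries from a baseline run, a random-injection arm, and a targeted-injection arm; the five-point materiality threshold is an arbitrary illustrative choice, not the paper's.

```python
def exposure_claim_survives(baseline, random_arm, targeted_arm,
                            threshold=0.05):
    """Each argument maps a BLiMP paradigm name to accuracy in [0, 1].
    The data-exposure explanation survives only if targeted injection
    lifts the weak paradigms while equal-volume random injection leaves
    them essentially unchanged."""
    return all(
        targeted_arm[p] - baseline[p] > threshold
        and abs(random_arm[p] - baseline[p]) <= threshold
        for p in baseline
    )
```

Under the paper's reported numbers and an invented random-arm figure, exposure_claim_survives({"only_npi_scope": 0.209}, {"only_npi_scope": 0.215}, {"only_npi_scope": 0.694}) would return True.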

Figures

Figures reproduced from arXiv: 2604.17930 by H S V N S Kowndinya Renduchintala and Sumit Bhatia.

Figure 1. BLiMP results by linguistic phenomenon (1/2).
Figure 2. BLiMP evaluation results by linguistic phenomenon (2/2).
Original abstract

Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they learn some linguistic phenomena with near-perfect mastery, they often perform below chance on others, even after training on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural limitations or simply the scarcity of these specific grammatical constructions in web-scale corpora. We pre-train simple GPT-2 Small (124M) models on a 100M-token random sample of the FineWeb corpus and intervene by injecting a minimal amount (1%) of synthetic data targeting specific linguistic phenomena. We find that this targeted intervention substantially improves model performance in 8 out of the 9 worst-performing BLiMP paradigms - notably the accuracy on a specific paradigm, only_npi_scope, surges from 20.9% to 69.4%. Furthermore, we observe that these interventions generally preserve or slightly improve aggregate performance. However, while we also identify a resistant phenomenon, principle_A_c_command, whose performance remains below chance even after our data augmentation, our findings do serve as an optimistic existence proof that even small language models can substantially improve on those linguistic phenomena on which models typically perform poorly, provided the pre-training data contains sufficient exposure to them. This suggests that efforts towards human-scale language modeling may benefit greatly by focusing on data composition. The code to reproduce our results is open-sourced at https://github.com/kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that disparities in language models' formal linguistic competence on BLiMP paradigms arise primarily from insufficient exposure to specific constructions in web-scale pre-training data rather than inherent architectural limits. By pre-training GPT-2 Small (124M) on a 100M-token FineWeb sample and injecting 1% targeted synthetic data, the authors report substantial gains on 8 of 9 underperforming paradigms (e.g., only_npi_scope accuracy rising from 20.9% to 69.4%), with aggregate performance preserved or slightly improved, while noting one resistant case (principle_A_c_command). This is presented as an existence proof that data composition can address such gaps, with code released for reproducibility.

Significance. If robust, the result would be significant as an empirical demonstration that minimal, targeted data augmentation can close large gaps in specific linguistic phenomena for small models, shifting emphasis toward data curation in scaling laws for formal competence. The open-sourced reproduction code is a clear strength supporting verification of the empirical measurements.

major comments (2)
  1. [Experimental Setup] The central claim that 1% synthetic data supplies genuine additional exposure (rather than memorization or non-specific training effects) is load-bearing but unsupported by ablations; no controls are described to test whether gains on paradigms such as only_npi_scope arise from the linguistic content versus finite-set memorization or shifts in token distribution/optimization trajectory.
  2. [Results] The resistant case of principle_A_c_command is noted but receives no mechanistic analysis or comparison to the successful cases, leaving open whether data exposure is uniformly the bottleneck or whether other factors differentiate the phenomena.
minor comments (2)
  1. [Abstract] Details on synthetic data generation (templates, sampling, filtering) and exact statistical controls (e.g., significance tests on accuracy deltas) are not summarized in the abstract and should be expanded in the main text for clarity.
  2. [Methods] The 100M-token FineWeb subsample size and the precise 1% injection ratio could be justified with reference to token counts or ablation on smaller fractions.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback and positive assessment of the work's significance. We address each major comment below, indicating planned revisions where appropriate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experimental Setup] The central claim that 1% synthetic data supplies genuine additional exposure (rather than memorization or non-specific training effects) is load-bearing but unsupported by ablations; no controls are described to test whether gains on paradigms such as only_npi_scope arise from the linguistic content versus finite-set memorization or shifts in token distribution/optimization trajectory.

    Authors: We agree that explicit controls would better isolate the contribution of the targeted linguistic structures. The synthetic data was produced via templates yielding diverse sentence realizations of each construction (rather than repeated identical examples), and the 1% injection was applied uniformly across training. To address the concern directly, we will add ablation experiments in the revision: one using synthetic data matched for token statistics but lacking the critical syntactic patterns (a sketch of one such control follows these responses), and another using non-targeted random text of equivalent volume. These will be reported alongside the existing results. revision: yes

  2. Referee: [Results] The resistant case of principle_A_c_command is noted but receives no mechanistic analysis or comparison to the successful cases, leaving open whether data exposure is uniformly the bottleneck or whether other factors differentiate the phenomena.

    Authors: We will expand the discussion section to compare principle_A_c_command with the eight improved paradigms, highlighting differences in rule complexity and potential interactions with other phenomena that may explain its resistance. However, a full mechanistic analysis (e.g., via probing or intervention studies) lies beyond the scope of the present work, which centers on demonstrating the impact of data composition rather than interpretability techniques. revision: partial
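
On the first ablation promised in response 1, one plausible construction of a token-statistics-matched control is to shuffle the words inside each targeted sentence, preserving unigram counts while destroying the syntactic pattern. A hypothetical sketch; the authors' actual recipe may differ.

```python
import random

def scramble_preserving_unigrams(sentences, seed=0):
    """Shuffle words within each sentence: token and unigram statistics
    are preserved exactly, but the targeted construction is destroyed.
    If this control also lifted BLiMP scores, the gains would not be
    attributable to the syntax."""
    rng = random.Random(seed)
    scrambled = []
    for sentence in sentences:
        toks = sentence.split()
        rng.shuffle(toks)
        scrambled.append(" ".join(toks))
    return scrambled
```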

standing simulated objections (unresolved)
  • A detailed mechanistic explanation for why principle_A_c_command remains resistant despite targeted data augmentation.

Circularity Check

0 steps flagged

No circularity: purely empirical pre-training intervention with external benchmark

Full rationale

The paper describes an experimental protocol of pre-training GPT-2 Small models on a 100M-token FineWeb sample, injecting 1% targeted synthetic data for specific BLiMP paradigms, and reporting accuracy changes (e.g., only_npi_scope from 20.9% to 69.4%) against the fixed external BLiMP test set. No equations, parameter fits, or derivations are present; results are direct empirical measurements. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes, and the resistant case (principle_A_c_command) is reported without forcing the outcome. The derivation chain is therefore self-contained as a controlled intervention study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that BLiMP paradigms are valid proxies for formal linguistic competence and that the synthetic injection isolates the effect of exposure; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: BLiMP test suites provide a valid and independent measure of formal linguistic competence in language models.
    All success claims are defined in terms of accuracy on these paradigms.

pith-pipeline@v0.9.0 · 5587 in / 1456 out tokens · 44796 ms · 2026-05-10T04:52:24.970168+00:00 · methodology

