Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?
Pith reviewed 2026-05-10 04:52 UTC · model grok-4.3
The pith
Small language models fix most grammar weaknesses when training data includes targeted examples of them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Injecting one percent of synthetic data that targets specific linguistic constructions into the pre-training corpus of GPT-2 Small models produces large gains on eight of the nine BLiMP paradigms that had previously shown the weakest results. The paradigm only_npi_scope improves from 20.9 percent to 69.4 percent accuracy. Overall performance across all paradigms is preserved or slightly enhanced. One construction, principle_A_c_command, remains below chance even after the intervention.
What carries the argument
Targeted synthetic data injection that increases exposure to specific grammatical constructions during pre-training.
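The paper's generation pipeline is not reproduced in this review; the sketch below only illustrates the kind of intervention being described, assuming simple sentence templates and a uniform 1% token budget. The helper names (make_npi_sentence, mix_corpus) and the template vocabulary are illustrative, not taken from the paper.

```python
import random

# Illustrative template for one construction (NPI licensing by "only");
# the paper's actual templates and vocabulary may differ.
SUBJECTS = ["students", "teachers", "doctors", "neighbors"]
VERBS = ["have", "would", "could"]
PREDICATES = ["ever complained", "ever objected", "ever left early"]

def make_npi_sentence(rng: random.Random) -> str:
    # "Only <subject> <verb> <predicate>." -- the NPI "ever" is licensed
    # because it sits inside the scope of "only".
    return f"Only {rng.choice(SUBJECTS)} {rng.choice(VERBS)} {rng.choice(PREDICATES)}."

def mix_corpus(web_docs: list[str], corpus_tokens: int, ratio: float = 0.01,
               seed: int = 0) -> list[str]:
    """Interleave synthetic sentences so they make up roughly `ratio` of all tokens."""
    rng = random.Random(seed)
    synthetic_budget = int(corpus_tokens * ratio)   # e.g. 1M of 100M tokens
    synthetic: list[str] = []
    tokens_so_far = 0
    while tokens_so_far < synthetic_budget:
        s = make_npi_sentence(rng)
        synthetic.append(s)
        tokens_so_far += len(s.split())             # crude whitespace token count
    mixed = web_docs + synthetic
    rng.shuffle(mixed)                              # spread the injection uniformly
    return mixed
```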
Load-bearing premise
The added synthetic sentences teach the model the underlying grammatical rules, rather than merely being memorized or shifting the training dynamics in ways that artificially inflate the test scores.
What would settle it
Running the same pre-training with an equal amount of random or non-targeted data added and finding that the targeted BLiMP paradigms show no improvement.
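One way to read the outcome of such a control, as an illustrative sketch: compare per-paradigm accuracy across a baseline run, a targeted-injection run, and a matched non-targeted run. The 0.209 and 0.694 figures for only_npi_scope are the paper's reported numbers; every other value, and the 0.05 decision margin, are placeholders.

```python
def attribute_gains(baseline: dict[str, float],
                    targeted: dict[str, float],
                    control: dict[str, float],
                    margin: float = 0.05) -> dict[str, str]:
    """Label each paradigm by whether its gain is specific to the targeted data."""
    verdicts = {}
    for p in baseline:
        gain_targeted = targeted[p] - baseline[p]
        gain_control = control[p] - baseline[p]
        if gain_targeted > margin and gain_control <= margin:
            verdicts[p] = "exposure-specific gain"
        elif gain_targeted > margin and gain_control > margin:
            verdicts[p] = "non-specific gain (training dynamics?)"
        else:
            verdicts[p] = "no reliable gain"
    return verdicts

# 0.209 / 0.694 are the reported only_npi_scope numbers; all other values are placeholders.
print(attribute_gains(
    baseline={"only_npi_scope": 0.209, "principle_A_c_command": 0.31},
    targeted={"only_npi_scope": 0.694, "principle_A_c_command": 0.33},
    control={"only_npi_scope": 0.22, "principle_A_c_command": 0.30},
))
```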
Original abstract
Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they learn some linguistic phenomena with near-perfect mastery, they often perform below chance on others, even after training on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural limitations or simply the scarcity of these specific grammatical constructions in web-scale corpora. We pre-train simple GPT-2 Small (124M) models on a 100M-token random sample of the FineWeb corpus and intervene by injecting a minimal amount (1%) of synthetic data targeting specific linguistic phenomena. We find that this targeted intervention substantially improves model performance in 8 out of the 9 worst-performing BLiMP paradigms - notably the accuracy on a specific paradigm, only_npi_scope, surges from 20.9% to 69.4%. Furthermore, we observe that these interventions generally preserve or slightly improve aggregate performance. However, while we also identify a resistant phenomenon, principle_A_c_command, whose performance remains below chance even after our data augmentation, our findings do serve as an optimistic existence proof that even small language models can substantially improve on those linguistic phenomena on which models typically perform poorly, provided the pre-training data contains sufficient exposure to them. This suggests that efforts towards human-scale language modeling may benefit greatly by focusing on data composition. The code to reproduce our results is open-sourced at https://github.com/kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that disparities in language models' formal linguistic competence on BLiMP paradigms arise primarily from insufficient exposure to specific constructions in web-scale pre-training data rather than inherent architectural limits. By pre-training GPT-2 Small (124M) on a 100M-token FineWeb sample and injecting 1% targeted synthetic data, the authors report substantial gains on 8 of 9 underperforming paradigms (e.g., only_npi_scope accuracy rising from 20.9% to 69.4%), with aggregate performance preserved or slightly improved, while noting one resistant case (principle_A_c_command). This is presented as an existence proof that data composition can address such gaps, with code released for reproducibility.
Significance. If robust, the result would be significant as an empirical demonstration that minimal, targeted data augmentation can close large gaps in specific linguistic phenomena for small models, shifting emphasis toward data curation in scaling laws for formal competence. The open-sourced reproduction code is a clear strength supporting verification of the empirical measurements.
major comments (2)
- [Experimental Setup] The central claim that 1% synthetic data supplies genuine additional exposure (rather than memorization or non-specific training effects) is load-bearing but unsupported by ablations; no controls are described to test whether gains on paradigms such as only_npi_scope arise from the linguistic content versus finite-set memorization or shifts in token distribution/optimization trajectory.
- [Results] The resistant case of principle_A_c_command is noted but receives no mechanistic analysis or comparison to the successful cases, leaving open whether data exposure is uniformly the bottleneck or whether other factors differentiate the phenomena.
minor comments (2)
- [Abstract] Details on synthetic data generation (templates, sampling, filtering) and exact statistical controls (e.g., significance tests on accuracy deltas) are not summarized in the abstract and should be expanded in the main text for clarity.
- [Methods] The 100M-token FineWeb subsample size and the precise 1% injection ratio could be justified with reference to token counts or ablation on smaller fractions.
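As a rough sanity check on what the 1% ratio means in absolute terms (the 15-tokens-per-sentence figure is an assumption, not a number from the paper):

```python
corpus_tokens = 100_000_000           # 100M-token FineWeb subsample
injection_ratio = 0.01                # 1% targeted synthetic data
tokens_per_sentence = 15              # assumed average length of a templated sentence

synthetic_tokens = int(corpus_tokens * injection_ratio)       # 1,000,000 tokens
synthetic_sentences = synthetic_tokens // tokens_per_sentence  # ~66,000 sentences
print(synthetic_tokens, synthetic_sentences)
```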
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the work's significance. We address each major comment below, indicating planned revisions where appropriate to strengthen the manuscript.
Point-by-point responses
- Referee: [Experimental Setup] The central claim that 1% synthetic data supplies genuine additional exposure (rather than memorization or non-specific training effects) is load-bearing but unsupported by ablations; no controls are described to test whether gains on paradigms such as only_npi_scope arise from the linguistic content versus finite-set memorization or shifts in token distribution/optimization trajectory.
Authors: We agree that explicit controls would better isolate the contribution of the targeted linguistic structures. The synthetic data was produced via templates yielding diverse sentence realizations of each construction (rather than repeated identical examples), and the 1% injection was applied uniformly across training. To address the concern directly, we will add ablation experiments in the revision: one using synthetic data matched for token statistics but lacking the critical syntactic patterns, and another using non-targeted random text of equivalent volume (a sketch of one way to build such a matched control appears after these responses). These will be reported alongside the existing results. revision: yes
- Referee: [Results] The resistant case of principle_A_c_command is noted but receives no mechanistic analysis or comparison to the successful cases, leaving open whether data exposure is uniformly the bottleneck or whether other factors differentiate the phenomena.
Authors: We will expand the discussion section to compare principle_A_c_command with the eight improved paradigms, highlighting differences in rule complexity and potential interactions with other phenomena that may explain its resistance. However, a full mechanistic analysis (e.g., via probing or intervention studies) lies beyond the scope of the present work, which centers on demonstrating the impact of data composition rather than interpretability techniques. revision: partial
- Remaining gap: a detailed mechanistic explanation for why principle_A_c_command remains resistant despite targeted data augmentation.
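The token-statistics-matched control promised in the first response is not specified further; below is a minimal sketch of one plausible construction, assuming that shuffling words within each synthetic sentence destroys the target syntactic configuration while preserving unigram statistics. Function names are illustrative, not from the paper.

```python
import random

def scramble_within_sentence(sentence: str, rng: random.Random) -> str:
    """Destroy the syntactic configuration while keeping the same tokens."""
    words = sentence.rstrip(".").split()
    rng.shuffle(words)
    return " ".join(words) + "."

def build_matched_control(targeted_sentences: list[str], seed: int = 0) -> list[str]:
    """Token-statistics-matched control: same words and counts, no target structure."""
    rng = random.Random(seed)
    return [scramble_within_sentence(s, rng) for s in targeted_sentences]

# Example: the scrambled version no longer places "ever" inside the scope of "only".
print(build_matched_control(["Only students have ever complained."]))
```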
Circularity Check
No circularity: the study is a purely empirical pre-training intervention evaluated against an external benchmark.
Full rationale
The paper describes an experimental protocol of pre-training GPT-2 Small models on a 100M-token FineWeb sample, injecting 1% targeted synthetic data for specific BLiMP paradigms, and reporting accuracy changes (e.g., only_npi_scope from 20.9% to 69.4%) against the fixed external BLiMP test set. No equations, parameter fits, or derivations are present; results are direct empirical measurements. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes, and the resistant case (principle_A_c_command) is reported without forcing the outcome. The derivation chain is therefore self-contained as a controlled intervention study.
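The accuracy figures quoted throughout come from BLiMP's forced-choice protocol: a paradigm's score is the fraction of minimal pairs in which the model assigns higher probability to the acceptable sentence. Below is a minimal scoring sketch, assuming a Hugging Face causal LM and the public blimp dataset with sentence_good/sentence_bad fields; the paper's own evaluation harness may differ.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" here is a stand-in for the paper's trained 124M checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    out = model(ids, labels=ids)                  # mean cross-entropy over predicted tokens
    return -out.loss.item() * (ids.shape[1] - 1)  # sum of per-token log-probs

def paradigm_accuracy(paradigm: str) -> float:
    pairs = load_dataset("blimp", paradigm, split="train")
    correct = sum(
        sentence_logprob(ex["sentence_good"]) > sentence_logprob(ex["sentence_bad"])
        for ex in pairs
    )
    return correct / len(pairs)

# The paper reports 20.9% before and 69.4% after injection for its own checkpoints.
print(paradigm_accuracy("only_npi_scope"))
```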
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: BLiMP test suites provide a valid and independent measure of formal linguistic competence in language models.