Implicit Representations of Grammaticality in Language Models
Pith reviewed 2026-05-08 16:27 UTC · model grok-4.3
The pith
Language models develop an implicit grammaticality distinction in their hidden layers that is distinct from string probability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training a linear classifier on LM hidden representations to separate grammatical sentences from synthetically perturbed ungrammatical ones yields a probe that generalizes to human-curated grammaticality benchmarks, beats string-probability baselines on those benchmarks, performs worse than probability on semantic-plausibility pairs, shows cross-lingual transfer, and correlates only weakly with raw log-probabilities.
What carries the argument
A linear probe trained on LM hidden states to classify grammatical versus synthetically ungrammatical sentences.
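To make the setup concrete, here is a minimal sketch in Python. It assumes a Hugging Face causal LM with mean-pooled final-layer hidden states and scikit-learn logistic regression as the linear probe; the paper's actual model, layer choice, pooling, and probe hyperparameters are not specified in this review, and gpt2 plus the toy sentence pairs are stand-ins.

```python
# Minimal sketch of the probe setup. Assumptions (not from the paper):
# gpt2 as the LM, final-layer hidden states, mean pooling over tokens,
# and scikit-learn logistic regression as the linear probe.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def sentence_embedding(sentence: str) -> torch.Tensor:
    """Mean-pool the final-layer hidden states into one vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# Toy training pairs: grammatical (1) vs. synthetically perturbed (0).
train = [
    ("The cat sat on the mat.", 1),
    ("The the cat sat mat on.", 0),
    ("She has finished the report.", 1),
    ("She have finished the report.", 0),
]
X = torch.stack([sentence_embedding(s) for s, _ in train]).numpy()
y = [label for _, label in train]

probe = LogisticRegression(max_iter=1000).fit(X, y)  # the linear probe
```

The probe's score for a new sentence is then just a dot product with the learned weight vector, which is what makes the "linearly extractable" framing testable.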
If this is right
- The probe outperforms LM string probabilities on standard grammaticality judgment benchmarks.
- The probe underperforms string probabilities when both sentences in a pair are grammatical and differ only in plausibility.
- The English-trained probe transfers to grammaticality benchmarks in multiple other languages better than probability does.
- Probe scores correlate only weakly with the LM's own string probabilities.
Where Pith is reading between the lines
- Grammatical knowledge may be linearly extractable from LM representations even when overall likelihood does not sharply encode it.
- Different layers or subspaces may separately track syntactic well-formedness and semantic plausibility.
- The finding suggests a route for testing whether LMs acquire other abstract linguistic properties in their hidden states.
Load-bearing premise
The synthetic perturbations used to generate ungrammatical examples produce a representative sample of grammatical violations rather than artifacts that the probe can exploit.
What would settle it
Evaluating the same probe on a fresh set of ungrammatical sentences constructed by entirely different perturbation rules or by human linguists, and checking whether its advantage over probability disappears.
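To illustrate what "entirely different perturbation rules" could mean, here is a hypothetical sketch of two disjoint rule families. The paper's actual perturbation operations are not enumerated in this review, so both functions are illustrative assumptions, not the authors' method.

```python
# Hypothetical perturbation rules, illustrative only; the paper's
# actual operations are not enumerated in this review.
import random

def swap_adjacent_words(sentence: str) -> str:
    """Word-order violation: swap one random adjacent word pair."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = random.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def drop_function_word(sentence: str) -> str:
    """A different violation family: delete a determiner or preposition."""
    closed_class = {"the", "a", "an", "of", "to", "in", "on", "at"}
    words = sentence.split()
    hits = [i for i, w in enumerate(words) if w.lower() in closed_class]
    if not hits:
        return sentence
    del words[random.choice(hits)]
    return " ".join(words)

# Train the probe on swap-based corruptions only, then evaluate on
# deletion-based ones: if the advantage over string probability
# survives, it is less likely to be a rule-specific artifact.
```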
Original abstract
Grammaticality and likelihood are distinct notions in human language. Pretrained language models (LMs), which are probabilistic models of language fitted to maximize corpus likelihood, generate grammatically well-formed text and discriminate well between grammatical and ungrammatical sentences in tightly controlled minimal pairs. However, their string probabilities do not sharply discriminate between grammatical and ungrammatical sentences overall. But do LMs implicitly acquire a grammaticality distinction distinct from string probability? We explore this question through studying internal representations of LMs, by training a linear probe on a dataset of grammatical and (synthetic) ungrammatical sentences obtained by applying perturbations to a naturalistic text corpus. We find that this simple grammaticality probe generalizes to human-curated grammaticality judgment benchmarks and outperforms LM probability-based grammaticality judgments. When applied to semantic plausibility benchmarks, in which both members of a minimal pair are grammatical and differ in only plausibility, the probe however performs worse than string probability. The English-trained probe also exhibits nontrivial cross-lingual generalization, outperforming string probabilities on grammaticality benchmarks in numerous other languages. Additionally, probe scores correlate only weakly with string probabilities. These results collectively suggest that LMs acquire to some extent an implicit grammaticality distinction within their hidden layers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that pretrained language models implicitly encode a grammaticality distinction in their hidden-layer representations that is distinct from string probability. The authors train a linear probe on hidden states from grammatical sentences and synthetically perturbed ungrammatical sentences drawn from a naturalistic corpus, then show that the probe generalizes to human-curated grammaticality benchmarks (outperforming probability-based judgments), underperforms probability on semantic-plausibility minimal pairs, exhibits cross-lingual generalization, and correlates only weakly with string probabilities.
Significance. If the central claim holds, the work provides useful empirical evidence that LMs acquire grammatical knowledge separable from likelihood modeling, with implications for interpretability and probing methods. Strengths include the linear-probe setup, explicit comparison to string probability, the plausibility control task, and cross-lingual tests; these elements make the distinction falsifiable and worth further investigation.
major comments (3)
- [§3] §3 (Dataset and perturbation construction): The central claim that the probe captures an implicit grammaticality signal acquired by the LM rests on the assumption that synthetic perturbations produce violations whose statistical signatures match those of natural grammatical errors. The manuscript provides insufficient detail on the specific perturbation operations, their distribution, and any validation that they avoid consistent artifacts (e.g., local n-gram disruptions or length shifts) that a linear probe could exploit. This is load-bearing; without such validation or an ablation showing robustness to perturbation type, the generalization results cannot be confidently attributed to grammaticality rather than probe-detectable artifacts.
- [§4.3] §4.3 and Table 4 (cross-lingual results): The reported nontrivial cross-lingual generalization is a key supporting result, yet the manuscript does not clarify whether the same English-trained probe weights are applied directly to non-English hidden states or whether any alignment or language-specific adaptation occurs. This detail is necessary to interpret whether the probe is truly capturing a language-general grammaticality representation or benefiting from English-centric artifacts that happen to transfer.
- [§4.1] §4.1 (comparison to string probability): While the probe is shown to outperform raw string probability on grammaticality benchmarks, the paper should report the exact baseline implementation (e.g., whether probability is computed as normalized log-probability per token or sentence-level) and confirm that the probe's advantage is not an artifact of differing normalization or length sensitivity. This comparison is central to the claim of a distinct grammaticality signal.
minor comments (2)
- The abstract states that the probe 'outperforms LM probability-based grammaticality judgments' on grammaticality benchmarks; a table or figure directly comparing probe accuracy/F1 to probability baselines across all test sets would improve clarity.
- Minor notation inconsistency: 'string probabilities' and 'LM probability' are used interchangeably; a single defined term would reduce ambiguity in §2 and §4.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and will incorporate the necessary revisions.
Point-by-point responses
- Referee: [§3] §3 (Dataset and perturbation construction): The central claim that the probe captures an implicit grammaticality signal acquired by the LM rests on the assumption that synthetic perturbations produce violations whose statistical signatures match those of natural grammatical errors. The manuscript provides insufficient detail on the specific perturbation operations, their distribution, and any validation that they avoid consistent artifacts (e.g., local n-gram disruptions or length shifts) that a linear probe could exploit. This is load-bearing; without such validation or an ablation showing robustness to perturbation type, the generalization results cannot be confidently attributed to grammaticality rather than probe-detectable artifacts.
Authors: We agree that the description of the perturbation construction in §3 requires more detail to fully support the central claim. In the revised manuscript, we will expand this section to include a complete specification of the perturbation operations (such as the types of syntactic and morphological changes applied), their frequency distribution in the dataset, and quantitative validation that these perturbations do not introduce systematic artifacts like consistent n-gram disruptions or unintended length shifts. We will also include an ablation study demonstrating that the probe's performance remains consistent across variations in perturbation strategies, thereby strengthening the attribution to grammaticality. revision: yes
- Referee: [§4.3] §4.3 and Table 4 (cross-lingual results): The reported nontrivial cross-lingual generalization is a key supporting result, yet the manuscript does not clarify whether the same English-trained probe weights are applied directly to non-English hidden states or whether any alignment or language-specific adaptation occurs. This detail is necessary to interpret whether the probe is truly capturing a language-general grammaticality representation or benefiting from English-centric artifacts that happen to transfer.
Authors: We confirm that the probe is applied directly using the English-trained weights to the hidden states extracted from non-English sentences, with no alignment, fine-tuning, or language-specific adaptation. This direct transfer is what enables the cross-lingual evaluation. We will revise §4.3 to explicitly state this procedure and discuss its implications for the language-general nature of the representation. revision: yes
- Referee: [§4.1] §4.1 (comparison to string probability): While the probe is shown to outperform raw string probability on grammaticality benchmarks, the paper should report the exact baseline implementation (e.g., whether probability is computed as normalized log-probability per token or sentence-level) and confirm that the probe's advantage is not an artifact of differing normalization or length sensitivity. This comparison is central to the claim of a distinct grammaticality signal.
Authors: We appreciate this point regarding the baseline implementation. In the original manuscript, string probability is computed as the mean log-probability per token to account for sentence length. We will add explicit details in §4.1 on this normalization and include additional analysis confirming that the probe's superior performance on grammaticality benchmarks holds even when controlling for length sensitivity, ensuring the distinction from probability-based judgments is robust. revision: yes
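Two brief sketches make the responses above concrete. First, the §4.1 baseline as the authors describe it: mean log-probability per token under a causal LM. This is a minimal sketch assuming a Hugging Face model; gpt2 is a stand-in for the paper's LMs, and any further normalization details beyond per-token averaging are assumptions.

```python
# Sketch of the stated baseline: mean log-probability per token under
# a causal LM. gpt2 is a stand-in for the paper's models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def mean_logprob(sentence: str) -> float:
    """Average per-token log-probability, normalizing out length."""
    ids = tok(sentence, return_tensors="pt").input_ids  # (1, seq_len)
    with torch.no_grad():
        logits = lm(ids).logits  # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()  # mean, not sum: length-normalized
```

Second, the §4.3 direct transfer: the English-trained probe's weights applied unchanged to non-English hidden states. This reuses the sentence_embedding helper and fitted probe from the earlier probe sketch; the German minimal pair is an illustrative stand-in, and, per the authors' response, no alignment or adaptation step is applied.

```python
# Direct cross-lingual transfer, as the authors describe it: the
# English-trained probe's weights are reused unchanged on non-English
# hidden states. Reuses sentence_embedding and probe from above.
X_de = torch.stack([
    sentence_embedding("Der Hund schläft."),   # grammatical
    sentence_embedding("Der Hund schlafen."),  # agreement violation
]).numpy()
scores = probe.decision_function(X_de)  # same weights, new language
```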
Circularity Check
No circularity: empirical probe training on synthetic data with independent benchmark evaluation
full rationale
The paper's central claim rests on training a linear probe to classify grammatical vs. synthetically perturbed ungrammatical sentences from hidden states, then measuring generalization to separate human-curated grammaticality benchmarks, cross-lingual data, and plausibility minimal pairs. These evaluation sets are constructed independently of the training perturbations, and the reported weak correlation with string probability is measured directly rather than assumed. No derivation, equation, or self-citation reduces the distinction to a fitted parameter by construction, nor does any uniqueness theorem or ansatz from prior work by the authors close the loop. The setup is a standard supervised probe evaluation whose success is falsifiable on the held-out data.
Axiom & Free-Parameter Ledger
free parameters (1)
- linear probe weights
axioms (2)
- domain assumption: Grammaticality is linearly separable in the LM's hidden representations
- domain assumption: Perturbations applied to naturalistic sentences produce representative ungrammatical examples