Implicit Representations of Grammaticality in Language Models
Pith reviewed 2026-05-08 16:27 UTC · model grok-4.3
The pith
Language models develop an implicit grammaticality distinction in their hidden layers that is distinct from string probability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training a linear classifier on LM hidden representations to separate grammatical sentences from synthetically perturbed ungrammatical ones yields a probe that generalizes to human-curated grammaticality benchmarks, beats string-probability baselines on those benchmarks, performs worse than probability on semantic-plausibility pairs, shows cross-lingual transfer, and correlates only weakly with raw log-probabilities.
What carries the argument
A linear probe trained on LM hidden states to classify grammatical versus synthetically ungrammatical sentences.
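To make the setup concrete, here is a minimal sketch in Python. It assumes a Hugging Face causal LM with mean-pooled final-layer hidden states and scikit-learn logistic regression as the linear probe; the paper's actual model, layer choice, pooling, and probe hyperparameters are not specified in this review, and gpt2 plus the toy sentence pairs are stand-ins.

```python
# Minimal sketch of the probe setup. Assumptions (not from the paper):
# gpt2 as the LM, final-layer hidden states, mean pooling over tokens,
# and scikit-learn logistic regression as the linear probe.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def sentence_embedding(sentence: str) -> torch.Tensor:
    """Mean-pool the final-layer hidden states into one vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# Toy training pairs: grammatical (1) vs. synthetically perturbed (0).
train = [
    ("The cat sat on the mat.", 1),
    ("The the cat sat mat on.", 0),
    ("She has finished the report.", 1),
    ("She have finished the report.", 0),
]
X = torch.stack([sentence_embedding(s) for s, _ in train]).numpy()
y = [label for _, label in train]

probe = LogisticRegression(max_iter=1000).fit(X, y)  # the linear probe
```

The probe's score for a new sentence is then just a dot product with the learned weight vector, which is what makes the "linearly extractable" framing testable.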
If this is right
- The probe outperforms LM string probabilities on standard grammaticality judgment benchmarks.
- The probe underperforms string probabilities when both sentences in a pair are grammatical and differ only in plausibility.
- The English-trained probe transfers to grammaticality benchmarks in multiple other languages better than probability does.
- Probe scores correlate only weakly with the LM's own string probabilities.
Where Pith is reading between the lines
- Grammatical knowledge may be linearly extractable from LM representations even when overall likelihood does not sharply encode it.
- Different layers or subspaces may separately track syntactic well-formedness and semantic plausibility.
- The finding suggests a route for testing whether LMs acquire other abstract linguistic properties in their hidden states.
Load-bearing premise
The synthetic perturbations used to generate ungrammatical examples produce a representative sample of grammatical violations rather than artifacts that the probe can exploit.
What would settle it
Evaluating the same probe on a fresh set of ungrammatical sentences constructed by entirely different perturbation rules or by human linguists, and checking whether its advantage over probability disappears.
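To illustrate what "entirely different perturbation rules" could mean, here is a hypothetical sketch of two disjoint rule families. The paper's actual perturbation operations are not enumerated in this review, so both functions are illustrative assumptions, not the authors' method.

```python
# Hypothetical perturbation rules, illustrative only; the paper's
# actual operations are not enumerated in this review.
import random

def swap_adjacent_words(sentence: str) -> str:
    """Word-order violation: swap one random adjacent word pair."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = random.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def drop_function_word(sentence: str) -> str:
    """A different violation family: delete a determiner or preposition."""
    closed_class = {"the", "a", "an", "of", "to", "in", "on", "at"}
    words = sentence.split()
    hits = [i for i, w in enumerate(words) if w.lower() in closed_class]
    if not hits:
        return sentence
    del words[random.choice(hits)]
    return " ".join(words)

# Train the probe on swap-based corruptions only, then evaluate on
# deletion-based ones: if the advantage over string probability
# survives, it is less likely to be a rule-specific artifact.
```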
Original abstract
Grammaticality and likelihood are distinct notions in human language. Pretrained language models (LMs), which are probabilistic models of language fitted to maximize corpus likelihood, generate grammatically well-formed text and discriminate well between grammatical and ungrammatical sentences in tightly controlled minimal pairs. However, their string probabilities do not sharply discriminate between grammatical and ungrammatical sentences overall. But do LMs implicitly acquire a grammaticality distinction distinct from string probability? We explore this question through studying internal representations of LMs, by training a linear probe on a dataset of grammatical and (synthetic) ungrammatical sentences obtained by applying perturbations to a naturalistic text corpus. We find that this simple grammaticality probe generalizes to human-curated grammaticality judgment benchmarks and outperforms LM probability-based grammaticality judgments. When applied to semantic plausibility benchmarks, in which both members of a minimal pair are grammatical and differ in only plausibility, the probe however performs worse than string probability. The English-trained probe also exhibits nontrivial cross-lingual generalization, outperforming string probabilities on grammaticality benchmarks in numerous other languages. Additionally, probe scores correlate only weakly with string probabilities. These results collectively suggest that LMs acquire to some extent an implicit grammaticality distinction within their hidden layers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that pretrained language models implicitly encode a grammaticality distinction in their hidden-layer representations that is distinct from string probability. The authors train a linear probe on hidden states from grammatical sentences and synthetically perturbed ungrammatical sentences drawn from a naturalistic corpus, then show that the probe generalizes to human-curated grammaticality benchmarks (outperforming probability-based judgments), underperforms probability on semantic-plausibility minimal pairs, exhibits cross-lingual generalization, and correlates only weakly with string probabilities.
Significance. If the central claim holds, the work provides useful empirical evidence that LMs acquire grammatical knowledge separable from likelihood modeling, with implications for interpretability and probing methods. Strengths include the linear-probe setup, explicit comparison to string probability, the plausibility control task, and cross-lingual tests; these elements make the distinction falsifiable and worth further investigation.
major comments (3)
- [§3] §3 (Dataset and perturbation construction): The central claim that the probe captures an implicit grammaticality signal acquired by the LM rests on the assumption that synthetic perturbations produce violations whose statistical signatures match those of natural grammatical errors. The manuscript provides insufficient detail on the specific perturbation operations, their distribution, and any validation that they avoid consistent artifacts (e.g., local n-gram disruptions or length shifts) that a linear probe could exploit. This is load-bearing; without such validation or an ablation showing robustness to perturbation type, the generalization results cannot be confidently attributed to grammaticality rather than probe-detectable artifacts.
- [§4.3] §4.3 and Table 4 (cross-lingual results): The reported nontrivial cross-lingual generalization is a key supporting result, yet the manuscript does not clarify whether the same English-trained probe weights are applied directly to non-English hidden states or whether any alignment or language-specific adaptation occurs. This detail is necessary to interpret whether the probe is truly capturing a language-general grammaticality representation or benefiting from English-centric artifacts that happen to transfer.
- [§4.1] §4.1 (comparison to string probability): While the probe is shown to outperform raw string probability on grammaticality benchmarks, the paper should report the exact baseline implementation (e.g., whether probability is computed as normalized log-probability per token or sentence-level) and confirm that the probe's advantage is not an artifact of differing normalization or length sensitivity. This comparison is central to the claim of a distinct grammaticality signal.
minor comments (2)
- The abstract states that the probe 'outperforms LM probability-based grammaticality judgments' on grammaticality benchmarks; a table or figure directly comparing probe accuracy/F1 to probability baselines across all test sets would improve clarity.
- Minor notation inconsistency: 'string probabilities' and 'LM probability' are used interchangeably; a single defined term would reduce ambiguity in §2 and §4.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and will incorporate the necessary revisions.
Point-by-point responses
- Referee: [§3] §3 (Dataset and perturbation construction): The central claim that the probe captures an implicit grammaticality signal acquired by the LM rests on the assumption that synthetic perturbations produce violations whose statistical signatures match those of natural grammatical errors. The manuscript provides insufficient detail on the specific perturbation operations, their distribution, and any validation that they avoid consistent artifacts (e.g., local n-gram disruptions or length shifts) that a linear probe could exploit. This is load-bearing; without such validation or an ablation showing robustness to perturbation type, the generalization results cannot be confidently attributed to grammaticality rather than probe-detectable artifacts.
Authors: We agree that the description of the perturbation construction in §3 requires more detail to fully support the central claim. In the revised manuscript, we will expand this section to include a complete specification of the perturbation operations (such as the types of syntactic and morphological changes applied), their frequency distribution in the dataset, and quantitative validation that these perturbations do not introduce systematic artifacts like consistent n-gram disruptions or unintended length shifts. We will also include an ablation study demonstrating that the probe's performance remains consistent across variations in perturbation strategies, thereby strengthening the attribution to grammaticality. revision: yes
- Referee: [§4.3] §4.3 and Table 4 (cross-lingual results): The reported nontrivial cross-lingual generalization is a key supporting result, yet the manuscript does not clarify whether the same English-trained probe weights are applied directly to non-English hidden states or whether any alignment or language-specific adaptation occurs. This detail is necessary to interpret whether the probe is truly capturing a language-general grammaticality representation or benefiting from English-centric artifacts that happen to transfer.
Authors: We confirm that the probe is applied directly using the English-trained weights to the hidden states extracted from non-English sentences, with no alignment, fine-tuning, or language-specific adaptation. This direct transfer is what enables the cross-lingual evaluation. We will revise §4.3 to explicitly state this procedure and discuss its implications for the language-general nature of the representation. revision: yes
- Referee: [§4.1] §4.1 (comparison to string probability): While the probe is shown to outperform raw string probability on grammaticality benchmarks, the paper should report the exact baseline implementation (e.g., whether probability is computed as normalized log-probability per token or sentence-level) and confirm that the probe's advantage is not an artifact of differing normalization or length sensitivity. This comparison is central to the claim of a distinct grammaticality signal.
Authors: We appreciate this point regarding the baseline implementation. In the original manuscript, string probability is computed as the mean log-probability per token to account for sentence length. We will add explicit details in §4.1 on this normalization and include additional analysis confirming that the probe's superior performance on grammaticality benchmarks holds even when controlling for length sensitivity, ensuring the distinction from probability-based judgments is robust. revision: yes
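Two brief sketches make the responses above concrete. First, the §4.1 baseline as the authors describe it: mean log-probability per token under a causal LM. This is a minimal sketch assuming a Hugging Face model; gpt2 is a stand-in for the paper's LMs, and any further normalization details beyond per-token averaging are assumptions.

```python
# Sketch of the stated baseline: mean log-probability per token under
# a causal LM. gpt2 is a stand-in for the paper's models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def mean_logprob(sentence: str) -> float:
    """Average per-token log-probability, normalizing out length."""
    ids = tok(sentence, return_tensors="pt").input_ids  # (1, seq_len)
    with torch.no_grad():
        logits = lm(ids).logits  # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()  # mean, not sum: length-normalized
```

Second, the §4.3 direct transfer: the English-trained probe's weights applied unchanged to non-English hidden states. This reuses the sentence_embedding helper and fitted probe from the earlier probe sketch; the German minimal pair is an illustrative stand-in, and, per the authors' response, no alignment or adaptation step is applied.

```python
# Direct cross-lingual transfer, as the authors describe it: the
# English-trained probe's weights are reused unchanged on non-English
# hidden states. Reuses sentence_embedding and probe from above.
X_de = torch.stack([
    sentence_embedding("Der Hund schläft."),   # grammatical
    sentence_embedding("Der Hund schlafen."),  # agreement violation
]).numpy()
scores = probe.decision_function(X_de)  # same weights, new language
```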
Circularity Check
No circularity: empirical probe training on synthetic data with independent benchmark evaluation
full rationale
The paper's central claim rests on training a linear probe to classify grammatical vs. synthetically perturbed ungrammatical sentences from hidden states, then measuring generalization to separate human-curated grammaticality benchmarks, cross-lingual data, and plausibility minimal pairs. These evaluation sets are constructed independently of the training perturbations, and the reported weak correlation with string probability is measured directly rather than assumed. No derivation, equation, or self-citation reduces the distinction to a fitted parameter by construction, nor does any uniqueness theorem or ansatz from prior work by the authors close the loop. The setup is a standard supervised probe evaluation whose success is falsifiable on the held-out data.
Axiom & Free-Parameter Ledger
free parameters (1)
- linear probe weights
axioms (2)
- domain assumption: Grammaticality is linearly separable in the LM's hidden representations
- domain assumption: Perturbations applied to naturalistic sentences produce representative ungrammatical examples