
arxiv: 2404.18923 · v5 · submitted 2024-04-29 · 💻 cs.CL

Holmes: A Benchmark to Assess the Linguistic Competence of Language Models

Pith reviewed 2026-05-24 02:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords linguistic competence, probing, language models, benchmark, syntax, morphology, instruction tuning, FlashHolmes

The pith

A new benchmark shows language models' linguistic competence grows with size but is shaped by architecture and instruction tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Holmes, a benchmark that applies classifier-based probing to internal model representations in order to measure unconscious linguistic knowledge of phenomena such as syntax and morphology. This method is meant to separate that knowledge from other abilities, like instruction following, that appear in standard evaluations. The authors review over 270 probing studies, include more than 200 datasets, and test more than 50 language models. The results confirm a correlation between model size and linguistic competence while showing that architecture and instruction tuning produce additional, sizable differences, especially in morphology and syntax. A lighter variant called FlashHolmes is offered to reduce the cost of running the evaluation.

Core claim

Holmes demonstrates that linguistic competence in language models correlates with model size, yet model architecture and instruction tuning exert significant additional influence, especially on morphology and syntax tasks.

What carries the argument

Classifier-based probing of internal representations to isolate linguistic competence across syntax, morphology, semantics, reasoning, and discourse phenomena.
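To make the probing idea concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: the lookup table `FROZEN_EMBEDDINGS`, the toy tokens, and the binary noun/verb labels are all hypothetical stand-ins for representations taken from a frozen LM and for a real probing dataset. The only point it illustrates is that the encoder stays fixed while a small linear classifier is trained on top of it.

```python
import math
import random

# Hypothetical frozen "LM": a fixed lookup from token to a representation vector.
# In a real probing setup these vectors would come from a frozen model's layer.
FROZEN_EMBEDDINGS = {
    "cucumber": [0.9, 0.1, 0.2],
    "table": [0.8, 0.2, 0.1],
    "river": [0.7, 0.0, 0.3],
    "eat": [0.1, 0.9, 0.2],
    "run": [0.2, 0.8, 0.1],
    "sing": [0.0, 0.7, 0.3],
}


def encode(token):
    """Return the frozen representation; the encoder is never updated."""
    return FROZEN_EMBEDDINGS[token]


def train_linear_probe(examples, epochs=200, lr=0.5, seed=0):
    """Fit a logistic-regression probe (weights and bias) with plain SGD."""
    rng = random.Random(seed)
    dim = len(encode(examples[0][0]))
    weights, bias = [0.0] * dim, 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for token, label in data:
            x = encode(token)
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - label  # gradient of the log loss w.r.t. z
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict(probe, token):
    """Classify a token as noun (1) or verb (0) from its frozen vector."""
    weights, bias = probe
    z = sum(w * xi for w, xi in zip(weights, encode(token))) + bias
    return 1 if z > 0 else 0


# Toy labels: 1 = noun, 0 = verb (illustrative, not from any Holmes dataset).
train = [("cucumber", 1), ("table", 1), ("eat", 0), ("run", 0)]
test = [("river", 1), ("sing", 0)]

probe = train_linear_probe(train)
accuracy = sum(predict(probe, tok) == y for tok, y in test) / len(test)
print(f"probe accuracy on held-out tokens: {accuracy:.2f}")
```

The held-out score of the probe, not any generated text, is what serves as the proxy for how well the frozen representations encode the phenomenon.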

Load-bearing premise

Classifier probes on internal states can measure unconscious linguistic competence without being contaminated by other model abilities or by the choice of probe itself.

What would settle it

Finding that instruction-tuned models show no consistent difference from base models on morphology or syntax probes would undermine the claim that tuning influences linguistic competence.

Figures

Figures reproduced from arXiv: 2404.18923 by Andreas Waldis, Iryna Gurevych, Leshem Choshen, Yotam Perlitz, Yufang Hou.

Figure 1: In Holmes, we encode examples of probing datasets using frozen LMs. Then, we train probes (linear models) with labels representing the specific linguistic phenomenon under test. Finally, we use the results of testing the probes to approximate the LMs' linguistic competence regarding the tested phenomena.
… knowledge (Petroni et al., 2019b, 2020). However, such benchmarks rely on LMs' use of language (t…
Figure 2: Overview of Holmes (left) with the five phenomena types (right) and an example of probing-based evaluations for part-of-speech: encoding the input tokens and predicting the POS tag for cucumber, here NN.
… models (probes) using the internal representations of text inputs from the last model layer to predict the specific phenomena aspects. We then approximate the LMs' grasp of these phenomena using the probes' …
Figure 3: Citation analysis considering probing citations originating from the set of relevant work and every other citation (general citations). The color scale indicates the ratio (α) between them
Figure 4: Categorization of the selected studies by their
Figure 5: Overview of how many tasks single LMs cover and vice versa - single examples are highlighted
Figure 6: Cumulative coverage of LMs and tasks, con
Figure 7: A subset of Holmes rankings (↓) for various evaluated LMs. FLAN-UL2 outperforms the others overall, while different LMs prevail for the five distinct types of linguistic phenomena.
… $F1(y, \hat{y}) - F1(y', \hat{y}')$ as the difference between the probe trained with the original labels $y$ and the control task where we train the probe with randomly assigned labels $y'$. With a higher $S$, we assume the detected patterns…
Figure 8: Reliability evaluation of Holmes results to ensure low deviation across random seeds, high information compression (log), and high selectivity. Every dot represents the averaged results of one probing dataset across LMs. The x-axis represents the task metrics (either Pearson correlation or macro F1).
… prompting-based evaluations, where prompt paraphrasing leads to deviations of σ = 0.07 reported in Mizrahi …
Figure 10: Kendall-tau correlation within Holmes (left) and compared to OpenLLM (right). Green stars indicate significant correlations (p < 0.05)
Figure 11: Comparison of the phenomenon types for encoder and decoder LMs (left) and on the right, the accuracy of the top-20 most common tokens of the three part-of-speech probing datasets for BERT, RoBERTa, GPT2, Pythia, and Llama-2.
… saturated for morphology or syntax, encompassing a variety of token-level phenomena, like part-of-speech. We assume that the missing bi-directional encoding of decoder LMs causes thi…
Figure 12: Effect of scaling LM parameters considering the T5 and Pythia model families providing eight and five
Figure 13: Analysis of the reliability vs. efficiency
Figure 14: Overview of the composition of the probing input
Figure 15: Detailed Holmes vs. HELM (Liang et al., 2023) comparison for 40 open decoder models and 22 BLiMP datasets covering quantifier, island effects, irregular forms, and binding phenomena. We use the evaluation code of HELM and run the prompting-based adaptation (multiple choice joined). The Holmes and HELM results for 40 open decoder models. These results show the advantage of disentangled evaluation (Holmes) ove…
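The selectivity score quoted in the Figure 7 excerpt, the difference between the probe's F1 on the original labels and its F1 on a control task with randomly assigned labels, can be sketched as follows. This is an illustrative sketch, not the paper's code: the function names, the toy label lists, and the choice to build control labels by shuffling the gold labels are assumptions made here for the example.

```python
import random


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def make_control_labels(gold, seed=0):
    """Assumed control task: shuffle the labels so they carry no linguistic signal."""
    shuffled = list(gold)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def selectivity(gold, pred, control_gold, control_pred):
    """S = F1(y, y_hat) - F1(y', y_hat'), following the excerpt's definition."""
    return macro_f1(gold, pred) - macro_f1(control_gold, control_pred)


# Toy POS example (hypothetical predictions from a probe on the real labels).
gold = ["NN", "VB", "NN", "JJ"]
pred = ["NN", "VB", "NN", "NN"]

# Hypothetical control run: fixed "random" labels and a probe that can only
# fall back on the majority class because the labels are uninformative.
control_gold = ["VB", "NN", "JJ", "NN"]
control_pred = ["NN", "NN", "NN", "NN"]

s = selectivity(gold, pred, control_gold, control_pred)
print(f"selectivity S = {s:.3f}")
```

A probe that scores almost as well on the control labels as on the real ones is probably memorizing rather than reading out linguistic structure, which is why a larger gap is taken as better evidence.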

We introduce Holmes, a new benchmark designed to assess language models' (LMs') linguistic competence - their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs' internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. Composing Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version that reduces the computation load while maintaining high-ranking precision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Holmes, a benchmark that reviews over 270 probing studies and aggregates more than 200 datasets to evaluate language models' linguistic competence via classifier-based probing of internal representations across syntax, morphology, semantics, reasoning, and discourse. Analysis of more than 50 LMs shows that performance correlates with model size but is also affected by architecture and instruction tuning (especially in morphology and syntax); FlashHolmes is proposed as a reduced-compute variant that preserves ranking precision.

Significance. If the probing methodology isolates unconscious linguistic competence without contamination, the work supplies a large-scale, prompting-independent benchmark that can reveal design-factor effects beyond scale and supports more targeted model development.

major comments (2)
  1. [Abstract and Methods] Abstract and Methods (probing setup): The central claim that classifier-based probing disentangles linguistic competence from instruction-following rests on the assumption that probe results are invariant to classifier choice, regularization, and training regime. No ablations (linear vs. non-linear probes, different seeds, or regularization strengths) are reported to verify that model rankings or the architecture/instruction-tuning effects on morphology and syntax remain stable under these variations.
  2. [Results] Results section (architecture and tuning effects): The reported significant influence of model architecture and instruction tuning on morphology and syntax performance lacks explicit controls for potential confounds such as training-data overlap between the probed LMs and the 200+ datasets or differences in pretraining objectives; without these, the attribution to architecture/tuning rather than data leakage cannot be isolated.
minor comments (1)
  1. [FlashHolmes proposal] The description of FlashHolmes should include the exact subset selection criteria and quantitative ranking correlation (e.g., Spearman rho) with the full Holmes benchmark to allow independent verification.
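The ranking-agreement check requested in the minor comment can be illustrated with a small Spearman rho computation. This is a sketch under stated assumptions: the five model scores below are made-up placeholders, not results from the paper, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores for five hypothetical models on the full benchmark
# and on a reduced subset of datasets (illustrative numbers only).
full_scores = [0.71, 0.64, 0.58, 0.80, 0.49]
subset_scores = [0.69, 0.66, 0.50, 0.78, 0.52]

rho = spearman_rho(full_scores, subset_scores)
print(f"Spearman rho between full and reduced rankings: {rho:.2f}")
```

Here only the two lowest-ranked models swap places between the two score lists, so the rankings agree closely but not perfectly.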

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strengths and limitations of our work. We respond to each major comment below.

  1. Referee: [Abstract and Methods] Abstract and Methods (probing setup): The central claim that classifier-based probing disentangles linguistic competence from instruction-following rests on the assumption that probe results are invariant to classifier choice, regularization, and training regime. No ablations (linear vs. non-linear probes, different seeds, or regularization strengths) are reported to verify that model rankings or the architecture/instruction-tuning effects on morphology and syntax remain stable under these variations.

    Authors: We agree that additional ablations would strengthen the robustness claim. In the revised manuscript we will report results with both linear and non-linear (MLP) probes, multiple random seeds, and a range of regularization strengths to confirm that model rankings and the architecture/instruction-tuning effects remain stable. revision: yes

  2. Referee: [Results] Results section (architecture and tuning effects): The reported significant influence of model architecture and instruction tuning on morphology and syntax performance lacks explicit controls for potential confounds such as training-data overlap between the probed LMs and the 200+ datasets or differences in pretraining objectives; without these, the attribution to architecture/tuning rather than data leakage cannot be isolated.

    Authors: We acknowledge the potential confound. Because training data for most of the 50+ models is not fully public, exhaustive overlap checks across 200+ datasets are not feasible. We will add an explicit limitations paragraph discussing this issue and will verify overlaps where training data is known; the consistency of effects across independently collected probing datasets from over 270 studies provides supporting evidence that the architecture and tuning signals are not solely due to leakage. revision: partial
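Where a model's training corpus is available, the overlap verification mentioned in the second response could start from something as simple as an n-gram overlap ratio. The sketch below is an assumption about how such a check might look, not a procedure described in the paper; the function names, whitespace tokenization, and the trigram default are all illustrative choices.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(probe_example, corpus_docs, n=3):
    """Fraction of the probe example's n-grams that also occur in the corpus."""
    probe_grams = ngrams(probe_example, n)
    if not probe_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(probe_grams & corpus_grams) / len(probe_grams)


# Toy stand-ins for a known pretraining corpus and two probing examples.
corpus = ["the quick brown fox jumps over the lazy dog"]
seen_like = "The quick brown fox sleeps"
unseen = "a cucumber is green"

print(overlap_ratio(seen_like, corpus))  # two of three trigrams are shared
print(overlap_ratio(unseen, corpus))     # no trigram is shared
```

A high ratio flags an example for closer inspection; it does not by itself show that the probe result is caused by leakage.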

Circularity Check

0 steps flagged

Benchmark aggregates existing datasets; central claims are empirical observations with no reduction to fitted inputs or self-citation chains by construction


The paper compiles Holmes from more than 200 existing datasets drawn from a review of over 270 prior probing studies and applies classifier-based probing to more than 50 LMs to report size correlations plus architecture and instruction-tuning effects. No equations define a quantity in terms of itself, no parameters are fitted on a subset and then relabeled as predictions of closely related quantities, and no load-bearing premise rests on a self-citation whose validity is unverified outside the present work. The derivation chain consists of dataset aggregation followed by direct empirical measurement and is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that probing classifiers reliably extract linguistic knowledge and that the selected datasets validly represent the targeted phenomena; no free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: Classifier-based probing on frozen representations measures unconscious linguistic competence independent of task performance.
    Invoked in the abstract when stating the benchmark examines internal representations regarding linguistic phenomena.



Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 3 internal anchors

  1. [1]

    In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia

    What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia. Association for Computational Linguistics. Mike Conover, Matt Hayes, Ankit Mathur, Jianwei X...

  2. [2]

    In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023

    DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with disentangled at...

  3. [3]

    In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3849–3864, Online

    Discourse probing of pretrained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3849–3864, Online. Association for Computational Linguistics. Katarzyna Krasnowska-Kieraś and Alina Wróblewska. 2019. Empirical linguistic study ...

  4. [4]

    In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online

    BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics. Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, M...

  5. [5]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Holistic evaluation of language models. Transactions on Machine Learning Research. Featured Certification, Expert Certification. Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535. Yinhan Liu, Myle Ott, ...

  6. [6]

    Kyle Mahowald, Anna A

    Are emergent abilities in large language models just in-context learning? CoRR, abs/2309.01809. Kyle Mahowald, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. 2024. Dissociating language and thought in large language models. Trends in Cognitive Sciences. Peter Hugoe Matthews. 2014. The concise Oxford dict...

  7. [7]

    CoRR, abs/2401.00595

    State of what art? A call for multi-prompt LLM evaluation. CoRR, abs/2401.00595. Michael Mohler, Mary Brunson, Bryan Rink, and Marc Tomlinson. 2016. Introducing the LCC metaphor datasets. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 4221–4227, Portorož, Slovenia. European Language Resources ...

  8. [8]

    In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4497–4510, Florence, Italy

    DisSent: Learning sentence representations from explicit discourse relations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4497–4510, Florence, Italy. Association for Computational Linguistics. Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI:...

  9. [9]

    OpenAI blog, 1(8):9

    Language models are unsupervised multitask learners. OpenAI blog, 1(8):9. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu

  10. [10]

    The truth is in there: Improving reasoning in language models with layer-selective rank reduction. arXiv preprint arXiv:2312.13558,

    Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67. Robert Henry Robins. 2013. A short history of linguistics. Routledge. Pedro Rodriguez, Joe Barrow, Alexander Miserlis Hoyle, John P. Lalor, Robin Jia, and Jordan Boyd-Graber. 2021. Evaluation examples are not equally inf...

  11. [11]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual. Lucas Torroba Hennigen, Adina Williams, and Ryan Cotterell. 2020. Intrinsic probing through dimension sel...

  12. [12]

    How far can camels go? Exploring the state of instruction tuning on open resources. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, ...

  13. [13]

    CoRR, abs/2404.03818

    Probelm: Plausibility ranking evaluation for language models. CoRR, abs/2404.03818. Amir Zeldes. 2017. The GUM corpus: Creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581–612. Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. 2020. Do language embeddings capture scales? In Proceedings ...