MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs
5 Pith papers cite this work.
abstract
We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
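As an illustration of how a minimal-pair benchmark is commonly scored, the sketch below counts a pair as correct when the model scores the grammatical sentence above its ungrammatical counterpart, and reports the fraction of correct pairs. The abstract does not state the scoring rule, so this is an assumption. The names `minimal_pair_accuracy`, `toy_score`, and `toy_pairs` are hypothetical, and `toy_score` is only a stand-in for a real language model's sentence log-probability.

```python
from typing import Callable, Iterable, Tuple

# A minimal pair: (grammatical sentence, ungrammatical sentence).
MinimalPair = Tuple[str, str]


def minimal_pair_accuracy(
    pairs: Iterable[MinimalPair],
    score: Callable[[str], float],
) -> float:
    """Fraction of pairs where the grammatical sentence gets the higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical toy scorer (assumption): counts subject-verb bigrams that agree.
# In practice this would be replaced by a language model's log-probability.
TOY_AGREEING_BIGRAMS = {("dogs", "bark"), ("dog", "barks")}


def toy_score(sentence: str) -> float:
    words = sentence.lower().rstrip(".").split()
    return float(sum(1 for bigram in zip(words, words[1:])
                     if bigram in TOY_AGREEING_BIGRAMS))


# Two illustrative subject-verb agreement pairs (not taken from the benchmark).
toy_pairs = [
    ("The dogs bark.", "The dogs barks."),
    ("The dog barks.", "The dog bark."),
]

accuracy = minimal_pair_accuracy(toy_pairs, toy_score)
print(f"accuracy = {accuracy:.2f}")
```

A strict `>` comparison means ties count as errors, which is one possible design choice; a scorer that cannot tell the two sentences apart then gets no credit.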
fields
cs.CL 5
roles
background 1
polarities
background 1
representative citing papers
Presents a minimal-pair dataset and reports that probing experiments show language models differentiate light-verb from full-verb uses even in minimal contexts with separable patterns by object type.
Sparse crosscoders on LLM checkpoint triplets track emergence, maintenance, and discontinuation of linguistic features during pretraining via a new RelIE metric.
Different types of syntactic agreement recruit overlapping units within LLMs, indicating that agreement forms a meaningful functional category across English, Russian, Chinese, and structurally similar languages.
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.
citing papers explorer
- Implicit Representations of Grammaticality in Language Models
  Linear probes on LM hidden states detect grammaticality better than string probabilities, generalize to human benchmarks and other languages, and correlate weakly with likelihood.
- Light or Full Verb? A Minimal-Pair Dataset for Probing Phraseological Competence in Language Models
  Presents a minimal-pair dataset and reports that probing experiments show language models differentiate light-verb from full-verb uses even in minimal contexts with separable patterns by object type.