Multilingual Embedding Probes Fail to Generalize Across Learner Corpora
Pith reviewed 2026-05-10 18:50 UTC · model grok-4.3
The pith
Multilingual embeddings do not encode a language-general representation of writing proficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Probes on hidden-state activations from Qwen3-Embedding models predict CEFR proficiency levels with a quadratic weighted kappa (QWK) of about 0.7 in-distribution and outperform surface baselines. In cross-corpus settings, performance collapses and probe predictions converge toward a uniform label distribution, which indicates that the embeddings encode corpus-specific distributional properties rather than an abstract, transferable proficiency dimension.
What carries the argument
Probing of hidden-state activations from Qwen3-Embedding models using linear and non-linear classifiers to predict CEFR levels, evaluated both in- and cross-corpus.
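The two evaluation settings can be laid out as a grid of train/test pairs. The sketch below is an assumption about the protocol, not the paper's documented procedure: it treats every ordered pair of distinct corpora as one cross-corpus run, and a corpus paired with itself as an in-distribution run (trained and tested on disjoint splits of that corpus). The function name `evaluation_pairs` and the corpus names are hypothetical placeholders.

```python
from itertools import product


def evaluation_pairs(corpora):
    """Return (train_corpus, test_corpus, setting) triples.

    Assumption for illustration: a pair with train == test stands for an
    in-distribution run on a held-out split of the same corpus; any other
    pair stands for a cross-corpus run.
    """
    pairs = []
    for train, test in product(corpora, repeat=2):
        setting = "in-distribution" if train == test else "cross-corpus"
        pairs.append((train, test, setting))
    return pairs


# Hypothetical placeholder names; the source only says there are nine corpora.
corpora = [f"corpus_{i}" for i in range(1, 10)]
pairs = evaluation_pairs(corpora)

n_in = sum(1 for _, _, s in pairs if s == "in-distribution")
n_cross = sum(1 for _, _, s in pairs if s == "cross-corpus")
print(n_in, n_cross)  # 9 72
```

A leave-one-corpus-out design (train on eight corpora, test on the ninth) would be an equally plausible reading; the source does not say which variant was used.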
If this is right
- In-distribution probes achieve QWK around 0.7 and outperform surface baselines.
- Middle layers provide the best predictions within a corpus.
- Cross-corpus performance collapses for all probe architectures and model sizes.
- Out-of-distribution probes converge to uniform label predictions.
- The learned mappings reflect corpus-specific properties such as topic and rating methodology.
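To make the QWK figures in the list above concrete, here is a minimal pure-Python sketch of quadratic weighted kappa. It assumes CEFR levels are encoded as integers 0 to 5 (A1 to C2), which is an illustrative choice and not something the source specifies; the example labels are invented. It also shows why a probe that stops discriminating between texts scores near zero: a constant prediction yields a kappa of 0.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights.

    Labels are integers in [0, n_classes). Returns 1 - (weighted observed
    disagreement) / (weighted disagreement expected by chance).
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    # Confusion matrix: observed[i][j] counts items with true i, predicted j.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected under independence
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * true_hist[i] * pred_hist[j] / n

    if den == 0:
        # Happens only when all true and predicted labels are one class.
        raise ValueError("kappa is undefined for a single shared class")
    return 1.0 - num / den


# Invented labels, A1..C2 encoded as 0..5 (an assumption for illustration).
y_true = [0, 1, 2, 3, 4, 5]
perfect = quadratic_weighted_kappa(y_true, y_true, 6)
constant = quadratic_weighted_kappa(y_true, [2] * 6, 6)
print(perfect, round(constant, 12))
```

Because the weights grow with the squared distance between levels, predicting B1 for a C2 text is penalized more heavily than predicting C1.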
Where Pith is reading between the lines
- Proficiency-adaptive language technologies built on embedding representations may not transfer well without corpus-specific adjustments.
- Similar probing experiments could reveal whether other abstract linguistic properties are encoded in a generalizable manner in multilingual models.
- The results suggest caution in using embedding activations as proxies for proficiency in multilingual settings without validation on target corpora.
Load-bearing premise
Differences in task type, rating methodology, or topic distribution across the nine corpora are not responsible for the observed performance collapse and uniform predictions.
What would settle it
Demonstrating sustained high cross-corpus performance after matching or statistically controlling for topic, language, task type, and rating methodology across the corpora.
Original abstract
Do multilingual embedding models encode a language-general representation of proficiency? We investigate this by training linear and non-linear probes on hidden-state activations from Qwen3-Embedding (0.6B, 4B, 8B) to predict CEFR proficiency levels from learner texts across nine corpora and seven languages. We compare five probing architectures against a baseline trained on surface-level text features. Under in-distribution evaluation, probes achieve strong performance ($QWK\approx0.7$), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions. However, in cross-corpus evaluation performance collapses across all probe types and model sizes. Residual analysis reveals that out-of-distribution probes converge towards predicting uniformly distributed labels, indicating that the learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension. These results suggest that current multilingual embeddings do not straightforwardly encode language-general proficiency, with implications for representation-based approaches to proficiency-adaptive language technology.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates whether multilingual embedding models encode a language-general representation of proficiency by training linear and non-linear probes on hidden states from Qwen3-Embedding (0.6B/4B/8B) to predict CEFR levels across nine learner corpora in seven languages. In-distribution probes reach QWK≈0.7 and outperform surface-feature baselines, with middle layers performing best, but cross-corpus performance collapses for all architectures and sizes; residual analysis shows OOD probes converge to uniform label predictions, leading to the conclusion that the probes capture corpus-specific properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension.
Significance. If the central interpretation is supported after addressing potential confounds, the result would be significant for representation learning in NLP: it would indicate that current multilingual embeddings do not provide a straightforward language-general proficiency signal, with direct implications for proficiency-adaptive language technologies and the design of future probing or fine-tuning approaches.
major comments (3)
- [§4] §4 (Cross-corpus evaluation): The performance collapse and uniform OOD predictions are interpreted as evidence that embeddings lack a transferable proficiency dimension, yet the evaluation does not include stratification, matching, or covariate adjustment for the listed corpus differences (task type, rating methodology, topic distribution); without such controls the results are equally consistent with probes learning spurious corpus signatures that correlate with CEFR labels within each corpus.
- [§3.2] §3.2 (Data and splits): Exact train/test splits, sample sizes per corpus, and any balancing for language or CEFR distribution are not reported, nor are statistical tests (e.g., paired significance tests on QWK differences or confidence intervals) provided to support the claim of consistent failure across probe types and model sizes.
- [§5] §5 (Residual analysis): The claim that OOD probes converge to uniformly distributed labels requires quantitative details on the uniformity metric, comparison to in-distribution label distributions, and verification that this pattern is not an artifact of class imbalance or probe regularization choices.
minor comments (2)
- [Abstract] The abstract states that middle layers yield the best predictions but does not report layer-wise QWK values or identify the specific layers (e.g., layer 12 vs. 24) for the three model sizes.
- [§3.1] Notation for the five probing architectures and the surface baseline should be introduced with explicit equations or pseudocode in §3.1 to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below. Where the suggestions identify gaps in reporting or analysis, we have revised the manuscript accordingly.
Point-by-point responses
Referee: [§4] §4 (Cross-corpus evaluation): The performance collapse and uniform OOD predictions are interpreted as evidence that embeddings lack a transferable proficiency dimension, yet the evaluation does not include stratification, matching, or covariate adjustment for the listed corpus differences (task type, rating methodology, topic distribution); without such controls the results are equally consistent with probes learning spurious corpus signatures that correlate with CEFR labels within each corpus.
Authors: We agree that corpus differences in task type, rating methodology, and topic distribution are potential confounds. The manuscript already notes these differences in §2 and §4. However, the systematic convergence of OOD probes to uniform label predictions (rather than to a shifted but still informative distribution) is difficult to explain by spurious within-corpus correlations alone; a probe capturing even a partially confounded proficiency signal should exhibit some positive transfer. To strengthen the analysis, we have added a new subsection to the revised §4 that discusses these confounds explicitly and reports results on a matched subset of corpora (balanced on task type and language where feasible), which reproduces the performance collapse. revision: partial
Referee: [§3.2] §3.2 (Data and splits): Exact train/test splits, sample sizes per corpus, and any balancing for language or CEFR distribution are not reported, nor are statistical tests (e.g., paired significance tests on QWK differences or confidence intervals) provided to support the claim of consistent failure across probe types and model sizes.
Authors: We accept this point. The revised §3.2 now reports exact train/test splits (80/20 random split per corpus, with stratification by CEFR level), a new table listing sample sizes and CEFR distributions per corpus, and details on any balancing applied. We have also added statistical support: paired Wilcoxon signed-rank tests on QWK differences between in-distribution and cross-corpus settings, together with 95% bootstrap confidence intervals. These additions confirm the consistent and statistically significant performance drop across probe architectures and model sizes. revision: yes
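The bootstrap confidence intervals mentioned in this response could be computed along the following lines. This is a sketch of a percentile bootstrap over paired QWK differences, not the authors' code: the function name, the number of resamples, the seed, and all QWK values are hypothetical and chosen only to make the example run. The Wilcoxon signed-rank test is not reproduced here.

```python
import random


def bootstrap_ci_mean_diff(in_dist, cross, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference in QWK.

    in_dist[k] and cross[k] are scores for the same probe configuration
    under the two evaluation settings.
    """
    if len(in_dist) != len(cross) or not in_dist:
        raise ValueError("need two non-empty, equal-length score lists")
    diffs = [a - b for a, b in zip(in_dist, cross)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    means = []
    for _ in range(n_boot):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi


# Invented scores for illustration only; these are NOT results from the paper.
in_dist_qwk = [0.71, 0.68, 0.73, 0.69, 0.70, 0.66]
cross_qwk = [0.08, 0.12, 0.03, 0.10, 0.05, 0.09]
mean_diff, lo, hi = bootstrap_ci_mean_diff(in_dist_qwk, cross_qwk)
print(f"mean drop {mean_diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would support the claim of a consistent drop; resampling paired differences rather than each setting separately keeps the comparison within each probe configuration.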
Referee: [§5] §5 (Residual analysis): The claim that OOD probes converge to uniformly distributed labels requires quantitative details on the uniformity metric, comparison to in-distribution label distributions, and verification that this pattern is not an artifact of class imbalance or probe regularization choices.
Authors: We have expanded the residual analysis in the revised §5. We now report the entropy of the predicted label distribution (approaching log(C) for OOD probes, indicating uniformity) and the KL divergence from the uniform distribution (near zero OOD, substantially higher in-distribution). In-distribution predictions closely match the empirical label distribution of each corpus. To address potential artifacts, we re-ran all probes using class-balanced sampling and a range of L2 regularization strengths (10^{-4} to 10^{-1}); the uniform convergence pattern persists across these controls. These quantitative results and robustness checks are now included. revision: yes
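The two uniformity measures named in this response are closely related, and a short sketch makes the relationship explicit. It assumes the divergence is taken as KL(p || U), with p the empirical distribution of predicted labels and U the uniform distribution over C classes; the source does not state the direction. Under that assumption KL(p || U) = log(C) - H(p), so entropy near log(C) and KL near zero are the same statement. The prediction lists are invented.

```python
import math
from collections import Counter


def label_distribution(preds, n_classes):
    """Empirical distribution of predicted labels over n_classes classes."""
    counts = Counter(preds)
    n = len(preds)
    return [counts.get(c, 0) / n for c in range(n_classes)]


def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_from_uniform(p):
    """KL(p || U) for U uniform over len(p) classes (assumed direction)."""
    c = len(p)
    return sum(x * math.log(x * c) for x in p if x > 0)


C = 6  # assumed number of CEFR levels (A1..C2)

# Invented predictions: one spread evenly, one concentrated on two levels.
uniform_like = label_distribution([0, 1, 2, 3, 4, 5] * 10, C)
peaked = label_distribution([2] * 40 + [3] * 20, C)

for name, p in [("uniform-like", uniform_like), ("peaked", peaked)]:
    print(name, round(entropy(p), 4), round(kl_from_uniform(p), 4))
```

Note that a probe collapsing to a single constant label would show the opposite signature (entropy 0, KL equal to log(C)), so these metrics distinguish uniform spreading from majority-class collapse.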
Circularity Check
No significant circularity; purely empirical evaluation
The paper is a self-contained empirical study that trains linear and non-linear probes on hidden states from Qwen3-Embedding models to predict CEFR levels, then measures in-distribution QWK performance and cross-corpus collapse. No derivations, equations, or fitted parameters are presented as predictions; all results are direct measurements on held-out data splits. The central claim follows from observed performance differences and residual analysis rather than any self-definition, self-citation chain, or ansatz. No load-bearing steps reduce to inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- probe architecture variants
- model sizes
axioms (1)
- domain assumption: CEFR proficiency levels constitute a consistent, language-general target variable across different learner corpora and rating methodologies.