Multilingual Embedding Probes Fail to Generalize Across Learner Corpora

Laurits Lyngbaek; Ross Deans Kristensen-McLachlan

arxiv: 2604.07095 · v1 · submitted 2026-04-08 · 💻 cs.CL

Pith reviewed 2026-05-10 18:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords multilingual embeddings · CEFR proficiency · learner corpora · probing · cross-corpus evaluation · proficiency prediction · Qwen3-Embedding · generalization failure

The pith

Multilingual embeddings do not encode a language-general representation of writing proficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether multilingual embeddings contain a language-general proficiency signal by probing activations from Qwen3-Embedding models on CEFR level prediction tasks using nine learner corpora in seven languages. Probes perform well when trained and tested on the same corpus, exceeding a surface-feature baseline, but their accuracy drops dramatically when applied to new corpora. Analysis shows that out-of-distribution probes default to uniform label predictions, meaning they have fitted to corpus-specific traits such as topic, language, task, and rating methods. This finding indicates that representation-based methods for proficiency-aware language technology cannot rely on off-the-shelf embeddings for cross-corpus or cross-language transfer without additional adaptation.

Core claim

Probes on hidden-state activations from Qwen3-Embedding models predict CEFR proficiency levels with quadratic weighted kappa around 0.7 in-distribution and outperform surface baselines, but performance collapses in cross-corpus settings as probes converge to uniform predictions, showing that the embeddings encode corpus-specific distributional properties rather than an abstract, transferable proficiency dimension.
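The metric behind the 0.7 figure can be sketched directly. The following is a minimal pure-Python version of quadratic weighted kappa for ordinal labels, not the authors' evaluation code. It assumes CEFR levels are mapped to integers 0 to 5 (A1 to C2); the mapping and the function name are illustrative choices, and the paper's corpora may use a different label set.

```python
from collections import Counter

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic weights for labels 0..n_classes-1."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    observed = Counter(zip(y_true, y_pred))  # joint counts O[i, j]
    true_hist = Counter(y_true)              # marginal counts of gold labels
    pred_hist = Counter(y_pred)              # marginal counts of predictions
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected under independence
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[(i, j)]
            den += w * true_hist[i] * pred_hist[j] / n
    # den is zero only when gold and predicted labels are the same constant;
    # treating that as perfect agreement is a convention chosen here.
    return 1.0 if den == 0 else 1.0 - num / den

# Assumed mapping: A1=0, A2=1, B1=2, B2=3, C1=4, C2=5.
gold = [0, 1, 2, 3, 4, 5]
perfect = quadratic_weighted_kappa(gold, gold, 6)
constant = quadratic_weighted_kappa(gold, [2] * 6, 6)
near_miss = quadratic_weighted_kappa(gold, [1, 1, 2, 3, 4, 4], 6)
```

A probe that always predicts the same level scores 0 under this metric, because its observed disagreement equals the disagreement expected by chance. Predictions that are off by one level are penalized far less than predictions that are off by several.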

What carries the argument

Probing of hidden-state activations from Qwen3-Embedding models using linear and non-linear classifiers to predict CEFR levels, evaluated both in- and cross-corpus.
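The difference between the two evaluation settings can be illustrated with a toy sketch. Everything here is hypothetical: the one-dimensional "activations", the two synthetic corpora with different offsets, the 80/20 split, and the nearest-centroid classifier, which stands in for the paper's probes and is not one of its five architectures. Accuracy is used instead of QWK to keep the block short.

```python
import random

def make_corpus(offset, n_per_level=50, n_levels=3, seed=0):
    """Synthetic corpus: feature = level + corpus-specific offset + noise."""
    rng = random.Random(seed)
    return [(level + offset + rng.gauss(0, 0.2), level)
            for level in range(n_levels) for _ in range(n_per_level)]

def split(data, test_frac=0.2, seed=0):
    """Shuffle and split into train and held-out test portions."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def fit_centroids(train):
    """Stand-in probe: mean feature value per proficiency level."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Assign the level whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def accuracy(centroids, data):
    return sum(predict(centroids, x) == y for x, y in data) / len(data)

corpus_a = make_corpus(offset=0.0, seed=1)
corpus_b = make_corpus(offset=2.0, seed=2)  # same levels, shifted features

train_a, test_a = split(corpus_a)
probe = fit_centroids(train_a)

iid_acc = accuracy(probe, test_a)    # same corpus, held-out texts
ood_acc = accuracy(probe, corpus_b)  # different corpus, never seen in training
```

In this toy the probe separates levels well on held-out texts from its own corpus but misreads the shifted corpus, because it learned where corpus A's levels sit rather than anything about level itself. The toy shows only that one failure mechanism; it does not reproduce the paper's data or the uniform-prediction pattern it reports.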

If this is right

In-distribution probes achieve QWK around 0.7 and outperform surface baselines.
Middle layers provide the best predictions within a corpus.
Cross-corpus performance collapses for all probe architectures and model sizes.
Out-of-distribution probes converge to uniform label predictions.
The learned mappings reflect corpus-specific properties such as topic and rating methodology.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Proficiency-adaptive language technologies built on embedding representations may not transfer well without corpus-specific adjustments.
Similar probing experiments could reveal whether other abstract linguistic properties are encoded in a generalizable manner in multilingual models.
The results suggest caution in using embedding activations as proxies for proficiency in multilingual settings without validation on target corpora.

Load-bearing premise

Differences in task type, rating methodology, or topic distribution across the nine corpora are not responsible for the observed performance collapse and uniform predictions.

What would settle it

Demonstrating sustained high cross-corpus performance after matching or statistically controlling for topic, language, task type, and rating methodology across the corpora.

Figures

Figures reproduced from arXiv: 2604.07095 by Laurits Lyngbaek, Ross Deans Kristensen-McLachlan.

Figure 2: Visualization of the potential sources of bias leading to an over-fitted proficiency

Figure 3: Layer-wise trend of Quadratic Weighted Kappa of probes across hidden layers.

Figure 4: Ridgeplot of predicted distribution (upper plot) and residuals of predictions

Figure 5: Mean QWK for the Qwen3-4B MLP regression probe in the IID and OOD condition.

Figure 6: Ridgeplot of development of prediction error. The upper two plots show the

Figure 7: Mean probe QWK across datasets for Qwen3-8B.

Original abstract

Do multilingual embedding models encode a language-general representation of proficiency? We investigate this by training linear and non-linear probes on hidden-state activations from Qwen3-Embedding (0.6B, 4B, 8B) to predict CEFR proficiency levels from learner texts across nine corpora and seven languages. We compare five probing architectures against a baseline trained on surface-level text features. Under in-distribution evaluation, probes achieve strong performance ($QWK\approx0.7$), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions. However, in cross-corpus evaluation performance collapses across all probe types and model sizes. Residual analysis reveals that out-of-distribution probes converge towards predicting uniformly distributed labels, indicating that the learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension. These results suggest that current multilingual embeddings do not straightforwardly encode language-general proficiency, with implications for representation-based approaches to proficiency-adaptive language technology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

The paper shows that probes on multilingual embeddings predict CEFR levels within a corpus but collapse when transferred to any other corpus in the nine-corpus set, drifting toward uniformly distributed label predictions. This points to the probes capturing corpus-specific properties rather than a general proficiency signal.

The main thing here is that in-distribution probing works reasonably well but cross-corpus evaluation tanks for every probe type and model size tested. Out of distribution, the probes' predictions drift toward a uniform distribution over labels, which the authors tie to corpus-specific properties like topic, task type, and rating methodology instead of any abstract proficiency dimension that transfers across learner data sets.

Referee Report

3 major / 2 minor

Summary. The manuscript investigates whether multilingual embedding models encode a language-general representation of proficiency by training linear and non-linear probes on hidden states from Qwen3-Embedding (0.6B/4B/8B) to predict CEFR levels across nine learner corpora in seven languages. In-distribution probes reach QWK≈0.7 and outperform surface-feature baselines, with middle layers performing best, but cross-corpus performance collapses for all architectures and sizes; residual analysis shows OOD probes converge to uniform label predictions, leading to the conclusion that the probes capture corpus-specific properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension.

Significance. If the central interpretation is supported after addressing potential confounds, the result would be significant for representation learning in NLP: it would indicate that current multilingual embeddings do not provide a straightforward language-general proficiency signal, with direct implications for proficiency-adaptive language technologies and the design of future probing or fine-tuning approaches.

major comments (3)

[§4] Cross-corpus evaluation. The performance collapse and uniform OOD predictions are interpreted as evidence that embeddings lack a transferable proficiency dimension, yet the evaluation does not include stratification, matching, or covariate adjustment for the listed corpus differences (task type, rating methodology, topic distribution); without such controls the results are equally consistent with probes learning spurious corpus signatures that correlate with CEFR labels within each corpus.
[§3.2] Data and splits. Exact train/test splits, sample sizes per corpus, and any balancing for language or CEFR distribution are not reported, nor are statistical tests (e.g., paired significance tests on QWK differences or confidence intervals) provided to support the claim of consistent failure across probe types and model sizes.
[§5] Residual analysis. The claim that OOD probes converge to uniformly distributed labels requires quantitative details on the uniformity metric, comparison to in-distribution label distributions, and verification that this pattern is not an artifact of class imbalance or probe regularization choices.

minor comments (2)

[Abstract] The abstract states that middle layers yield the best predictions but does not report layer-wise QWK values or identify the specific layers (e.g., layer 12 vs. 24) for the three model sizes.
[§3.1] Notation for the five probing architectures and the surface baseline should be introduced with explicit equations or pseudocode in §3.1 to improve reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below. Where the suggestions identify gaps in reporting or analysis, we have revised the manuscript accordingly.

Referee: [§4] Cross-corpus evaluation. The performance collapse and uniform OOD predictions are interpreted as evidence that embeddings lack a transferable proficiency dimension, yet the evaluation does not include stratification, matching, or covariate adjustment for the listed corpus differences (task type, rating methodology, topic distribution); without such controls the results are equally consistent with probes learning spurious corpus signatures that correlate with CEFR labels within each corpus.

Authors: We agree that corpus differences in task type, rating methodology, and topic distribution represent potential confounds. The manuscript already notes these differences in §2 and §4. However, the systematic convergence of OOD probes to uniform label predictions (rather than to a shifted but still informative distribution) is difficult to explain solely by spurious within-corpus correlations; a probe capturing even a partially confounded proficiency signal should exhibit some positive transfer. To strengthen the analysis, we have added a new subsection in the revised §4 that discusses these confounds explicitly and reports results on a matched subset of corpora (balanced on task type and language where feasible), which reproduces the performance collapse. revision: partial

Referee: [§3.2] Data and splits. Exact train/test splits, sample sizes per corpus, and any balancing for language or CEFR distribution are not reported, nor are statistical tests (e.g., paired significance tests on QWK differences or confidence intervals) provided to support the claim of consistent failure across probe types and model sizes.

Authors: We accept this point. The revised §3.2 now reports exact train/test splits (80/20 random split per corpus, with stratification by CEFR level), a new table listing sample sizes and CEFR distributions per corpus, and details on any balancing applied. We have also added statistical support: paired Wilcoxon signed-rank tests on QWK differences between in-distribution and cross-corpus settings, together with 95% bootstrap confidence intervals. These additions confirm the consistent and statistically significant performance drop across probe architectures and model sizes. revision: yes
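The confidence-interval part of this response can be sketched as follows. This is a minimal percentile bootstrap over paired QWK differences, written under stated assumptions: the function name, the number of resamples, and above all the difference values are placeholders invented for illustration, not results from the paper. The Wilcoxon signed-rank test mentioned in the response is not covered here.

```python
import random

def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample mean.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    low = means[int((alpha / 2) * n_boot)]
    high = means[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# PLACEHOLDER values: one (in-distribution QWK minus cross-corpus QWK)
# difference per hypothetical probe configuration. Not taken from the paper.
diffs = [0.62, 0.55, 0.70, 0.48, 0.66, 0.59]

low, high = bootstrap_ci(diffs)
mean_diff = sum(diffs) / len(diffs)
```

If the lower bound of the interval stays above zero, the drop from in-distribution to cross-corpus performance is unlikely to be a resampling artifact. Pairing by configuration matters because each probe is compared against itself under the two conditions.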

Referee: [§5] Residual analysis. The claim that OOD probes converge to uniformly distributed labels requires quantitative details on the uniformity metric, comparison to in-distribution label distributions, and verification that this pattern is not an artifact of class imbalance or probe regularization choices.

Authors: We have expanded the residual analysis in the revised §5. We now report the entropy of the predicted label distribution (approaching $\log C$ for OOD probes, where $C$ is the number of classes, indicating uniformity) and the KL divergence from the uniform distribution (near zero OOD, substantially higher in-distribution). In-distribution predictions closely match the empirical label distribution of each corpus. To address potential artifacts, we re-ran all probes using class-balanced sampling and a range of L2 regularization strengths ($10^{-4}$ to $10^{-1}$); the uniform convergence pattern persists across these controls. These quantitative results and robustness checks are now included. revision: yes
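The two uniformity measures named in this response are simple to compute from a list of predicted labels. The sketch below is illustrative only: the function names and the example predictions are assumptions, and labels are taken to be integers 0 to C-1. It uses the identity KL(p || uniform) = log C - H(p), so the two quantities carry the same information.

```python
import math
from collections import Counter

def label_distribution(preds, n_classes):
    """Empirical distribution of predicted labels over 0..n_classes-1."""
    counts = Counter(preds)
    n = len(preds)
    return [counts[c] / n for c in range(n_classes)]

def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0)

def kl_from_uniform(p):
    """KL(p || uniform) = log(C) - H(p), where C is the number of classes."""
    return math.log(len(p)) - entropy(p)

# Hypothetical predictions over six CEFR levels.
uniform_like = [0, 1, 2, 3, 4, 5] * 10  # spread evenly across levels
peaked = [2] * 50 + [3] * 10            # concentrated on two levels

h_uniform = entropy(label_distribution(uniform_like, 6))
kl_uniform = kl_from_uniform(label_distribution(uniform_like, 6))
kl_peaked = kl_from_uniform(label_distribution(peaked, 6))
```

Entropy near log C together with KL near zero indicates that the predictions are spread evenly over the label set. A concentrated prediction distribution gives a clearly positive KL.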

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation

The paper is a self-contained empirical study that trains linear and non-linear probes on hidden states from Qwen3-Embedding models to predict CEFR levels, then measures in-distribution QWK performance and cross-corpus collapse. No derivations, equations, or fitted parameters are presented as predictions; all results are direct measurements on held-out data splits. The central claim follows from observed performance differences and residual analysis rather than any self-definition, self-citation chain, or ansatz. No load-bearing steps reduce to inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The claim rests on standard probing methodology and the assumption that CEFR levels are comparable across corpora; no new entities are introduced and free parameters are limited to experimental design choices.

free parameters (2)

probe architecture variants: Five probing architectures (linear and non-linear) were selected and compared by the authors.
model sizes: Three sizes of Qwen3-Embedding (0.6B, 4B, 8B) were tested.

axioms (1)

domain assumption: CEFR proficiency levels constitute a consistent, language-general target variable across different learner corpora and rating methodologies.
The paper uses CEFR levels as ground truth without additional validation of cross-corpus comparability.

pith-pipeline@v0.9.0 · 5476 in / 1377 out tokens · 53057 ms · 2026-05-10T18:50:03.773011+00:00 · methodology
