arxiv: 2606.24501 · v1 · pith:XIDGC5FEnew · submitted 2026-06-23 · 💻 cs.CL

UOL@IDEM at BEA 2026 Shared Task 1: Neural Fusion and Feature-Rich Modeling for L1-Aware Vocabulary Difficulty Prediction

Pith reviewed 2026-06-25 23:59 UTC · model grok-4.3

classification 💻 cs.CL
keywords L1-aware vocabulary difficulty · regression · multilingual embeddings · feature engineering · shared task · BEA 2026

The pith

Multilingual sentence embeddings fused with frequency and cognate features improve L1-aware vocabulary difficulty regression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a closed-track system for the BEA 2026 shared task that predicts how difficult individual words are for learners whose first language is Spanish, German, or Chinese. It treats the problem as regression and trains a separate model per language, each combining large multilingual sentence encoders with a small set of engineered features. The features capture word frequency, surface-form similarity to the learner language, retrieval evidence, semantic alignment, cognate overlap, and masked-language-model predictability. Development experiments show consistent gains over the official baselines, and the submitted runs reach RMSE values of 1.132, 1.037, and 0.891 for Spanish, German, and Chinese, respectively. Frequency emerges as the single most reliable predictor, while the remaining signals supply complementary L1-specific information.

Core claim

Our system combines multilingual contextual representations with engineered features capturing frequency, surface form, retrieval evidence, semantic alignment, cognate similarity, and masked-language-model predictability. Development results show consistent gains over the official closed-track baselines, with sentence-embedding encoders such as BGE-M3, multilingual E5, and LaBSE performing best. Official submissions achieve RMSE scores of 1.132, 1.037, and 0.891 for Spanish, German, and Chinese, respectively. Feature analysis identifies frequency as the most stable predictor, while contextual predictability, form similarity, retrieval, and semantic features provide complementary L1-sensitive signals.
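
The claim names masked-language-model predictability without saying how it is computed. One common reading, offered here as an assumption rather than the authors' definition, is the surprisal of the target word under a masked language model. The symbols below (w for the target word, c for its context, c with w masked out) are hypothetical notation introduced for this sketch.

```latex
% Assumed definition: surprisal of target word w given its context c,
% where c_{\setminus w} is the context with w replaced by a mask token.
\mathrm{surprisal}(w \mid c) = -\log P_{\mathrm{MLM}}\bigl(w \mid c_{\setminus w}\bigr)
```

Under this reading, a higher value means the word is less predictable from its context. Words split into several subword tokens would need some aggregation rule, such as summing the per-token log-probabilities, and the page does not say which rule the system uses.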

What carries the argument

Fusion of sentence-embedding encoders (BGE-M3, multilingual E5, LaBSE) with six L1-sensitive engineered features for regression.
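
A minimal sketch of what such a fusion can look like, under loud assumptions: the embedding is treated as a fixed vector, the six features are z-scored and concatenated to it, and a closed-form ridge regressor stands in for the paper's neural fusion layer. The function names (`fuse`, `fit_ridge`, `predict`) and all data are hypothetical; random vectors replace real encoder output.

```python
import numpy as np

def fuse(embeddings, features):
    """Concatenate embeddings with z-scored engineered features."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-8  # guard against zero variance
    return np.hstack([embeddings, (features - mu) / sigma])

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b

def predict(X, w, b):
    return X @ w + b

# Synthetic stand-ins: none of this is shared-task data.
rng = np.random.default_rng(0)
n, d_emb, d_feat = 200, 16, 6
emb = rng.normal(size=(n, d_emb))      # placeholder for encoder output
feats = rng.normal(size=(n, d_feat))   # placeholder for the six features
# Toy target: feature 0 plays the role of a dominant predictor.
y = 3.0 - 0.8 * feats[:, 0] + 0.1 * emb[:, 0] + rng.normal(scale=0.1, size=n)

X = fuse(emb, feats)
w, b = fit_ridge(X, y)
rmse = float(np.sqrt(np.mean((predict(X, w, b) - y) ** 2)))
print(f"train RMSE on synthetic data: {rmse:.3f}")
```

In a real pipeline the feature means and standard deviations would be estimated on the training split and reused for development and test items; the sketch scales everything in one pass only to stay short.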

If this is right

  • Frequency remains the single most stable predictor across the three target languages.
  • Contextual predictability and cognate similarity add measurable L1-specific value on top of raw embeddings.
  • The resulting models rank items reliably but tend to over-predict difficulty for the easiest words.
  • Separate per-language models outperform a single multilingual model under the closed-track constraints.
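
To make the form-similarity and cognate signals concrete, here is a sketch of one plausible surface feature: a normalized edit-distance similarity between a target word and its L1 translation. This is an illustration, not the paper's feature definition; `levenshtein`, `form_similarity`, and the word pairs are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def form_similarity(word: str, l1_word: str) -> float:
    """Return 1.0 for identical strings, lower values for dissimilar forms."""
    a, b = word.lower(), l1_word.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

# Illustrative pairs only: (target word, hypothetical L1 translation).
pairs = [("animal", "animal"), ("house", "casa")]
scores = {pair: form_similarity(*pair) for pair in pairs}
```

A string-based measure like this only makes sense when both words share a script. For an L1 written in a different script, some romanization or phonetic mapping step would be needed first, which this sketch does not cover.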

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feature set could be tested on additional learner languages without retraining the encoders from scratch.
  • Error patterns on easy items suggest that calibration adjustments might further reduce RMSE without changing the feature set.
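
The calibration idea in the last bullet can be sketched as a post-hoc linear rescaling fitted on development predictions. This is an editorial illustration, not something the paper reports trying; the helper names and the toy 1-to-5 scale are assumptions.

```python
from statistics import fmean

def fit_linear_recalibration(pred, gold):
    """Least-squares a, b such that a * pred + b approximates gold."""
    mp, mg = fmean(pred), fmean(gold)
    var_p = sum((p - mp) ** 2 for p in pred)
    if var_p == 0:
        return 1.0, mg - mp  # constant predictions: shift only
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    a = cov / var_p
    return a, mg - a * mp

def apply_recalibration(pred, a, b):
    return [a * p + b for p in pred]

def rmse(pred, gold):
    return fmean([(p - g) ** 2 for p, g in zip(pred, gold)]) ** 0.5

# Toy development data: predictions compressed toward the centre,
# so the easiest item (gold 1.0) is over-predicted.
dev_gold = [1.0, 2.0, 3.0, 4.0, 5.0]
dev_pred = [2.0, 2.5, 3.0, 3.5, 4.0]

a, b = fit_linear_recalibration(dev_pred, dev_gold)
calibrated = apply_recalibration(dev_pred, a, b)
before, after = rmse(dev_pred, dev_gold), rmse(calibrated, dev_gold)
```

A slope above 1 stretches compressed predictions back toward the ends of the scale. Whether that helps on the real data depends on how linear the compression is and on whether the mapping, fitted on development items, transfers to the test set.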

Load-bearing premise

The selected engineered features supply complementary L1-sensitive signals that produce measurable gains over the provided baselines.

What would settle it

A replication in which the same sentence encoders plus the six engineered features yield equal or higher RMSE than the official baselines on the hidden test set.
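
The test above reduces to a single comparison of root-mean-square errors. A small sketch, with made-up numbers and a hypothetical helper name `claim_is_refuted`:

```python
import math

def rmse(pred, gold):
    """Root-mean-square error between predictions and gold scores."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and equally long")
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

def claim_is_refuted(system_pred, baseline_pred, gold):
    """True when the system's RMSE is equal to or higher than the baseline's."""
    return rmse(system_pred, gold) >= rmse(baseline_pred, gold)

# Toy data, not shared-task results.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
system_pred = [1.5, 2.2, 3.0, 3.8, 4.6]
baseline_pred = [3.0, 3.0, 3.0, 3.0, 3.0]  # a constant-prediction baseline

print(claim_is_refuted(system_pred, baseline_pred, gold))
```

A single point comparison says nothing about sampling noise, so a replication would normally pair this check with an uncertainty estimate.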

Figures

Figures reproduced from arXiv: 2606.24501 by Nouran Khallaf, Serge Sharoff.

Figure 1
Figure 1. The language-specific neural fusion models broadly follow the gold difficulty scale, but with a clear compression toward the centre. This compression is visible in the solid mean-bias line, where bias is defined as predicted difficulty minus gold difficulty. In Band 1, which contains the easiest items, the bias is strongly positive: +1.21 for German, +1.28 for Spanish, and +1.03 for Chinese. …
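
The mean-bias line described in this caption can be reproduced in a few lines, taking bias as predicted minus gold difficulty as the caption defines it. How the paper assigns items to bands is not stated here, so the sketch takes the band label as a given input; `mean_bias_by_band` and the data are hypothetical.

```python
from collections import defaultdict

def mean_bias_by_band(pred, gold, band):
    """Average (predicted - gold) per band; positive means over-prediction."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for p, g, b in zip(pred, gold, band):
        sums[b] += p - g
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# Toy items showing compression toward the centre of the scale.
gold = [1.0, 1.0, 3.0, 3.0, 5.0, 5.0]
pred = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
band = [1, 1, 3, 3, 5, 5]  # assumed: band labels supplied with the data

bias = mean_bias_by_band(pred, gold, band)
```

With these toy numbers the lowest band has a positive mean bias and the highest band a negative one, which is the compression pattern the caption describes.
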
Figure 2
Figure 2. Matched low-error versus high-error contrasts within language and difficulty band. Positive values indicate features higher in low-error items; negative values indicate features higher in high-error items. … textual representations with engineered features capturing frequency, lexical surface form, surprisal, retrieval evidence, cognate-like similarity, and semantic alignment. Across the development set, ne…
Original abstract

This paper describes UOL@IDEM's closed-track submission to the BEA 2026 shared task on L1-aware vocabulary difficulty prediction. We model the task as regression and train separate systems for Spanish, German, and Mandarin Chinese (referred to below as Chinese for brevity). Our system combines multilingual contextual representations with engineered features capturing frequency, surface form, retrieval evidence, semantic alignment, cognate similarity, and masked-language-model predictability. Development results show consistent gains over the official closed-track baselines, with sentence-embedding encoders such as BGE-M3, multilingual E5, and LaBSE performing best. Official submissions achieve RMSE scores of 1.132, 1.037, and 0.891 for Spanish, German, and Chinese, respectively. Feature analysis identifies frequency as the most stable predictor, while contextual predictability, form similarity, retrieval, and semantic features provide complementary L1-sensitive signals. Error analysis shows strong ranking performance but weaker calibration for the easiest items, which are often overpredicted. See https://github.com/Nouran-Khallaf/UoL-IDEM-BEA2026-Vocabulary-Difficulty-Prediction

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper describes UOL@IDEM's closed-track submission to the BEA 2026 Shared Task 1 on L1-aware vocabulary difficulty prediction. The authors frame the task as regression and train separate models for Spanish, German, and Chinese. Their approach fuses multilingual sentence embeddings (BGE-M3, multilingual E5, LaBSE) with engineered features capturing frequency, surface form, retrieval evidence, semantic alignment, cognate similarity, and masked-language-model predictability. Development results show gains over official closed-track baselines; official submissions report RMSE values of 1.132 (Spanish), 1.037 (German), and 0.891 (Chinese). Feature analysis identifies frequency as the most stable predictor while the remaining features supply complementary L1-sensitive signals. Error analysis notes strong ranking performance but weaker calibration on the easiest items.

Significance. If the reported RMSE gains and feature rankings hold under proper controls, the work supplies a competitive system description and concrete evidence that frequency remains dominant while contextual and L1-specific features add value. The explicit identification of complementary signals and the error patterns on easy items are useful observations for the lexical difficulty modeling community.

minor comments (3)
  1. [Abstract / Feature analysis] The abstract states that 'contextual predictability, form similarity, retrieval, and semantic features provide complementary L1-sensitive signals' but does not report the quantitative ablation or permutation importance values that would substantiate the complementarity claim; add these numbers (with confidence intervals) in the feature-analysis section.
  2. [Development results] The manuscript reports official RMSE scores but does not indicate whether the development-set gains were obtained with a single fixed train/dev split or with cross-validation; clarify the evaluation protocol and any statistical significance tests performed against the baselines.
  3. [Abstract] The GitHub link appears inline in the abstract; move it to a footnote or proper reference entry for conventional formatting.
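
The significance testing requested in comment 2 could take the form of a paired bootstrap over items. The sketch below estimates a percentile confidence interval for the RMSE difference between a system and a baseline; it is one reasonable protocol, not the one the authors used, and the function name, resample count, and data are all assumptions.

```python
import math
import random

def rmse(pred, gold):
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

def paired_bootstrap_rmse_diff(system, baseline, gold,
                               n_resamples=2000, level=0.95, seed=0):
    """Percentile CI for RMSE(system) - RMSE(baseline) over resampled items."""
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        s = [system[i] for i in idx]
        b = [baseline[i] for i in idx]
        diffs.append(rmse(s, g) - rmse(b, g))
    diffs.sort()
    tail = (1.0 - level) / 2.0
    low = diffs[int(tail * n_resamples)]
    high = diffs[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return low, high

# Toy data: the system's errors are smaller than the baseline's on every item.
gold = [float(1 + i % 5) for i in range(60)]
sys_err = [0.1, -0.2, 0.3]
base_err = [0.8, -1.0, 1.2]
system = [g + sys_err[i % 3] for i, g in enumerate(gold)]
baseline = [g + base_err[i % 3] for i, g in enumerate(gold)]

ci_low, ci_high = paired_bootstrap_rmse_diff(system, baseline, gold)
```

An interval that lies entirely below zero would support the claim that the system improves on the baseline for that item sample. The same resampling loop could wrap a permutation-importance score to produce the intervals requested in comment 1.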

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity detected

The paper is a standard shared-task system description reporting regression performance (RMSE) on development and official test sets for L1-aware vocabulary difficulty prediction. It combines sentence embeddings with engineered features and notes frequency as the strongest predictor with others providing complementary signals. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. Results are presented as empirical outcomes on held-out shared-task data rather than any derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claim rests on the validity of the shared-task data splits, the appropriateness of RMSE as the metric, and the assumption that the listed feature categories add non-redundant L1-sensitive information; standard supervised-learning fitting is also required.

free parameters (1)
  • regression fusion weights and embedding fine-tuning parameters
    The neural fusion and feature combination are learned from data and therefore constitute fitted parameters whose values are not reported in the abstract.
axioms (1)
  • domain assumption: The official closed-track baselines and the shared-task test data constitute a fair and stable evaluation of L1-aware difficulty prediction.
    The paper treats the competition setup as the reference point for claiming gains.
