
arxiv: 2605.14257 · v2 · pith:JXNDUWUHnew · submitted 2026-05-14 · 💻 cs.CL

Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult?

Pith reviewed 2026-05-22 10:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords vocabulary difficulty · LLM fine-tuning · explainable model · language assessment · shared task · spelling effects · test construction

The pith

Fine-tuned LLMs rate vocabulary difficulty at correlations above 0.91 while explainable versions show spelling and test design drive much of the observed difficulty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops two models for predicting vocabulary item difficulty from learner data. A black-box approach fine-tunes a large language model with a soft-target loss to reach the highest correlation in the shared task open track. An explainable model keeps strong performance and surfaces the specific factors that make individual items hard. Analysis of the British Council vocabulary lists indicates that spelling difficulty and test-item construction frequently affect difficulty ratings in addition to the genuine challenge of producing the target word.

Core claim

We fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words.

What carries the argument

Soft-target loss applied during LLM fine-tuning for the rating task, paired with an explainable model that isolates spelling, test-construction, and production factors.
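The summary does not spell out how the soft-target loss is defined, so the following is only a minimal sketch of one common formulation, not the authors' implementation. It assumes the continuous difficulty rating is spread over its two nearest points on a discrete scale, the model emits one logit per scale point, and the loss is the cross-entropy against that soft distribution. The scale `BINS` and all function names are hypothetical.

```python
import math

# Hypothetical discrete rating scale; the real task scale may differ.
BINS = [1.0, 2.0, 3.0, 4.0, 5.0]

def soft_targets(rating, bins):
    """Spread a continuous rating over its two nearest bin centers."""
    rating = min(max(rating, bins[0]), bins[-1])  # clamp to the scale
    target = [0.0] * len(bins)
    for i in range(len(bins) - 1):
        lo, hi = bins[i], bins[i + 1]
        if lo <= rating <= hi:
            w = (rating - lo) / (hi - lo)
            target[i] += 1.0 - w
            target[i + 1] += w
            break
    return target

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(logits, target):
    """Cross-entropy between the soft target and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def expected_rating(logits, bins):
    """Decode a continuous prediction as the probability-weighted mean bin."""
    return sum(p * b for p, b in zip(softmax(logits), bins))

target = soft_targets(2.25, BINS)
good_loss = soft_target_loss([0.0, 5.0, 4.0, 0.0, 0.0], target)
bad_loss = soft_target_loss([5.0, 0.0, 0.0, 0.0, 0.0], target)
print(target, good_loss, bad_loss)
```

Compared with a one-hot label, a soft target of this kind keeps the information that a rating of 2.25 is closer to 2 than to 3, which is one plausible reason such a loss suits a rating task.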

If this is right

  • Fine-tuned LLMs can produce vocabulary difficulty ratings that align closely with human judgments.
  • Explainable models can identify whether difficulty arises mainly from spelling, test format, or actual word production.
  • Vocabulary list construction can be improved by separating genuine production difficulty from confounding factors.
  • Shared-task data of this kind supports training systems that flag non-production sources of difficulty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same approach could help redesign language tests so that scores better reflect true productive knowledge rather than orthographic or formatting artifacts.
  • Controlled experiments that hold spelling and format constant would allow direct measurement of how much each factor contributes to current difficulty ratings.
  • Insights from the explainable model might transfer to other educational domains where perceived difficulty mixes content knowledge with surface features.

Load-bearing premise

The shared task data from the British Council's Knowledge-based Vocabulary Lists supplies a clean signal of genuine production difficulty separate from spelling or test-construction effects.

What would settle it

A new set of vocabulary items presented without spelling cues or with uniform test formats where the model's high correlation with human ratings disappears.

Figures

Figures reproduced from arXiv: 2605.14257 by Adam Nohejl, Hitomi Yanaka, Maria Angelica Riera Machin, Xuanxin Wu, Yi-Ning Chang, Yusuke Ide.

Figure 1: Global SHAP summaries by L1.
… feature for Spanish. Perhaps counter-intuitively, it is much less important for L1 Chinese. We hypothesize this is caused by two factors. First, the production frequency uses learner-written texts, and therefore it partially discounts the frequency of words with frequent mistakes. As a result, the importance of the separate spelling difficulty feature is lower proportionally t…
Figure 2: Example of local SHAP explanations by L1.
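To make the idea of a local SHAP explanation concrete, here is a minimal sketch under strong assumptions. It uses a linear model as a stand-in (the summary does not say which model the authors explain), for which the Shapley value of feature i has the closed form w_i * (x_i - mean(x_i)) when features are treated as independent. The feature names, weights, and data are all hypothetical.

```python
# Hypothetical features of a vocabulary item.
FEATURES = ["spelling_difficulty", "log_production_frequency"]

def predict(weights, bias, x):
    """Stand-in linear difficulty model: f(x) = bias + sum(w_i * x_i)."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def linear_shap(weights, x, background):
    """Shapley values for a linear model, assuming independent features."""
    n = len(background)
    means = [sum(row[i] for row in background) / n for i in range(len(weights))]
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]

weights, bias = [0.8, -1.5], 3.0
background = [[0.1, 2.0], [0.5, 4.0], [0.9, 6.0]]  # toy reference items
item = [0.9, 2.0]  # hard to spell, rarely produced

phi = linear_shap(weights, item, background)
baseline = sum(predict(weights, bias, row) for row in background) / len(background)
for name, value in zip(FEATURES, phi):
    print(f"{name}: {value:+.2f}")
```

The attributions sum to the gap between this item's prediction and the average prediction over the background items, which is the additivity property that makes per-item explanations of this kind readable.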
Original abstract

We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/ynklab/vocabulary-difficulty.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript reports on participation in the BEA 2026 Shared Task 1 for vocabulary difficulty prediction using the British Council's Knowledge-based Vocabulary Lists (KVL). It presents a black-box model consisting of an LLM fine-tuned with a soft-target loss that achieves Pearson correlation r > 0.91 and ranks first in the open track, together with an explainable model that outperforms a fine-tuned encoder baseline at r > 0.77 while supplying feature attributions. The authors further analyze the KVL ratings and conclude that item difficulty is influenced by spelling difficulty and test-item construction in addition to genuine production difficulty. Code is released at https://github.com/ynklab/vocabulary-difficulty.

Significance. If the empirical results and analysis hold, the work supplies competitive shared-task performance together with an interpretable model and code release that supports reproducibility. The explicit discussion of potential confounds (spelling and test construction) in the KVL data could help future vocabulary-assessment research, provided the separation of these factors from genuine production difficulty is demonstrated quantitatively.

major comments (1)
  1. [Analysis section] Analysis section (abstract and results discussion): The central claim that KVL difficulty 'is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty' is load-bearing for the 'insights' contribution. No quantitative decomposition, control set, or inter-rater protocol is described that isolates genuine production difficulty from orthographic or item-format artifacts. If the human ratings already embed these confounds, both the explainable model's attributions and the stated insights become difficult to interpret as cleanly separating the intended factors.
minor comments (2)
  1. [Abstract] Abstract and evaluation sections: Exact evaluation metric (Pearson r), data-split details, error bars, and the precise baseline architectures are not reported, making it hard to assess the strength of the r > 0.91 and r > 0.77 figures.
  2. [Methods] Methods: The architecture and training details of the explainable model (feature set, attribution method) are only sketched; fuller specification would aid replication.
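The first minor comment asks for the exact metric and for error bars. As an illustration only, the sketch below computes Pearson's r and a percentile bootstrap interval in plain Python. The ratings are invented toy values, not shared-task data, and the resampling settings (`n_resamples`, `alpha`, `seed`) are arbitrary assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def bootstrap_ci(xs, ys, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for r, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        sx = [xs[i] for i in idx]
        sy = [ys[i] for i in idx]
        # Skip degenerate resamples where r is undefined (zero variance).
        if len(set(sx)) < 2 or len(set(sy)) < 2:
            continue
        stats.append(pearson_r(sx, sy))
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi

# Toy gold ratings and predictions, for illustration only.
gold = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
pred = [1.2, 1.9, 3.4, 3.8, 5.1, 5.7, 7.3, 7.9]

r = pearson_r(gold, pred)
lo, hi = bootstrap_ci(gold, pred)
print(f"r = {r:.3f}, 95% bootstrap interval = [{lo:.3f}, {hi:.3f}]")
```

Reporting an interval alongside a point estimate such as r > 0.91 would show how much of a gap between systems could be due to the particular test items sampled.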

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for identifying a key point about the strength of the claims in our analysis section. We respond to the major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Analysis section] Analysis section (abstract and results discussion): The central claim that KVL difficulty 'is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty' is load-bearing for the 'insights' contribution. No quantitative decomposition, control set, or inter-rater protocol is described that isolates genuine production difficulty from orthographic or item-format artifacts. If the human ratings already embed these confounds, both the explainable model's attributions and the stated insights become difficult to interpret as cleanly separating the intended factors.

    Authors: We acknowledge that our analysis relies on qualitative inspection of cases and feature attributions from the explainable model rather than a formal quantitative decomposition, control set, or inter-rater protocol. The attributions consistently surface orthographic and item-construction signals as contributors to predicted difficulty for a substantial portion of items, which we interpret as evidence that these factors influence the KVL ratings beyond pure production difficulty. We agree that this does not constitute a controlled isolation of factors and that the human ratings may embed confounds. In the revised manuscript we will (a) explicitly describe the analysis as observational and attribution-based, (b) report aggregate statistics on the relative contribution of spelling-related features across the test set, and (c) add a limitations paragraph discussing the difficulty of cleanly separating the intended factors given the rating protocol. These changes will make the scope and evidential basis of the insights transparent without overstating the separation achieved. revision: yes
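Item (b) of the promised revision, aggregate statistics on the relative contribution of spelling-related features, could take a form like the sketch below. This is an assumption about what such a statistic might look like, not the authors' procedure: it sums absolute per-item attributions within feature groups and reports each group's share of the total. Feature names, group names, and attribution values are hypothetical.

```python
# Hypothetical mapping from features to the factor they are meant to capture.
GROUPS = {
    "spelling_difficulty": "spelling",
    "production_frequency": "production",
    "item_format": "test_construction",
}

def group_shares(attributions, groups):
    """Share of total absolute attribution carried by each feature group.

    attributions: one dict per item, mapping feature name -> signed attribution.
    """
    totals = {}
    for item in attributions:
        for feature, value in item.items():
            group = groups[feature]
            totals[group] = totals.get(group, 0.0) + abs(value)
    grand_total = sum(totals.values())
    return {group: total / grand_total for group, total in totals.items()}

# Toy per-item attributions, for illustration only.
attributions = [
    {"spelling_difficulty": 0.3, "production_frequency": -0.5, "item_format": 0.2},
    {"spelling_difficulty": -0.1, "production_frequency": 0.7, "item_format": 0.2},
]

shares = group_shares(attributions, GROUPS)
for group, share in sorted(shares.items()):
    print(f"{group}: {share:.0%}")
```

A statistic like this describes what the model relies on, so it remains observational: it would not by itself isolate genuine production difficulty from confounds already embedded in the ratings, which is the referee's underlying concern.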

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning and external shared-task evaluation

full rationale

The paper reports standard supervised learning results: fine-tuning an LLM with soft-target loss to obtain Pearson r > 0.91 on the BEA shared-task KVL labels, plus an explainable model reaching r > 0.77 whose attributions are used for post-hoc inspection of spelling and item-construction effects. All reported correlations are measured against the external test labels rather than any quantity defined by the model's own fitted parameters. No equations, self-definitional loops, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided text. The derivation is therefore self-contained against the shared-task benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard machine learning assumptions for fine-tuning and evaluation without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)
  • domain assumption: The shared task dataset and evaluation metrics are appropriate for measuring vocabulary difficulty prediction performance.
    Implicit in reporting r values and claiming top result in the open track.

pith-pipeline@v0.9.0 · 5697 in / 1417 out tokens · 37540 ms · 2026-05-22T10:28:20.652963+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77).

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
