Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult?
Pith reviewed 2026-05-22 10:28 UTC · model grok-4.3
The pith
Fine-tuned LLMs rate vocabulary difficulty at correlations above 0.91 while explainable versions show spelling and test design drive much of the observed difficulty.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words.
What carries the argument
Soft-target loss applied during LLM fine-tuning for the rating task, paired with an explainable model that isolates spelling, test-construction, and production factors.
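The abstract does not spell out how the soft-target loss is built, so the following is only a minimal sketch of one common way to do it, not the authors' implementation. It assumes that a scalar difficulty rating is spread over a small set of discrete bins, that the model emits one logit per bin, and that the loss is the cross-entropy between the two distributions. The names `soft_targets`, `soft_target_loss`, `expected_rating`, the bin centers, and the `temperature` parameter are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_targets(rating, bin_centers, temperature=0.5):
    """Spread a scalar rating over bins; nearer bins get more mass.

    Assumption: mass decays with squared distance to the bin center.
    """
    return softmax([-((rating - c) ** 2) / temperature for c in bin_centers])

def soft_target_loss(logits, target_probs):
    """Cross-entropy between the soft target and the model's softmax output."""
    probs = softmax(logits)
    return -sum(t * math.log(max(p, 1e-12)) for t, p in zip(target_probs, probs))

def expected_rating(logits, bin_centers):
    """Decode a scalar prediction as the probability-weighted bin center."""
    return sum(p * c for p, c in zip(softmax(logits), bin_centers))

# Hypothetical usage: five difficulty bins, gold rating exactly on the middle bin.
bins = [1.0, 2.0, 3.0, 4.0, 5.0]
target = soft_targets(3.0, bins)
uniform_loss = soft_target_loss([0.0] * 5, target)
matched_loss = soft_target_loss([math.log(p) for p in target], target)
```

Compared with a one-hot label, a target of this shape penalizes a prediction in a neighboring bin less than one several bins away, which is one plausible reason a soft target suits an ordinal rating task.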
If this is right
- Fine-tuned LLMs can produce vocabulary difficulty ratings that align closely with human judgments.
- Explainable models can identify whether difficulty arises mainly from spelling, test format, or actual word production.
- Vocabulary list construction can be improved by separating genuine production difficulty from confounding factors.
- Shared-task data of this kind supports training systems that flag non-production sources of difficulty.
Where Pith is reading between the lines
- The same approach could help redesign language tests so that scores better reflect true productive knowledge rather than orthographic or formatting artifacts.
- Controlled experiments that hold spelling and format constant would allow direct measurement of how much each factor contributes to current difficulty ratings.
- Insights from the explainable model might transfer to other educational domains where perceived difficulty mixes content knowledge with surface features.
Load-bearing premise
The shared task data from the British Council's Knowledge-based Vocabulary Lists supplies a clean signal of genuine production difficulty separate from spelling or test-construction effects.
What would settle it
A new set of vocabulary items, presented without spelling cues or in a uniform test format, on which the model's high correlation with human ratings disappears.
Original abstract
We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/ynklab/vocabulary-difficulty.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports on participation in the BEA 2026 Shared Task 1 for vocabulary difficulty prediction using the British Council's Knowledge-based Vocabulary Lists (KVL). It presents a black-box model consisting of an LLM fine-tuned with a soft-target loss that achieves Pearson correlation r > 0.91 and ranks first in the open track, together with an explainable model that outperforms a fine-tuned encoder baseline at r > 0.77 while supplying feature attributions. The authors further analyze the KVL ratings and conclude that item difficulty is influenced by spelling difficulty and test-item construction in addition to genuine production difficulty. Code is released at https://github.com/ynklab/vocabulary-difficulty.
Significance. If the empirical results and analysis hold, the work supplies competitive shared-task performance together with an interpretable model and code release that supports reproducibility. The explicit discussion of potential confounds (spelling and test construction) in the KVL data could help future vocabulary-assessment research, provided the separation of these factors from genuine production difficulty is demonstrated quantitatively.
major comments (1)
- [Analysis section] Analysis section (abstract and results discussion): The central claim that KVL difficulty 'is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty' is load-bearing for the 'insights' contribution. No quantitative decomposition, control set, or inter-rater protocol is described that isolates genuine production difficulty from orthographic or item-format artifacts. If the human ratings already embed these confounds, both the explainable model's attributions and the stated insights become difficult to interpret as cleanly separating the intended factors.
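One simple form the requested quantitative decomposition could take is a hierarchical (incremental R²) analysis. The sketch below is illustrative only and is not taken from the paper: it assumes one numeric spelling score and one numeric production score per item, both hypothetical, and asks how much variance in item difficulty the production score explains beyond what spelling already explains.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def residualize(target, control):
    """Residuals of `target` after a one-variable least-squares fit on `control`."""
    n = len(target)
    mean_t, mean_c = sum(target) / n, sum(control) / n
    sxx = sum((c - mean_c) ** 2 for c in control)
    sxy = sum((c - mean_c) * (t - mean_t) for c, t in zip(control, target))
    slope = sxy / sxx
    return [(t - mean_t) - slope * (c - mean_c) for c, t in zip(control, target)]

def incremental_r2(difficulty, spelling, production):
    """Return (R² from spelling alone, extra R² added by production).

    The second value is the squared semipartial correlation, so the two
    values sum to the R² of a two-predictor linear regression.
    """
    r2_spelling = pearson_r(difficulty, spelling) ** 2
    production_resid = residualize(production, spelling)
    r2_added = pearson_r(difficulty, production_resid) ** 2
    return r2_spelling, r2_added
```

The split depends on entry order: any variance shared by the two predictors is credited to whichever enters first, so a real analysis would report both orders. The function also raises if the production score is perfectly predicted by the spelling score.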
minor comments (2)
- [Abstract] Abstract and evaluation sections: Exact evaluation metric (Pearson r), data-split details, error bars, and the precise baseline architectures are not reported, making it hard to assess the strength of the r > 0.91 and r > 0.77 figures.
- [Methods] Methods: The architecture and training details of the explainable model (feature set, attribution method) are only sketched; fuller specification would aid replication.
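The error bars requested in the first minor comment could be obtained in several ways. As a sketch under stated assumptions (not the shared task's official scoring script), the code below computes Pearson r and a percentile bootstrap confidence interval by resampling items with replacement. The function names, the number of resamples, and the fixed seed are illustrative choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def bootstrap_ci(gold, pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Pearson r over resampled items."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(pearson_r([gold[i] for i in idx],
                                   [pred[i] for i in idx]))
        except ValueError:
            continue  # skip degenerate resamples with no variance
    if not stats:
        raise ValueError("no valid bootstrap resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi
```

Reporting the interval alongside the point estimate would show whether a gap such as 0.91 versus 0.77 is large relative to sampling noise on the test set.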
Simulated Author's Rebuttal
We thank the referee for their careful reading and for identifying a key point about the strength of the claims in our analysis section. We respond to the major comment below and indicate the revisions we will make.
Referee: [Analysis section] Analysis section (abstract and results discussion): The central claim that KVL difficulty 'is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty' is load-bearing for the 'insights' contribution. No quantitative decomposition, control set, or inter-rater protocol is described that isolates genuine production difficulty from orthographic or item-format artifacts. If the human ratings already embed these confounds, both the explainable model's attributions and the stated insights become difficult to interpret as cleanly separating the intended factors.
Authors: We acknowledge that our analysis relies on qualitative inspection of cases and feature attributions from the explainable model rather than a formal quantitative decomposition, control set, or inter-rater protocol. The attributions consistently surface orthographic and item-construction signals as contributors to predicted difficulty for a substantial portion of items, which we interpret as evidence that these factors influence the KVL ratings beyond pure production difficulty. We agree that this does not constitute a controlled isolation of factors and that the human ratings may embed confounds. In the revised manuscript we will (a) explicitly describe the analysis as observational and attribution-based, (b) report aggregate statistics on the relative contribution of spelling-related features across the test set, and (c) add a limitations paragraph discussing the difficulty of cleanly separating the intended factors given the rating protocol. These changes will make the scope and evidential basis of the insights transparent without overstating the separation achieved.
Revision: yes
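The aggregate statistic promised in point (b) could be computed along the following lines. This is a hedged sketch rather than the authors' procedure: it assumes each test item comes with signed per-feature attributions (for example from an additive attribution method) and that every feature has been assigned by hand to a group. The feature names and group labels below are invented for illustration.

```python
def group_attribution_shares(attributions, feature_groups):
    """Share of total absolute attribution carried by each feature group.

    attributions: list of dicts mapping feature name -> signed attribution,
                  one dict per test item.
    feature_groups: dict mapping feature name -> group label; features not
                    listed fall into the group "other".
    """
    totals = {}
    for item in attributions:
        for feature, value in item.items():
            group = feature_groups.get(feature, "other")
            totals[group] = totals.get(group, 0.0) + abs(value)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {group: 0.0 for group in totals}
    return {group: total / grand_total for group, total in totals.items()}

# Hypothetical feature names and groups, for illustration only.
groups = {
    "spelling_irregularity": "spelling",
    "distractor_overlap": "test_construction",
    "word_frequency": "production",
}
items = [
    {"spelling_irregularity": 0.3, "word_frequency": -0.1},
    {"spelling_irregularity": -0.1, "distractor_overlap": 0.5},
]
shares = group_attribution_shares(items, groups)
```

Absolute values are used so that positive and negative contributions do not cancel. Such a share describes what the model relies on, which is not the same as a causal estimate of what makes an item hard.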
Circularity Check
No circularity: empirical fine-tuning and external shared-task evaluation
The paper reports standard supervised learning results: fine-tuning an LLM with soft-target loss to obtain Pearson r > 0.91 on the BEA shared-task KVL labels, plus an explainable model reaching r > 0.77 whose attributions are used for post-hoc inspection of spelling and item-construction effects. All reported correlations are measured against the external test labels rather than any quantity defined by the model's own fitted parameters. No equations, self-definitional loops, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided text. The derivation is therefore self-contained against the shared-task benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The shared task dataset and evaluation metrics are appropriate for measuring vocabulary difficulty prediction performance.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.