What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction
Pith reviewed 2026-05-15 02:56 UTC · model grok-4.3
The pith
Spelling difficulty and test-item construction often inflate ratings in standard vocabulary difficulty lists beyond the genuine production demands of the words.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The difficulty of items in the British Council's Knowledge-based Vocabulary Lists is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words.
What carries the argument
Fine-tuned LLM with soft-target loss for black-box prediction, paired with an explainable model that decomposes per-item influences on difficulty ratings.
If this is right
- Vocabulary tests and lists should separate spelling and format effects from production difficulty to give cleaner signals for learners and teachers.
- Explainable models can flag which specific items in existing lists are likely inflated by non-production factors.
- Training data for future difficulty predictors should include explicit spelling and item-format annotations to reduce post-hoc capture of confounders.
Where Pith is reading between the lines
- Difficulty prediction systems may improve if they are trained to output separate scores for spelling load, item format, and core production effort rather than a single combined rating.
- This approach could transfer to other language assessment domains where surface features of test items distort underlying skill measures.
Load-bearing premise
The shared-task dataset and KVL lists are assumed to measure genuine word production difficulty, apart from the spelling and test-item-design confounds that the models later detect.
What would settle it
Re-rating a subset of KVL items after standardizing spelling presentation and removing item-construction cues, then checking whether the original difficulty scores remain unchanged.
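The proposed check amounts to a before/after correlation: re-rate items under standardized presentation and test whether the original scores survive. A minimal sketch, with purely illustrative numbers (not KVL data) and a hypothetical `pearson_r` helper:

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical ratings for the same items: original KVL-style scores vs.
# re-ratings after standardizing spelling presentation and removing
# item-construction cues (illustrative values only).
original = [1.2, 2.5, 3.1, 4.0, 4.8, 2.9]
rerated = [1.3, 2.4, 3.0, 3.2, 4.1, 2.8]
r = pearson_r(original, rerated)
```

A high `r` would suggest the original difficulty scores are robust to the intervention; a marked drop would indicate that spelling or format effects were carrying part of the original ratings.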
Original abstract
We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/adno/vocabulary-difficulty .
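The abstract does not specify the soft-target loss. A minimal sketch, assuming a distillation-style cross-entropy against a soft distribution over difficulty bins rather than a one-hot label (all function names and numbers below are illustrative, not from the paper):

```python
import math

def softmax(logits):
    """Convert raw scores to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(logits, target_dist):
    """Cross-entropy between a soft label distribution over difficulty
    bins and the model's predicted distribution (hypothetical sketch)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_dist, probs) if t > 0)

# A soft target spreads probability mass over neighboring difficulty bins,
# so a prediction on an adjacent bin is penalized less than a distant one.
target = [0.1, 0.7, 0.2, 0.0, 0.0]  # rating mass concentrated on bin 1
sharp = [0.0, 3.0, 0.0, 0.0, 0.0]   # logits favoring the correct bin
flat = [0.0, 0.0, 0.0, 0.0, 3.0]    # logits favoring a distant bin
```

Here `soft_target_loss(sharp, target)` is lower than `soft_target_loss(flat, target)`, which is the property that makes soft targets a natural fit for ordinal rating tasks.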
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents two models for the BEA 2026 Shared Task on Vocabulary Difficulty Prediction: a black-box LLM fine-tuned with a soft-target loss achieving r > 0.91 and topping the open track, and an explainable model reaching r > 0.77 that outperforms a fine-tuned encoder baseline. The explainable model is used to analyze KVL items, concluding that difficulty is affected by spelling difficulty and test-item construction in addition to genuine production difficulty. Code is released at https://github.com/adno/vocabulary-difficulty.
Significance. If the results and analysis hold, the work contributes a strong shared-task entry with reproducible code and an interpretable component that flags potential surface confounds in test-derived vocabulary difficulty labels. This has practical value for educational assessment design and for distinguishing production difficulty from orthographic or format effects in NLP applications.
major comments (1)
- [Abstract / Analysis] The central interpretive claim that KVL difficulty reflects spelling difficulty and test-item construction 'in addition to' genuine production difficulty lacks an independent anchor. Both models are trained directly on KVL-derived ratings, so the explainable model necessarily learns statistical associations present in those same labels; without a separate production-only gold standard (e.g., free-recall or cloze accuracy collected independently of the KVL test format), the analysis cannot distinguish additive effects from saturation of the labels by spelling and construction confounds.
minor comments (2)
- [Methods] Provide explicit details on train/dev/test splits, the precise encoder baseline architecture, hyperparameter choices for the soft-target loss, and any error analysis or ablation results to allow full verification of the reported correlations.
- [Model description] Explainable model: Clarify the feature set and how interpretability is achieved (e.g., which surface cues are explicitly modeled) so readers can assess whether the r > 0.77 performance genuinely isolates the claimed factors.
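One concrete form the requested clarification could take is an additive surface-feature predictor whose per-feature terms decompose each rating. The feature names and weights below are hypothetical illustrations, not taken from the paper:

```python
# Hypothetical additive explainable predictor: a linear model over surface
# features whose per-feature contributions decompose each prediction.
FEATURES = {
    "log_frequency": -0.8,      # frequent words assumed easier to produce
    "word_length": 0.15,        # longer spellings assumed harder
    "irregular_spelling": 0.6,  # orthographic irregularity adds difficulty
}
INTERCEPT = 3.0

def predict_with_explanation(item):
    """Return a difficulty score plus the per-feature terms that sum to it."""
    contributions = {
        name: weight * item.get(name, 0.0) for name, weight in FEATURES.items()
    }
    return INTERCEPT + sum(contributions.values()), contributions

score, parts = predict_with_explanation(
    {"log_frequency": 2.0, "word_length": 7, "irregular_spelling": 1.0}
)
```

With this structure, `parts` shows exactly how much each surface cue moved the rating, which is the kind of per-item decomposition the comment asks the authors to make explicit (the paper's actual feature set may differ).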
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the interpretive framing in the abstract and analysis requires careful qualification given the shared-task data constraints, and we will revise accordingly.
Point-by-point responses
- Referee: [Abstract / Analysis] The central interpretive claim that KVL difficulty reflects spelling difficulty and test-item construction 'in addition to' genuine production difficulty lacks an independent anchor. Both models are trained directly on KVL-derived ratings, so the explainable model necessarily learns statistical associations present in those same labels; without a separate production-only gold standard (e.g., free-recall or cloze accuracy collected independently of the KVL test format), the analysis cannot distinguish additive effects from saturation of the labels by spelling and construction confounds.
Authors: We accept this criticism. The analysis is correlational and draws exclusively from KVL-derived labels; no independent production-only gold standard (such as free-recall or cloze data) is available within the shared-task setting. The explainable model surfaces features (spelling complexity, item format) that are theoretically distinct and measurable independently of the labels themselves, but we cannot rule out that these features simply saturate the observed ratings. We will revise the abstract and analysis section to present the findings as evidence of potential surface confounds in KVL difficulty labels rather than claiming additive effects beyond genuine production difficulty. A new limitations paragraph will explicitly note the absence of an independent anchor and recommend future validation with production-only measures.
Revision: yes
Circularity Check
No significant circularity in model training or analysis
Full rationale
The paper trains supervised models (black-box LLM with soft-target loss and explainable model) directly on KVL-derived difficulty ratings and evaluates correlation against the shared-task held-out test set. No equations, derivations, or self-citations are presented that reduce any reported prediction or interpretive claim to its own inputs by construction. The post-hoc analysis of spelling and test-construction factors is an empirical observation from model features correlated with the provided labels, not a definitional or fitted-input circularity. Code release enables external reproduction, confirming the work is self-contained against the task benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- soft-target loss hyperparameters
axioms (1)
- Domain assumption: shared-task data splits and labels are representative of vocabulary difficulty
Reference graph
Works this paper leans on
- [1] Dennis Aumiller and Michael Gertz. 2022. UniHD at TSAR-2022 shared task: Is compute all we need for lexical simplification? In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 251--258, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational... https://doi.org/10.18653/v1/2022.tsar-1.28
- [2] BNC Consortium. 2007. British National Corpus, XML edition. https://llds.ling-phil.ox.ac.uk/llds/xmlui/handle/20.500.14106/2554
- [3] Annette Capel. 2012. Completing the English Vocabulary Profile: C1 and C2 vocabulary. English Profile Journal, 3:e1. https://doi.org/10.1017/S2041536212000013
- [4] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785--794, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/2939672.2939785
- [5] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Comp... https://doi.org/10.18653/v1/2020.acl-main.747
- [6] DeepSeek-AI. 2025. DeepSeek-V3 Technical Report. ArXiv preprint, arXiv:2412.19437v2 [cs.CL]. http://arxiv.org/abs/2412.19437v2
- [7] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36:10088--10115. https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html
- [8] Taisei Enomoto, Hwichan Kim, Tosho Hirasawa, Yoshinari Nagai, Ayako Sato, Kyotaro Nakajima, and Mamoru Komachi. 2024. TMU-HIT at MLSP 2024: How well can GPT-4 tackle multilingual lexical simplification? In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), ... https://aclanthology.org/2024.bea-1.52/
- [9] Mariano Felice and Lucy Skidmore. 2026. Findings of the BEA 2026 shared task on vocabulary difficulty prediction for English learners. In Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), San Diego, California. Association for Computational Linguistics.
- [10] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop. http://arxiv.org/abs/1503.02531
- [11] Yusuke Ide, Masato Mita, Adam Nohejl, Hiroki Ouchi, and Taro Watanabe. 2023. Japanese lexical complexity for non-native readers: A new dataset. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 477--487, Toronto, Canada. Association for Computati... https://doi.org/10.18653/v1/2023.bea-1.40
- [12] John H.A.L. Jong, Mike Mayor, and Catherine Hayes. 2016. Developing global scale of English learning objectives aligned to the common European framework. Technical report. https://www.pearson.com/content/dam/one-dot-com/one-dot-com/english/TeacherResources/GSE/GSE-WhitePaper-Developing-LOs.pdf
- [13] Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Associ... https://aclanthology.org/L18-1275/
- [14] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511--2522, Singapore. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.153
- [15] Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, pages 4768--4777, Red Hook, NY, USA. Curran Associates Inc. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
- [16]
- [17] Mistral AI. 2026. Ministral 3. ArXiv preprint, arXiv:2601.08584v1 [cs.CL]. http://arxiv.org/abs/2601.08584v1
- [18] Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 147--155, Chiang Mai, Thailand. Asian Fe... https://aclanthology.org/I11-1017/
- [19] Adam Nohejl, Akio Hayakawa, Yusuke Ide, and Taro Watanabe. 2024. Difficult for whom? A study of Japanese lexical complexity. In Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), pages 69--81, Miami, Florida, USA. Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.tsar-1.8
- [20] Adam Nohejl, Akio Hayakawa, Yusuke Ide, and Taro Watanabe. 2025a. A Japanese dataset and efficient multilingual LLM-based methods for lexical simplification and lexical complexity prediction. Journal of Natural Language Processing, 32(4):1129--1188. https://doi.org/10.5715/jnlp.32.1129
- [21] Adam Nohejl, Frederikus Hudi, Eunike Andriani Kardinata, Shintaro Ozaki, Maria Angelica Riera Machin, Hongyu Sun, Justin Vasselli, and Taro Watanabe. 2025b. Beyond film subtitles: Is YouTube the best approximation of spoken vocabulary? In Proceedings of the 31st International Conference on Computational ... https://aclanthology.org/2025.coling-main.641/
- [22] OpenAI. 2024. GPT-4 Technical Report. ArXiv preprint, arXiv:2303.08774v6 [cs.CL]. http://arxiv.org/abs/2303.08774v6
- [23] OpenAI. 2025. Update to GPT-5 System Card: GPT-5.2. Technical report. https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf
- [24] Gustavo Paetzold and Lucia Specia. 2016. SemEval 2016 task 11: Complex word identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 560--569, San Diego, California. Association for Computational Linguistics. https://doi.org/10.18653/v1/S16-1085
- [25] Qwen Team. 2025. Qwen2.5 Technical Report. ArXiv preprint, arXiv:2412.15115v2 [cs.CL]. http://arxiv.org/abs/2412.15115v2
- [26] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS 2019. http://arxiv.org/abs/1910.01108
- [27] Norbert Schmitt, Karen Dunn, Barry O'Sullivan, Laurence Anthony, and Benjamin Kremmel. 2021. Introducing Knowledge-based Vocabulary Lists (KVL). TESOL Journal, 12(4):e622. https://doi.org/10.1002/tesj.622
- [28] Norbert Schmitt, Karen Dunn, Barry O'Sullivan, Laurence Anthony, and Benjamin Kremmel. 2024. Knowledge-Based Vocabulary Lists. British Council Monographs on Modern Language Testing. University of Toronto Press. https://doi.org/10.3138/9781800504141
- [29] Graham G. Scott, Anne Keitel, Marc Becirspahic, Bo Yao, and Sara C. Sereno. 2019. The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods, 51(3):1258--1270. https://doi.org/10.3758/s13428-018-1099-3
- [30] Matthew Shardlow, Fernando Alva-Manchego, Riza Batista-Navarro, Stefan Bott, Saul Calderon Ramirez, Rémi Cardon, Thomas François, Akio Hayakawa, Andrea Horbach, Anna Hülsing, Yusuke Ide, Joseph Marvin Imperial, Adam Nohejl, Kai North, Laura Occhipinti, Nelson Pérez Rojas, Nishat Raihan, Tharindu Ranasinghe, Martin Solis Salazar, and 3 others... 2024.
- [31] Matthew Shardlow, Richard Evans, Gustavo Henrique Paetzold, and Marcos Zampieri. 2021. SemEval-2021 task 1: Lexical complexity prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 1--16, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.semeval-1.1
- [32] Lucy Skidmore, Mariano Felice, and Karen Dunn. 2025. Transformer architectures for vocabulary test item difficulty prediction. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 160--174, Vienna, Austria. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.bea-1.12
- [33] Răzvan-Alexandru Smădu, David-Gabriel Ion, Dumitru-Clementin Cercel, Florin Pop, and Mihaela-Claudia Cercel. 2024. Investigating large language models for complex word identification in multilingual and multidomain setups. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Pr... https://doi.org/10.18653/v1/2024.emnlp-main.933
- [34] Team GLM. 2024. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. ArXiv preprint, arXiv:2406.12793v2 [cs.CL]. http://arxiv.org/abs/2406.12793v2
- [35] Seid Muhie Yimam, Chris Biemann, Shervin Malmasi, Gustavo Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack, and Marcos Zampieri. 2018. A report on the complex word identification shared task 2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 66--... https://doi.org/10.18653/v1/W18-0507