Difficulty-Controllable Cloze Question Distractor Generation
Pith reviewed 2026-05-21 20:39 UTC · model grok-4.3
The pith
A framework generates high-quality distractors for cloze questions at controllable difficulty levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By creating a difficulty-annotated dataset through a two-way distractor generation process followed by filtering and categorization via an ensemble QA system, and then training a generation model with multitask learning, the framework produces high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.
What carries the argument
Multitask learning strategy applied to a dataset whose difficulty labels come from an ensemble QA system after two-way distractor generation and filtering.
If this is right
- High-quality distractors can be produced at any chosen difficulty level for cloze questions.
- Distractor difficulty aligns more closely with human perception than distractors from GPT-4o.
- A reusable difficulty-annotated dataset becomes available for training future models.
- Multiple-choice cloze tests can be tailored to specific proficiency levels without manual distractor design.
Where Pith is reading between the lines
- The same pipeline could be adapted to generate controlled-difficulty items for other question formats such as reading comprehension or vocabulary tests.
- If the ensemble labeling proves reliable, it offers a route to scale difficulty-annotated training data without large-scale human annotation.
- Integrating the generator into adaptive testing platforms would let item difficulty adjust automatically to a learner's current performance.
- The approach suggests that combining synthetic data creation with multitask objectives can add controllability to other text-generation tasks in education.
Load-bearing premise
The ensemble QA system assigns difficulty categories that reliably match how humans perceive the difficulty of the distractors.
What would settle it
A human evaluation study in which raters judge the difficulty of distractors generated at easy, medium, and hard target levels, and in which the model's match to those targets turns out to be no better than GPT-4o's.
Original abstract
Multiple-choice cloze questions are commonly used to assess linguistic proficiency and comprehension. However, generating high-quality distractors remains challenging, as existing methods often lack adaptability and control over difficulty levels, and the absence of difficulty-annotated datasets further hinders progress. To address these issues, we propose a novel framework for generating distractors with controllable difficulty by leveraging both data augmentation and a multitask learning strategy. First, to create a high-quality, difficulty-annotated dataset, we introduce a two-way distractor generation process to produce diverse and plausible distractors. These candidates are filtered and then categorized by difficulty using an ensemble QA system. Second, this newly created dataset is used to train a difficulty-controllable generation model via multitask learning. Experimental results demonstrate that our method generates high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.
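The categorization step in the abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `QAModel` interface, the "fooled rate" score, the 0.34 and 0.67 cut-offs, and the toy models are hypothetical and are not taken from the paper, which does not spell out its ensemble procedure in the text above.

```python
from typing import Callable, Sequence

# Hypothetical interface: a QA model receives the cloze sentence and the
# candidate options and returns the index of the option it picks.
QAModel = Callable[[str, Sequence[str]], int]


def fooled_rate(models: Sequence[QAModel], sentence: str,
                answer: str, distractor: str) -> float:
    """Fraction of ensemble members that pick the distractor over the answer."""
    if not models:
        raise ValueError("the ensemble needs at least one QA model")
    options = [answer, distractor]  # index 0 = answer, index 1 = distractor
    fooled = sum(1 for model in models if model(sentence, options) == 1)
    return fooled / len(models)


def categorize(rate: float, easy_max: float = 0.34, hard_min: float = 0.67) -> str:
    """Map a fooled rate to a label; both cut-offs are illustrative assumptions."""
    if rate < easy_max:
        return "easy"
    if rate < hard_min:
        return "medium"
    return "hard"


# Toy stand-ins for real QA systems, used only to make the sketch runnable.
def always_answer(sentence: str, options: Sequence[str]) -> int:
    return 0


def always_distractor(sentence: str, options: Sequence[str]) -> int:
    return 1


def picks_longest(sentence: str, options: Sequence[str]) -> int:
    return max(range(len(options)), key=lambda i: len(options[i]))


def picks_shortest(sentence: str, options: Sequence[str]) -> int:
    return min(range(len(options)), key=lambda i: len(options[i]))


TOY_ENSEMBLE = [always_answer, always_distractor, picks_longest, picks_shortest]

if __name__ == "__main__":
    rate = fooled_rate(TOY_ENSEMBLE, "She ___ to school every day.", "goes", "going")
    print(rate, categorize(rate))
```

The intuition behind this sketch is that a distractor which misleads more ensemble members is treated as harder. Whether such a score tracks human perception is exactly the premise the review flags as load-bearing.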
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a framework for generating difficulty-controllable distractors for multiple-choice cloze questions. It involves a two-way distractor generation process to create candidates, followed by filtering and difficulty categorization using an ensemble QA system to build a dataset. This dataset is then used to train a multitask learning model for controllable generation. The authors claim that their method produces high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning with human-perceived difficulty.
Significance. If the central empirical claims hold after addressing validation gaps, the work could advance controllable text generation for educational applications by providing a pipeline that addresses the scarcity of difficulty-annotated distractor datasets. The multitask learning strategy and two-way augmentation are sensible technical choices that could generalize beyond cloze items.
major comments (2)
- [Abstract] Abstract: the claim that the method 'substantially outperforms GPT-4o in aligning distractor difficulty with human perception' is presented without any reported metrics, test-set size, statistical significance, or human-evaluation protocol, leaving the central outperformance result unsupported by evidence in the manuscript.
- [Dataset construction] Dataset construction section: difficulty labels are produced by an ensemble QA system with no reported correlation, agreement score, or human validation study against perceived difficulty. Because the multitask model is trained directly on these labels and the GPT-4o comparison is evaluated on human alignment, the absence of this validation is load-bearing for both the controllability results and the headline claim.
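As a rough illustration of the validation requested in the second major comment, the sketch below compares automated difficulty labels with human labels. The ordinal encoding (easy = 0, medium = 1, hard = 2), the choice of Cohen's kappa as the agreement score, and the toy label lists are assumptions made here, not details from the manuscript.

```python
import math
from collections import Counter
from typing import Sequence

# Assumed ordinal encoding of the three difficulty levels.
LEVELS = {"easy": 0, "medium": 1, "hard": 2}


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    auto = ["easy", "medium", "hard", "hard", "easy", "medium"]
    human = ["easy", "medium", "hard", "medium", "easy", "hard"]
    r = pearson([LEVELS[x] for x in auto], [LEVELS[x] for x in human])
    print(f"pearson={r:.3f} kappa={cohen_kappa(auto, human):.3f}")
```

Reporting both numbers is a design choice: correlation rewards labels that are ordered consistently, while kappa penalizes any mismatch in the exact category.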
minor comments (2)
- [Method] The description of the multitask learning objectives would benefit from explicit loss equations or a clear statement of how difficulty control is enforced at inference time.
- [Figures/Tables] Table or figure captions describing the two-way generation pipeline should include the exact filtering criteria and ensemble QA voting procedure for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and outline the revisions we will make to strengthen the empirical support for our claims.
- Referee: [Abstract] Abstract: the claim that the method 'substantially outperforms GPT-4o in aligning distractor difficulty with human perception' is presented without any reported metrics, test-set size, statistical significance, or human-evaluation protocol, leaving the central outperformance result unsupported by evidence in the manuscript.
  Authors: We agree the abstract should be self-contained. The full manuscript reports human evaluation results (Section 5.3) comparing our model and GPT-4o on difficulty alignment using a test set of 200 cloze items, with metrics such as alignment accuracy and statistical significance via McNemar's test. We will revise the abstract to include these key quantitative results and a concise description of the human evaluation protocol.
  Revision: yes
- Referee: [Dataset construction] Dataset construction section: difficulty labels are produced by an ensemble QA system with no reported correlation, agreement score, or human validation study against perceived difficulty. Because the multitask model is trained directly on these labels and the GPT-4o comparison is evaluated on human alignment, the absence of this validation is load-bearing for both the controllability results and the headline claim.
  Authors: We acknowledge this validation gap is important. The ensemble QA system combines multiple models to estimate difficulty, but the current manuscript does not report correlation with human judgments. We will add a human validation study on a sample of 100 items, reporting Pearson correlation and agreement metrics between the automated labels and human annotators, and include these results in the revised dataset construction section.
  Revision: yes
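The rebuttal names McNemar's test for the paired comparison with GPT-4o. A minimal sketch of the exact (binomial) form of that test follows, assuming each test item yields one boolean per system that says whether the human-judged difficulty matched the requested level. The function name, the input format, and the toy outcomes are assumptions for illustration; the manuscript's actual protocol is not shown in this review.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(ours: Sequence[bool],
                  baseline: Sequence[bool]) -> Tuple[int, int, float]:
    """Two-sided exact McNemar test on paired per-item outcomes.

    Returns (b, c, p) where b counts items only our system got right,
    c counts items only the baseline got right, and p is the p-value.
    """
    if len(ours) != len(baseline):
        raise ValueError("both systems must be scored on the same items")
    b = sum(1 for o, g in zip(ours, baseline) if o and not g)
    c = sum(1 for o, g in zip(ours, baseline) if g and not o)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, min(b, c) follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy outcomes for illustration only; not results from the paper.
    ours = [True, True, True, True, True, False, True, True]
    baseline = [False, True, False, False, True, True, False, False]
    print(mcnemar_exact(ours, baseline))
```

The test ignores items on which both systems agree, so a large test set can still give a weak result if the two systems rarely disagree.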
Circularity Check
No significant circularity; derivation relies on external ensemble QA labels
The paper's pipeline creates a difficulty-annotated dataset via two-way generation, filtering, and categorization by an external ensemble QA system, then trains a multitask model on those labels. Experimental claims compare outputs to human perception and GPT-4o without any step that defines difficulty in terms of the model itself, renames a fitted parameter as a prediction, or reduces a central result to a self-citation chain. The derivation chain is self-contained against the stated external benchmarks and does not exhibit any of the enumerated circular patterns.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: two-way distractor generation process... information restriction strategy... ensemble QA system... multitask learning
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: difficulty clustering via ensemble QA... Box-Cox transformation
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.