pith. sign in

arxiv: 2511.01526 · v2 · pith:47J4GMSPnew · submitted 2025-11-03 · 💻 cs.CL

Difficulty-Controllable Cloze Question Distractor Generation

Pith reviewed 2026-05-21 20:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords cloze questionsdistractor generationdifficulty controlmultitask learningdata augmentationensemble QAmultiple-choice questionslanguage assessment
0
0 comments X

The pith

A framework generates high-quality distractors for cloze questions at controllable difficulty levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish a way to produce wrong-answer options for multiple-choice cloze questions while controlling how difficult those options are. Existing generators lack this control and no suitable labeled data existed, so the authors first build a dataset by generating candidate distractors in two directions, filtering them, and assigning difficulty labels with an ensemble of question-answering models. They then train a single model with multitask learning on that dataset so it can output distractors at a chosen difficulty. A sympathetic reader would care because accurate difficulty control would let tests better match a learner's actual proficiency level instead of offering only generic wrong answers.

Core claim

By creating a difficulty-annotated dataset through a two-way distractor generation process followed by filtering and categorization via an ensemble QA system, and then training a generation model with multitask learning, the framework produces high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.

What carries the argument

Multitask learning strategy applied to a dataset whose difficulty labels come from an ensemble QA system after two-way distractor generation and filtering.

If this is right

  • High-quality distractors can be produced at any chosen difficulty level for cloze questions.
  • Distractor difficulty aligns more closely with human perception than distractors from GPT-4o.
  • A reusable difficulty-annotated dataset becomes available for training future models.
  • Multiple-choice cloze tests can be tailored to specific proficiency levels without manual distractor design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could be adapted to generate controlled-difficulty items for other question formats such as reading comprehension or vocabulary tests.
  • If the ensemble labeling proves reliable, it offers a route to scale difficulty-annotated training data without large-scale human annotation.
  • Integrating the generator into adaptive testing platforms would let item difficulty adjust automatically to a learner's current performance.
  • The approach suggests that combining synthetic data creation with multitask objectives can add controllability to other text-generation tasks in education.

Load-bearing premise

The ensemble QA system assigns difficulty categories that reliably match how humans perceive the difficulty of the distractors.

What would settle it

A human evaluation study in which raters judge the difficulty of distractors the model was asked to generate at easy, medium, or hard levels and the match to those target levels is no better than the match achieved by GPT-4o.

Figures

Figures reproduced from arXiv: 2511.01526 by Gary Geunbae Lee, Seokhoon Kang, Seonjeong Hwang, Yejin Jeon.

Figure 1
Figure 1. Figure 1: Overview of the dataset augmentation pipeline. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of training methods of difficulty [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: QA system annotation scores across two diffi [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Duplication rate of generated distractors with [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Histogram difference of QA ensemble sys [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt template used for filtering distractor candidates with GPT-4o mini. [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Raw confidence score distribution for each [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Evaluation prompt template for relative difficulty. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Evaluation prompt template for assessing invalid distractors. [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
read the original abstract

Multiple-choice cloze questions are commonly used to assess linguistic proficiency and comprehension. However, generating high-quality distractors remains challenging, as existing methods often lack adaptability and control over difficulty levels, and the absence of difficulty-annotated datasets further hinders progress. To address these issues, we propose a novel framework for generating distractors with controllable difficulty by leveraging both data augmentation and a multitask learning strategy. First, to create a high-quality, difficulty-annotated dataset, we introduce a two-way distractor generation process to produce diverse and plausible distractors. These candidates are filtered and then categorized by difficulty using an ensemble QA system. Second, this newly created dataset is used to train a difficulty-controllable generation model via multitask learning. Experimental results demonstrate that our method generates high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a framework for generating difficulty-controllable distractors for multiple-choice cloze questions. It involves a two-way distractor generation process to create candidates, followed by filtering and difficulty categorization using an ensemble QA system to build a dataset. This dataset is then used to train a multitask learning model for controllable generation. The authors claim that their method produces high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning with human-perceived difficulty.

Significance. If the central empirical claims hold after addressing validation gaps, the work could advance controllable text generation for educational applications by providing a pipeline that addresses the scarcity of difficulty-annotated distractor datasets. The multitask learning strategy and two-way augmentation are sensible technical choices that could generalize beyond cloze items.

major comments (2)
  1. [Abstract] Abstract: the claim that the method 'substantially outperforms GPT-4o in aligning distractor difficulty with human perception' is presented without any reported metrics, test-set size, statistical significance, or human-evaluation protocol, leaving the central outperformance result unsupported by evidence in the manuscript.
  2. [Dataset construction] Dataset construction section: difficulty labels are produced by an ensemble QA system with no reported correlation, agreement score, or human validation study against perceived difficulty. Because the multitask model is trained directly on these labels and the GPT-4o comparison is evaluated on human alignment, the absence of this validation is load-bearing for both the controllability results and the headline claim.
minor comments (2)
  1. [Method] The description of the multitask learning objectives would benefit from explicit loss equations or a clear statement of how difficulty control is enforced at inference time.
  2. [Figures/Tables] Table or figure captions describing the two-way generation pipeline should include the exact filtering criteria and ensemble QA voting procedure for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline the revisions we will make to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the method 'substantially outperforms GPT-4o in aligning distractor difficulty with human perception' is presented without any reported metrics, test-set size, statistical significance, or human-evaluation protocol, leaving the central outperformance result unsupported by evidence in the manuscript.

    Authors: We agree the abstract should be self-contained. The full manuscript reports human evaluation results (Section 5.3) comparing our model and GPT-4o on difficulty alignment using a test set of 200 cloze items, with metrics such as alignment accuracy and statistical significance via McNemar's test. We will revise the abstract to include these key quantitative results and a concise description of the human evaluation protocol. revision: yes

  2. Referee: [Dataset construction] Dataset construction section: difficulty labels are produced by an ensemble QA system with no reported correlation, agreement score, or human validation study against perceived difficulty. Because the multitask model is trained directly on these labels and the GPT-4o comparison is evaluated on human alignment, the absence of this validation is load-bearing for both the controllability results and the headline claim.

    Authors: We acknowledge this validation gap is important. The ensemble QA system combines multiple models to estimate difficulty, but the current manuscript does not report correlation with human judgments. We will add a human validation study on a sample of 100 items, reporting Pearson correlation and agreement metrics between the automated labels and human annotators, and include these results in the revised dataset construction section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external ensemble QA labels

full rationale

The paper's pipeline creates a difficulty-annotated dataset via two-way generation, filtering, and categorization by an external ensemble QA system, then trains a multitask model on those labels. Experimental claims compare outputs to human perception and GPT-4o without any step that defines difficulty in terms of the model itself, renames a fitted parameter as a prediction, or reduces a central result to a self-citation chain. The derivation chain is self-contained against the stated external benchmarks and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed. The approach implicitly assumes the ensemble QA system yields human-aligned difficulty labels.

pith-pipeline@v0.9.0 · 5687 in / 989 out tokens · 33189 ms · 2026-05-21T20:39:25.348361+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 3 internal anchors

  1. [1]

    Payne, and Valentina Tamma

    Samah AlKhuzaey, Floriana Grasso, Terry R. Payne, and Valentina Tamma. 2024. https://doi.org/10.1007/s40593-023-00362-1 Text-based Question Difficulty Prediction: A Systematic Review of Automatic Approaches . International Journal of Artificial Intelligence in Education, 34(3):862--914

  2. [2]

    Joshua Bensemann, Alex Peng, Diana Benavides-Prado, Yang Chen, Neset Tan, Paul Michael Corballis, Patricia Riddle, and Michael Witbrock. 2022. https://doi.org/10.18653/v1/2022.cmcl-1.9 Eye gaze and self-attention: How humans and transformers attend words in sentences . In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pag...

  3. [3]

    George EP Box and David R Cox. 1964. An analysis of transformations. Journal of the Royal Statistical Society Series B: Statistical Methodology, 26(2):211--243

  4. [4]

    Devrim C avu s o g lu, Se c il S en, and Ula s Sert. 2024. https://aclanthology.org/2024.findings-emnlp.568 D is G e M : Distractor Generation for Multiple Choice Questions with Span Masking . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9714--9732, Miami, Florida, USA. Association for Computational Linguistics

  5. [5]

    Shang-Hsuan Chiang, Ssu-Cheng Wang, and Yao-Chung Fan. 2022. https://doi.org/10.18653/v1/2022.findings-emnlp.429 CDGP : Automatic Cloze Distractor Generation based on Pre-trained Language Model . In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5835--5840, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

  6. [6]

    Marco Aur \'e lio de Souza Rodrigues, Paula Chimenti, and Antonio Roberto Ramos Nogueira. 2021. https://doi.org/10.1007/s10639-020-10276-3 An Exploration of eLearning Adoption in the Educational Ecosystem . Education and Information Technologies, 26(1):585--615

  7. [7]

    Rui Pedro dos Santos Correia, Jorge Baptista, Nuno Mamede, Isabel Trancoso, and Maxine Eskenazi. 2010. Automatic Generation of Cloze Question Distractors . In Second Language Studies: Acquisition, Learning, Education and Technology (L2WS 2010), pages paper P2--11

  8. [8]

    Fellbaum

    C. Fellbaum. 1998. http://books.google.at/books?id=Rehu8OOzMIMC WordNet: An Electronic Lexical Database . Language, Speech and Communication. Mit Press

  9. [9]

    Haladyna

    T.M. Haladyna. 2004. https://books.google.co.kr/books?id=4fJ2YXMLTrsC Developing and Validating Multiple-choice Test Items . Lawrence Erlbaum Associates

  10. [10]

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. https://openreview.net/forum?id=nZeVKeeFYf9 Lo RA : Low-Rank Adaptation of Large Language Models . In International Conference on Learning Representations

  11. [11]

    D. Naber. 2003. https://books.google.co.kr/books?id=yKPswAEACAAJ A Rule-Based Style and Grammar Checker . GRIN Verlag

  12. [12]

    Brian Ondov, Kush Attal, and Dina Demner-Fushman. 2024. https://doi.org/10.18653/v1/2024.naacl-long.220 Pedagogically aligned objectives create reliable automatic cloze tests . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 396...

  13. [13]

    OpenAI. 2024. https://arxiv.org/abs/2303.08774 GPT-4 Technical Report . Preprint, arXiv:2303.08774

  14. [14]

    Jae-Woo Park, Seong-Jin Park, Hyun-Sik Won, and Kang-Min Kim. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.477 Large Language Models are Students at Various Levels: Zero-shot Question Difficulty Estimation . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8157--8177, Miami, Florida, USA. Association for Computation...

  15. [15]

    Juan Pino and Maxine Esk \' e nazi. 2009. https://doi.org/10.21437/SLATE.2009-27 Semi-Automatic Generation of Cloze Question Distractors Effect of Students' L1 . In ISCA International Workshop on Speech and Language Technology in Education, SLaTE 2009, Warwickshire, England, UK, September 3-5, 2009 , pages 65--68. ISCA

  16. [16]

    Nils Reimers and Iryna Gurevych. 2019. https://arxiv.org/abs/1908.10084 Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics

  17. [17]

    Siyu Ren and Kenny Q. Zhu. 2021. https://doi.org/10.1609/aaai.v35i5.16559 Knowledge-Driven Distractor Generation for Cloze-Style Multiple Choice Questions . Proceedings of the AAAI Conference on Artificial Intelligence, 35(5):4339--4347

  18. [18]

    ALGhamdi, Ahmed Y

    Assad Ali Rezigalla, Ali Mohammed Elhassan Seid Ahmed Eleragi, Amar Babikir Elhussein, Jaber Alfaifi, Mushabab A. ALGhamdi, Ahmed Y. Al Ameer, Amar Ibrahim Omer Yahia, Osama A. Mohammed, and Masoud Ishag Elkhalifa Adam. 2024. https://doi.org/10.1186/s12909-024-05433-y Item analysis: the impact of distractor efficiency on the difficulty index and discrimin...

  19. [19]

    Stephen G Sireci. 1992. The Utility of IRT in Small-Sample Testing Applications . In The Annual Meeting of the American Psychological Association 100th. ERIC

  20. [20]

    Yuni Susanti, Hitoshi Nishikawa, Takenobu Tokunaga, and Obari Hiroyuki. 2016. https://doi.org/10.5220/0005775502670274 Item difficulty analysis of english vocabulary questions . In Proceedings of the 8th International Conference on Computer Supported Education - Volume 1: CSEDU, pages 267--274. INSTICC, SciTePress

  21. [21]

    Yuni Susanti, Takenobu Tokunaga, Hitoshi Nishikawa, and Hiroyuki Obari. 2017. https://doi.org/10.1186/s41039-017-0065-5 Controlling item difficulty for automatic vocabulary question generation . Research and Practice in Technology Enhanced Learning, 12(1):25

  22. [22]

    Cloze Procedure

    Wilson L. Taylor. 1953. https://doi.org/10.1177/107769905303000401 “Cloze Procedure”: A New Tool for Measuring Readability . Journalism Quarterly, 30(4):415--433

  23. [23]

    Gemma Team. 2024. https://arxiv.org/abs/2408.00118 Gemma 2: Improving Open Language Models at a Practical Size . Preprint, arXiv:2408.00118

  24. [24]

    Yuto Tomikawa and Masaki Uto. 2024. Difficulty-Controllable Multiple-Choice Question Generation for Reading Comprehension Using Item Response Theory . In Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky, pages 312--320, Cham. Sp...

  25. [25]

    Masaki Uto, Yuto Tomikawa, and Ayaka Suzuki. 2023. https://doi.org/10.18653/v1/2023.bea-1.10 Difficulty-Controllable Neural Question Generation for Reading Comprehension using Item Response Theory . In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 119--129, Toronto, Canada. Association fo...

  26. [26]

    Hui-Juan Wang, Kai-Yu Hsieh, Han-Cheng Yu, Jui-Ching Tsou, Yu An Shih, Chen-Hua Huang, and Yao-Chung Fan. 2023. https://doi.org/10.18653/v1/2023.findings-acl.790 Distractor Generation based on T ext2 T ext Language Models with Pseudo K ullback- L eibler Divergence Regulation . In Findings of the Association for Computational Linguistics: ACL 2023, pages 1...

  27. [27]

    Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu. 2012. https://doi.org/10.1145/2213836.2213891 Probase: a probabilistic taxonomy for text understanding . In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, page 481–492, New York, NY, USA. Association for Computing Machinery

  28. [28]

    Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2018. https://doi.org/10.18653/v1/D18-1257 Large-scale Cloze Test Dataset Created by Teachers . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2344--2356, Brussels, Belgium. Association for Computational Linguistics

  29. [29]

    Chak Yan Yeung, John Lee, and Benjamin Tsou. 2019. https://aclanthology.org/U19-1021 Difficulty-aware Distractor Generation for Gap-Fill Items . In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, pages 159--164, Sydney, Australia. Australasian Language Technology Association

  30. [30]

    Jiajie Zou, Yuran Zhang, Jialu Li, Xing Tian, and Nai Ding. 2023. https://doi.org/10.7554/eLife.87197 Human attention during goal-directed reading comprehension relies on task optimization . eLife, 12:RP87197