pith. sign in

arxiv: 2605.19316 · v1 · pith:IANXT52Mnew · submitted 2026-05-19 · 💻 cs.CL

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

Pith reviewed 2026-05-20 05:53 UTC · model grok-4.3

classification 💻 cs.CL
keywords reading comprehensionitem generationdifficulty controlmulti-agent frameworklarge language modelsfeature constraintsiterative revisiontest generation
0
0 comments X

The pith

Multi-agent LLM setup with feature evaluators generates reading comprehension items that match target difficulty constraints far more reliably than single-agent prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MAFIG, a multi-agent framework in which several large language model agents collaborate with dedicated evaluators to create and revise reading comprehension test items until they satisfy chosen feature constraints that control difficulty. Single-agent prompting often produces items that drift away from the intended features, so the new method uses iterative feedback loops where evaluators check specific aspects and guide corrections. The authors also present a way to order the constraint sets so difficulty rises steadily across a sequence of generated items. This matters for building fair assessments and adaptive learning tools that need predictable difficulty levels rather than random variation. Experiments show the multi-agent approach meets constraints at higher rates than prior methods.

Core claim

MAFIG coordinates multiple LLM agents and feature-specific evaluators that work together to generate items and then iteratively revise them based on whether they satisfy the intended constraints. A separate construction method produces a difficulty-calibrated sequence of feature constraint sets that results in items with monotonically increasing difficulty. The framework is shown to achieve significantly higher rates of adherence to target constraints than single-agent baselines.

What carries the argument

MAFIG, the multi-agent framework in which LLM agents generate candidate items while feature-specific evaluators assess compliance and direct targeted revisions to enforce the constraints.

If this is right

  • Items adhere to target feature constraints at significantly higher rates than single-agent baselines.
  • A difficulty-calibrated sequence of constraint sets produces items with monotonically increasing difficulty.
  • Iterative revision guided by evaluator feedback improves overall constraint satisfaction in generated reading items.
  • The approach enables more consistent control over difficulty levels for reading comprehension tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-agent pattern with domain-specific evaluators could be applied to item generation in other subjects such as mathematics or science.
  • Large testing organizations could integrate the framework into pipelines to reduce manual quality checks for new items.
  • If constraint adherence improves, the generated items might yield more accurate difficulty estimates when used in adaptive testing systems.

Load-bearing premise

Feature-specific evaluators can accurately and consistently judge whether generated items satisfy the intended constraints without systematic errors or the need for external human checks.

What would settle it

A human review of several hundred items generated under MAFIG that finds the actual rate of constraint satisfaction is no better than the rate achieved by single-agent baselines.

Figures

Figures reproduced from arXiv: 2605.19316 by Gary Geunbae Lee, Hyounghun Kim, Jun Seo, Seonjeong Hwang.

Figure 1
Figure 1. Figure 1: Example of feature-constrained difficulty [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the MAFIG generation pipeline. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of Difficulty Alignment Scores assigned by human evaluators. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Constraint satisfaction performance under [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ablation results showing the effect of Plan [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Screenshots of instructions provided to human [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Feature-wise ARs for GPT-5 (Direct Prompting) and Qwen3-32B-based [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visualization of feature-wise satisfaction across rounds for examples that (a) succeed or (b) fail to achieve [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Expert-perceived difficulty factors for each [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Prompt template for the Drafter agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Prompt template for the Planner agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Prompt template for the Editor agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Prompt template for the Reworder agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Prompt template for the Refiner agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗
read the original abstract

Recent studies in difficulty-controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty-related features. However, existing methods typically rely on a single-agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi-agent Framework for Feature-constrained Item Generation, where multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty-calibrated constraint sequence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces MAFIG, a multi-agent LLM framework in which specialized agents and feature-specific evaluators collaborate to generate and iteratively revise reading comprehension items so that they satisfy explicit lexical, syntactic, and semantic constraints tied to target difficulty. It further proposes a procedure for constructing a monotonic sequence of constraint sets that is claimed to produce items of steadily increasing difficulty. Experiments are reported to show substantially higher constraint-adherence rates for MAFIG than for single-agent baselines.

Significance. If the empirical claims hold after proper validation of the evaluators, the work would supply a practical engineering advance for automated item generation in language testing: a multi-agent loop that demonstrably improves adherence to multi-dimensional feature constraints without requiring direct difficulty-parameter tuning. The difficulty-calibrated constraint-sequence construction is a reusable methodological contribution that could be adopted by other generation pipelines.

major comments (1)
  1. [Abstract / Experimental Results] Abstract and Experimental Results section: the central claim that MAFIG achieves 'significantly higher' constraint adherence and 'robust difficulty control' rests entirely on the judgments of the feature-specific evaluators, yet the manuscript supplies no information on how these evaluators were implemented, whether they are LLM-based, how they were calibrated, or any human-validation or inter-annotator-agreement statistics. Because the reported performance gap and the monotonic-difficulty assertion both depend on these unvalidated judgments, the empirical support for the primary contribution is unverifiable from the given text.
minor comments (1)
  1. [Abstract] The abstract states that items 'adhere to target constraints at a significantly higher rate' but does not name the statistical test, effect size, or number of items per condition; these details should be added for reproducibility.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We appreciate the referee's thorough review and constructive criticism. We address the major comment regarding the description and validation of the feature-specific evaluators in detail below.

read point-by-point responses
  1. Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: the central claim that MAFIG achieves 'significantly higher' constraint adherence and 'robust difficulty control' rests entirely on the judgments of the feature-specific evaluators, yet the manuscript supplies no information on how these evaluators were implemented, whether they are LLM-based, how they were calibrated, or any human-validation or inter-annotator-agreement statistics. Because the reported performance gap and the monotonic-difficulty assertion both depend on these unvalidated judgments, the empirical support for the primary contribution is unverifiable from the given text.

    Authors: We agree that the current version of the manuscript does not provide adequate details on the feature-specific evaluators, which is a shortcoming. In the revised manuscript, we will include a new subsection titled 'Evaluator Implementation' in the Experimental Setup. This will specify that the evaluators are LLM-based, using GPT-4 with carefully designed prompts for each constraint type (lexical, syntactic, semantic). The calibration involved iterative prompt engineering on a held-out development set of 200 items to achieve high agreement with manual annotations on that set. We will also report the prompt templates used. However, a full-scale human validation study with multiple annotators and IAA statistics was not conducted for the main experiments, as the evaluators were intended as automated proxies. We will explicitly state this limitation and its implications for the claims in the revised paper. revision: partial

standing simulated objections not resolved
  • Full human validation and inter-annotator agreement statistics for the evaluators were not performed in the study.

Circularity Check

0 steps flagged

No circularity: empirical engineering contribution with independent experimental validation

full rationale

The paper introduces MAFIG as a multi-agent LLM framework for generating reading comprehension items under feature constraints and proposes a method to build monotonically increasing difficulty sequences. Experimental results compare adherence rates against baselines. No equations, fitted parameters, or derivations appear in the provided abstract or description. The central claim rests on empirical comparison rather than any self-definitional reduction, fitted-input prediction, or load-bearing self-citation chain. The framework and constraint-sequence construction are presented as engineering proposals whose validity is tested externally via generated items and evaluator judgments, without reducing to tautological inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions about LLM text generation capabilities and the measurability of difficulty features; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption LLMs can generate coherent reading comprehension items when appropriately prompted and revised.
    Core premise enabling the use of LLMs for item generation and iterative refinement.

pith-pipeline@v0.9.0 · 5682 in / 1075 out tokens · 31772 ms · 2026-05-20T05:53:50.091514+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 10 internal anchors

  1. [1]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    G-eval: NLG evaluation using gpt-4 with better human alignment , author=. arXiv preprint arXiv:2303.16634 , year=

  2. [2]

    arXiv preprint arXiv:2210.07197 , year=

    Towards a unified multi-dimensional evaluator for text generation , author=. arXiv preprint arXiv:2210.07197 , year=

  3. [3]

    Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

    Multi-agent collaboration: Harnessing the power of intelligent llm agents , author=. arXiv preprint arXiv:2306.03314 , year=

  4. [4]

    Advances in Neural Information Processing Systems , volume=

    Self-refine: Iterative refinement with self-feedback , author=. Advances in Neural Information Processing Systems , volume=

  5. [5]

    Advances in neural information processing systems , volume=

    Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=

  6. [6]

    arXiv preprint arXiv:2409.14371 , year=

    The ability of large language models to evaluate constraint-satisfaction in agent responses to open-ended requests , author=. arXiv preprint arXiv:2409.14371 , year=

  7. [7]

    arXiv preprint arXiv:2503.08669 , year=

    Agentorca: A dual-system framework to evaluate language agents on operational routine and constraint adherence , author=. arXiv preprint arXiv:2503.08669 , year=

  8. [8]

    Metal: A multi- agent framework for chart generation with test-time scaling,

    Metal: A multi-agent framework for chart generation with test-time scaling , author=. arXiv preprint arXiv:2502.17651 , year=

  9. [9]

    Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

    Zero-Shot Strategies for Length-Controllable Summarization , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

  10. [10]

    arXiv preprint arXiv:2411.12460 , year=

    Exploring Iterative Controllable Summarization with Large Language Models , author=. arXiv preprint arXiv:2411.12460 , year=

  11. [11]

    The eleventh international conference on learning representations , year=

    React: Synergizing reasoning and acting in language models , author=. The eleventh international conference on learning representations , year=

  12. [12]

    Advances in Neural Information Processing Systems , volume=

    Reflexion: Language agents with verbal reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

  13. [13]

    Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

    Evaluating reading comprehension exercises generated by LLMs: A showcase of ChatGPT in education applications , author=. Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

  14. [14]

    Education and Information Technologies , volume=

    Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in english education , author=. Education and Information Technologies , volume=. 2024 , publisher=

  15. [15]

    Proceedings of the 31st International Conference on Computational Linguistics , pages=

    Automatic multiple-choice question generation and evaluation systems based on LLM: A study case with university resolutions , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

  16. [16]

    Educational Measurement: Issues and Practice , volume=

    Using OpenAI GPT to generate reading comprehension items , author=. Educational Measurement: Issues and Practice , volume=. 2024 , publisher=

  17. [17]

    Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI)@ LREC-COLING 2024 , pages=

    Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models , author=. Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI)@ LREC-COLING 2024 , pages=

  18. [18]

    Advances in Neural Information Processing Systems , volume=

    Large language models are semi-parametric reinforcement learning agents , author=. Advances in Neural Information Processing Systems , volume=

  19. [19]

    International Conference on Artificial Intelligence in Education , pages=

    How useful are educational questions generated by large language models? , author=. International Conference on Artificial Intelligence in Education , pages=. 2023 , organization=

  20. [20]

    Difficulty Controllable Generation of Reading Comprehension Questions

    Difficulty controllable generation of reading comprehension questions , author=. arXiv preprint arXiv:1807.03586 , year=

  21. [21]

    Findings of the Association for Computational Linguistics ACL 2024 , pages=

    Planning first, question second: An llm-guided method for controllable question generation , author=. Findings of the Association for Computational Linguistics ACL 2024 , pages=

  22. [22]

    Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

    Difficulty-controllable neural question generation for reading comprehension using item response theory , author=. Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

  23. [23]

    IEEE Transactions on Learning Technologies , year=

    Adaptive question--answer generation with difficulty control using Item Response Theory and pre-trained transformer models , author=. IEEE Transactions on Learning Technologies , year=

  24. [24]

    arXiv preprint arXiv:2510.19265 , year=

    Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization , author=. arXiv preprint arXiv:2510.19265 , year=

  25. [25]

    Computers and Education: Artificial Intelligence , volume=

    Automated reading passage generation with OpenAI's large language model , author=. Computers and Education: Artificial Intelligence , volume=. 2023 , publisher=

  26. [26]

    International Conference on Artificial Intelligence in Education , pages=

    Difficulty-controllable multiple-choice question generation for reading comprehension using item response theory , author=. International Conference on Artificial Intelligence in Education , pages=. 2024 , organization=

  27. [27]

    International Conference on Computers in Education , year=

    Difficulty-controllable reading comprehension question generation considering the difficulty of reading passages , author=. International Conference on Computers in Education , year=

  28. [28]

    arXiv preprint arXiv:2505.07618 , year=

    KAQG: A Knowledge-Graph-Enhanced RAG for Difficulty-Controlled Question Generation , author=. arXiv preprint arXiv:2505.07618 , year=

  29. [29]

    arXiv preprint arXiv:2504.14232 , year=

    Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment , author=. arXiv preprint arXiv:2504.14232 , year=

  30. [30]

    International Conference on Artificial Intelligence in Education , pages=

    Towards automated multiple choice question generation and evaluation: aligning with Bloom’s taxonomy , author=. International Conference on Artificial Intelligence in Education , pages=. 2024 , organization=

  31. [31]

    IEEE Access , volume=

    MCQGen: A large language model-driven MCQ generator for personalized learning , author=. IEEE Access , volume=. 2024 , publisher=

  32. [32]

    Information Processing & Management , volume=

    Difficulty-controllable question generation over knowledge graphs: A counterfactual reasoning approach , author=. Information Processing & Management , volume=. 2024 , publisher=

  33. [33]

    International Conference on Artificial Intelligence in Education , pages=

    Systematic Control of Multiple-Choice Item Difficulty Through LLM-Based Distractor Generation , author=. International Conference on Artificial Intelligence in Education , pages=. 2025 , organization=

  34. [34]

    International Journal of Artificial Intelligence in Education , volume=

    Text-based question difficulty prediction: A systematic review of automatic approaches , author=. International Journal of Artificial Intelligence in Education , volume=. 2024 , publisher=

  35. [35]

    International Conference on Human-Computer Interaction , pages=

    Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty , author=. International Conference on Human-Computer Interaction , pages=. 2025 , organization=

  36. [36]

    Proceedings of the 31st International Conference on Computational Linguistics , pages=

    TEEMIL: Towards Educational MCQ Difficulty Estimation in Indic Languages , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

  37. [37]

    arXiv preprint arXiv:2404.13343 , year=

    UnibucLLM: Harnessing LLMs for automated prediction of item difficulty and response time for multiple-choice questions , author=. arXiv preprint arXiv:2404.13343 , year=

  38. [38]

    arXiv preprint arXiv:2404.10704 , year=

    Question difficulty ranking for multiple-choice reading comprehension , author=. arXiv preprint arXiv:2404.10704 , year=

  39. [39]

    Proceedings of the Eleventh ACM Conference on Learning@ Scale , pages=

    Generative students: Using llm-simulated student profiles to support question item evaluation , author=. Proceedings of the Eleventh ACM Conference on Learning@ Scale , pages=

  40. [40]

    Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

    Large language models are students at various levels: Zero-shot question difficulty estimation , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

  41. [41]

    Proceedings of the 17th International Conference on Educational Data Mining , pages=

    How hard can this question be? An exploratory analysis of features assessing question difficulty using LLMs , author=. Proceedings of the 17th International Conference on Educational Data Mining , pages=

  42. [42]

    Applied Measurement in Education , volume=

    Response demands of reading comprehension test items: A review of item difficulty modeling studies , author=. Applied Measurement in Education , volume=. 2022 , publisher=

  43. [43]

    2014 ieee international conference on software science, technology and engineering , pages=

    Learning methods for rating the difficulty of reading comprehension questions , author=. 2014 ieee international conference on software science, technology and engineering , pages=. 2014 , organization=

  44. [44]

    Cogent Education , volume=

    Analysis of IELTS and TOEFL reading and listening tests in terms of Revised Bloom’s Taxonomy , author=. Cogent Education , volume=. 2020 , publisher=

  45. [45]

    Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

    Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items? , author=. arXiv preprint arXiv:2510.25064 , year=

  46. [46]

    European conference on information retrieval , pages=

    A novel multi-stage prompting approach for language agnostic mcq generation using gpt , author=. European conference on information retrieval , pages=. 2024 , organization=

  47. [47]

    Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

  48. [48]

    and Jones, Ronald W

    Hambleton, Ronald K. and Jones, Ronald W. , title =. Educational Measurement: Issues and Practice , volume =. 1993 , doi =

  49. [49]

    Lord , title =

    Frederic M. Lord , title =. 1980 , publisher =

  50. [50]

    Studies in English Education , volume=

    Prediction of item difficulty on a reading comprehension test , author=. Studies in English Education , volume=. 2012 , publisher=

  51. [51]

    2014 , publisher=

    Automated evaluation of text and discourse with Coh-Metrix , author=. 2014 , publisher=

  52. [52]

    , author=

    Children's comprehension of between-and within-sentence syntactic structures. , author=. Journal of educational psychology , volume=. 1970 , publisher=

  53. [53]

    Review of educational research , volume=

    How to construct achievement tests to assess comprehension , author=. Review of educational research , volume=. 1972 , publisher=

  54. [54]

    Applied psychological measurement , volume=

    Component latent trait models for paragraph comprehension tests , author=. Applied psychological measurement , volume=. 1987 , publisher=

  55. [55]

    ETS Research Report Series , volume=

    The prediction of GRE reading comprehension item difficulty for expository prose passages for each of three item types: Main ideas, inferences and explicit statements , author=. ETS Research Report Series , volume=. 1991 , publisher=

  56. [56]

    Foreign Language Annals , volume=

    Comparison of L2 listening and reading comprehension by university students learning English in Korea , author=. Foreign Language Annals , volume=. 2004 , publisher=

  57. [57]

    Asian-Pacific Journal of Second and Foreign Language Education , volume=

    Predicting the difficulty of EFL reading comprehension tests based on linguistic indices , author=. Asian-Pacific Journal of Second and Foreign Language Education , volume=. 2023 , publisher=

  58. [58]

    Handbook 1: Cognitive domain , author=

    Taxonomy of educational objectives: The classification of educational goals. Handbook 1: Cognitive domain , author=. 1956 , publisher=

  59. [59]

    Proceedings of the 31st International Conference on Computational Linguistics , pages=

    LLMs meet Bloom’s Taxonomy: A Cognitive View on Large Language Model Evaluations , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

  60. [60]

    Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pages=

    RACE: Large-scale ReAding Comprehension Dataset From Examinations , author=. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pages=

  61. [61]

    Asian Conference on Machine Learning , pages=

    A new multi-choice reading comprehension dataset for curriculum learning , author=. Asian Conference on Machine Learning , pages=. 2019 , organization=

  62. [62]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

    Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Utilization , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

  63. [63]

    Language Assessment Quarterly , volume=

    Predicting the difficulty of EFL tests based on corpus linguistic features and expert judgment , author=. Language Assessment Quarterly , volume=. 2020 , publisher=

  64. [64]

    Psychiatry research , volume=

    Detecting formal thought disorder by deep contextualized word representations , author=. Psychiatry research , volume=. 2021 , publisher=

  65. [65]

    Efficient Estimation of Word Representations in Vector Space

    Efficient estimation of word representations in vector space , author=. arXiv preprint arXiv:1301.3781 , year=

  66. [66]

    International conference on artificial intelligence in education , pages=

    Introducing a framework to assess newly created questions with natural language processing , author=. International conference on artificial intelligence in education , pages=. 2020 , organization=

  67. [67]

    International Journal of Artificial Intelligence in Education , volume=

    Predicting the difficulty of exercise items for dynamic difficulty adaptation in adaptive language tutoring , author=. International Journal of Artificial Intelligence in Education , volume=. 2019 , publisher=

  68. [68]

    Information Processing & Management , volume=

    Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques , author=. Information Processing & Management , volume=. 2018 , publisher=

  69. [69]

    2020 international conference on communications, information system and computer engineering (cisce) , pages=

    Multi-task BERT for problem difficulty prediction , author=. 2020 international conference on communications, information system and computer engineering (cisce) , pages=. 2020 , organization=

  70. [70]

    Proceedings of the 16th workshop on innovative use of NLP for building educational applications , pages=

    On the application of transformers for estimating the difficulty of multiple-choice questions from text , author=. Proceedings of the 16th workshop on innovative use of NLP for building educational applications , pages=

  71. [71]

    arXiv preprint arXiv:2502.20663 , year=

    Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository , author=. arXiv preprint arXiv:2502.20663 , year=

  72. [72]

    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  73. [73]

    arXiv preprint arXiv:2306.13047 , year=

    Analysis of the cambridge multiple-choice questions reading dataset with a focus on candidate response distribution , author=. arXiv preprint arXiv:2306.13047 , year=

  74. [74]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

    Adaption-of-thought: Learning question difficulty improves large language models for reasoning , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

  75. [75]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Squad: 100,000+ questions for machine comprehension of text , author=. arXiv preprint arXiv:1606.05250 , year=

  76. [76]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Ms marco: A human generated machine reading comprehension dataset , author=. arXiv preprint arXiv:1611.09268 , year=

  77. [77]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions , author=. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages=

  78. [78]

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

  79. [79]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Question Difficulty Prediction for READING Problems in Standard Tests , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  80. [80]

    2023 , publisher=

    The cambridge multiple-choice questions reading dataset , author=. 2023 , publisher=

Showing first 80 references.