A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

Gary Geunbae Lee; Hyounghun Kim; Jun Seo; Seonjeong Hwang

arxiv: 2605.19316 · v1 · pith:IANXT52Mnew · submitted 2026-05-19 · 💻 cs.CL

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

Seonjeong Hwang , Jun Seo , Hyounghun Kim , Gary Geunbae Lee This is my paper

Pith reviewed 2026-05-20 05:53 UTC · model grok-4.3

classification 💻 cs.CL

keywords reading comprehensionitem generationdifficulty controlmulti-agent frameworklarge language modelsfeature constraintsiterative revisiontest generation

0 comments

The pith

Multi-agent LLM setup with feature evaluators generates reading comprehension items that match target difficulty constraints far more reliably than single-agent prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MAFIG, a multi-agent framework in which several large language model agents collaborate with dedicated evaluators to create and revise reading comprehension test items until they satisfy chosen feature constraints that control difficulty. Single-agent prompting often produces items that drift away from the intended features, so the new method uses iterative feedback loops where evaluators check specific aspects and guide corrections. The authors also present a way to order the constraint sets so difficulty rises steadily across a sequence of generated items. This matters for building fair assessments and adaptive learning tools that need predictable difficulty levels rather than random variation. Experiments show the multi-agent approach meets constraints at higher rates than prior methods.

Core claim

MAFIG coordinates multiple LLM agents and feature-specific evaluators that work together to generate items and then iteratively revise them based on whether they satisfy the intended constraints. A separate construction method produces a difficulty-calibrated sequence of feature constraint sets that results in items with monotonically increasing difficulty. The framework is shown to achieve significantly higher rates of adherence to target constraints than single-agent baselines.

What carries the argument

MAFIG, the multi-agent framework in which LLM agents generate candidate items while feature-specific evaluators assess compliance and direct targeted revisions to enforce the constraints.

If this is right

Items adhere to target feature constraints at significantly higher rates than single-agent baselines.
A difficulty-calibrated sequence of constraint sets produces items with monotonically increasing difficulty.
Iterative revision guided by evaluator feedback improves overall constraint satisfaction in generated reading items.
The approach enables more consistent control over difficulty levels for reading comprehension tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-agent pattern with domain-specific evaluators could be applied to item generation in other subjects such as mathematics or science.
Large testing organizations could integrate the framework into pipelines to reduce manual quality checks for new items.
If constraint adherence improves, the generated items might yield more accurate difficulty estimates when used in adaptive testing systems.

Load-bearing premise

Feature-specific evaluators can accurately and consistently judge whether generated items satisfy the intended constraints without systematic errors or the need for external human checks.

What would settle it

A human review of several hundred items generated under MAFIG that finds the actual rate of constraint satisfaction is no better than the rate achieved by single-agent baselines.

Figures

Figures reproduced from arXiv: 2605.19316 by Gary Geunbae Lee, Hyounghun Kim, Jun Seo, Seonjeong Hwang.

**Figure 2.** Figure 2: Overview of the MAFIG generation pipeline. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Distribution of Difficulty Alignment Scores assigned by human evaluators. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Constraint satisfaction performance under [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation results showing the effect of Plan [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 7.** Figure 7: Screenshots of instructions provided to human [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Feature-wise ARs for GPT-5 (Direct Prompting) and Qwen3-32B-based [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Visualization of feature-wise satisfaction across rounds for examples that (a) succeed or (b) fail to achieve [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Expert-perceived difficulty factors for each [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Prompt template for the Drafter agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Prompt template for the Planner agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

**Figure 13.** Figure 13: Prompt template for the Editor agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

**Figure 14.** Figure 14: Prompt template for the Reworder agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗

**Figure 15.** Figure 15: Prompt template for the Refiner agent used in the passage generation stage. {placeholder} indicates a slot to be filled with the corresponding value [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗

read the original abstract

Recent studies in difficulty-controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty-related features. However, existing methods typically rely on a single-agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi-agent Framework for Feature-constrained Item Generation, where multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty-calibrated constraint sequence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MAFIG adds a multi-agent loop with feature evaluators and a constraint sequencing trick to get better adherence in difficulty-controlled reading item generation, but the gains rest on uncalibrated LLM judges.

read the letter

The core move here is replacing single-agent prompting with a team of LLM agents plus separate feature-specific evaluators that iterate until constraints are met, plus a method for ordering constraint sets so difficulty rises monotonically. That architecture is the actual addition over the cited prior work on single-agent approaches. It targets a real pain point in edtech item writing where generated questions drift from the intended lexical, syntactic, or semantic features that control difficulty. The sequencing idea is a clean engineering step that could be reused even outside this exact setup. On the positive side, the framework is described clearly enough that someone could implement the agent roles and evaluator prompts without too much guesswork. The abstract claims significantly higher adherence than baselines, which would be useful if it holds. The soft spot is exactly where the stress-test points: the adherence numbers come from the same class of LLM evaluators that are supposed to be judging the items. Without reported human calibration, inter-annotator agreement, or even basic error analysis on the evaluators themselves, the performance gap could be inflated by systematic bias in the judges. The abstract gives no metrics, dataset details, or statistical tests, so the central empirical claim stays hard to assess from the available text. If the full paper only shows LLM-vs-LLM comparisons without external validation, the difficulty-control results become circular. This is the kind of paper that belongs in an applied NLP or educational technology venue rather than a core AI conference. Readers working on constrained generation or automated assessment tools could pick up the multi-agent pattern and the sequencing method and test them on their own data. It is worth sending to peer review because the problem is concrete, the proposed fix is straightforward to replicate, and a referee can push for the missing human validation and clearer reporting. The work does not need to be groundbreaking to be worth referee time; it just needs the evidence to match the claim.

Referee Report

1 major / 1 minor

Summary. The paper introduces MAFIG, a multi-agent LLM framework in which specialized agents and feature-specific evaluators collaborate to generate and iteratively revise reading comprehension items so that they satisfy explicit lexical, syntactic, and semantic constraints tied to target difficulty. It further proposes a procedure for constructing a monotonic sequence of constraint sets that is claimed to produce items of steadily increasing difficulty. Experiments are reported to show substantially higher constraint-adherence rates for MAFIG than for single-agent baselines.

Significance. If the empirical claims hold after proper validation of the evaluators, the work would supply a practical engineering advance for automated item generation in language testing: a multi-agent loop that demonstrably improves adherence to multi-dimensional feature constraints without requiring direct difficulty-parameter tuning. The difficulty-calibrated constraint-sequence construction is a reusable methodological contribution that could be adopted by other generation pipelines.

major comments (1)

[Abstract / Experimental Results] Abstract and Experimental Results section: the central claim that MAFIG achieves 'significantly higher' constraint adherence and 'robust difficulty control' rests entirely on the judgments of the feature-specific evaluators, yet the manuscript supplies no information on how these evaluators were implemented, whether they are LLM-based, how they were calibrated, or any human-validation or inter-annotator-agreement statistics. Because the reported performance gap and the monotonic-difficulty assertion both depend on these unvalidated judgments, the empirical support for the primary contribution is unverifiable from the given text.

minor comments (1)

[Abstract] The abstract states that items 'adhere to target constraints at a significantly higher rate' but does not name the statistical test, effect size, or number of items per condition; these details should be added for reproducibility.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We appreciate the referee's thorough review and constructive criticism. We address the major comment regarding the description and validation of the feature-specific evaluators in detail below.

read point-by-point responses

Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: the central claim that MAFIG achieves 'significantly higher' constraint adherence and 'robust difficulty control' rests entirely on the judgments of the feature-specific evaluators, yet the manuscript supplies no information on how these evaluators were implemented, whether they are LLM-based, how they were calibrated, or any human-validation or inter-annotator-agreement statistics. Because the reported performance gap and the monotonic-difficulty assertion both depend on these unvalidated judgments, the empirical support for the primary contribution is unverifiable from the given text.

Authors: We agree that the current version of the manuscript does not provide adequate details on the feature-specific evaluators, which is a shortcoming. In the revised manuscript, we will include a new subsection titled 'Evaluator Implementation' in the Experimental Setup. This will specify that the evaluators are LLM-based, using GPT-4 with carefully designed prompts for each constraint type (lexical, syntactic, semantic). The calibration involved iterative prompt engineering on a held-out development set of 200 items to achieve high agreement with manual annotations on that set. We will also report the prompt templates used. However, a full-scale human validation study with multiple annotators and IAA statistics was not conducted for the main experiments, as the evaluators were intended as automated proxies. We will explicitly state this limitation and its implications for the claims in the revised paper. revision: partial

standing simulated objections not resolved

Full human validation and inter-annotator agreement statistics for the evaluators were not performed in the study.

Circularity Check

0 steps flagged

No circularity: empirical engineering contribution with independent experimental validation

full rationale

The paper introduces MAFIG as a multi-agent LLM framework for generating reading comprehension items under feature constraints and proposes a method to build monotonically increasing difficulty sequences. Experimental results compare adherence rates against baselines. No equations, fitted parameters, or derivations appear in the provided abstract or description. The central claim rests on empirical comparison rather than any self-definitional reduction, fitted-input prediction, or load-bearing self-citation chain. The framework and constraint-sequence construction are presented as engineering proposals whose validity is tested externally via generated items and evaluator judgments, without reducing to tautological inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions about LLM text generation capabilities and the measurability of difficulty features; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption LLMs can generate coherent reading comprehension items when appropriately prompted and revised.
Core premise enabling the use of LLMs for item generation and iterative refinement.

pith-pipeline@v0.9.0 · 5682 in / 1075 out tokens · 31772 ms · 2026-05-20T05:53:50.091514+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MAFIG ... multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints ... difficulty-calibrated constraint sequence
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat induction and embed_strictMono unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

DAS(Qi, Qj) ... monotonic increase in difficulty

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 10 internal anchors

[1]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

G-eval: NLG evaluation using gpt-4 with better human alignment , author=. arXiv preprint arXiv:2303.16634 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

arXiv preprint arXiv:2210.07197 , year=

Towards a unified multi-dimensional evaluator for text generation , author=. arXiv preprint arXiv:2210.07197 , year=

work page arXiv
[3]

Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

Multi-agent collaboration: Harnessing the power of intelligent llm agents , author=. arXiv preprint arXiv:2306.03314 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Advances in Neural Information Processing Systems , volume=

Self-refine: Iterative refinement with self-feedback , author=. Advances in Neural Information Processing Systems , volume=

work page
[5]

Advances in neural information processing systems , volume=

Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=

work page
[6]

arXiv preprint arXiv:2409.14371 , year=

The ability of large language models to evaluate constraint-satisfaction in agent responses to open-ended requests , author=. arXiv preprint arXiv:2409.14371 , year=

work page arXiv
[7]

arXiv preprint arXiv:2503.08669 , year=

Agentorca: A dual-system framework to evaluate language agents on operational routine and constraint adherence , author=. arXiv preprint arXiv:2503.08669 , year=

work page arXiv
[8]

Metal: A multi- agent framework for chart generation with test-time scaling,

Metal: A multi-agent framework for chart generation with test-time scaling , author=. arXiv preprint arXiv:2502.17651 , year=

work page arXiv
[9]

Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

Zero-Shot Strategies for Length-Controllable Summarization , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

work page 2025
[10]

arXiv preprint arXiv:2411.12460 , year=

Exploring Iterative Controllable Summarization with Large Language Models , author=. arXiv preprint arXiv:2411.12460 , year=

work page arXiv
[11]

The eleventh international conference on learning representations , year=

React: Synergizing reasoning and acting in language models , author=. The eleventh international conference on learning representations , year=

work page
[12]

Advances in Neural Information Processing Systems , volume=

Reflexion: Language agents with verbal reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[13]

Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

Evaluating reading comprehension exercises generated by LLMs: A showcase of ChatGPT in education applications , author=. Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

work page 2023
[14]

Education and Information Technologies , volume=

Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in english education , author=. Education and Information Technologies , volume=. 2024 , publisher=

work page 2024
[15]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

Automatic multiple-choice question generation and evaluation systems based on LLM: A study case with university resolutions , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

work page
[16]

Educational Measurement: Issues and Practice , volume=

Using OpenAI GPT to generate reading comprehension items , author=. Educational Measurement: Issues and Practice , volume=. 2024 , publisher=

work page 2024
[17]

Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI)@ LREC-COLING 2024 , pages=

Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models , author=. Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI)@ LREC-COLING 2024 , pages=

work page 2024
[18]

Advances in Neural Information Processing Systems , volume=

Large language models are semi-parametric reinforcement learning agents , author=. Advances in Neural Information Processing Systems , volume=

work page
[19]

International Conference on Artificial Intelligence in Education , pages=

How useful are educational questions generated by large language models? , author=. International Conference on Artificial Intelligence in Education , pages=. 2023 , organization=

work page 2023
[20]

Difficulty Controllable Generation of Reading Comprehension Questions

Difficulty controllable generation of reading comprehension questions , author=. arXiv preprint arXiv:1807.03586 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Findings of the Association for Computational Linguistics ACL 2024 , pages=

Planning first, question second: An llm-guided method for controllable question generation , author=. Findings of the Association for Computational Linguistics ACL 2024 , pages=

work page 2024
[22]

Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

Difficulty-controllable neural question generation for reading comprehension using item response theory , author=. Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

work page 2023
[23]

IEEE Transactions on Learning Technologies , year=

Adaptive question--answer generation with difficulty control using Item Response Theory and pre-trained transformer models , author=. IEEE Transactions on Learning Technologies , year=

work page
[24]

arXiv preprint arXiv:2510.19265 , year=

Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization , author=. arXiv preprint arXiv:2510.19265 , year=

work page arXiv
[25]

Computers and Education: Artificial Intelligence , volume=

Automated reading passage generation with OpenAI's large language model , author=. Computers and Education: Artificial Intelligence , volume=. 2023 , publisher=

work page 2023
[26]

International Conference on Artificial Intelligence in Education , pages=

Difficulty-controllable multiple-choice question generation for reading comprehension using item response theory , author=. International Conference on Artificial Intelligence in Education , pages=. 2024 , organization=

work page 2024
[27]

International Conference on Computers in Education , year=

Difficulty-controllable reading comprehension question generation considering the difficulty of reading passages , author=. International Conference on Computers in Education , year=

work page
[28]

arXiv preprint arXiv:2505.07618 , year=

KAQG: A Knowledge-Graph-Enhanced RAG for Difficulty-Controlled Question Generation , author=. arXiv preprint arXiv:2505.07618 , year=

work page arXiv
[29]

arXiv preprint arXiv:2504.14232 , year=

Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment , author=. arXiv preprint arXiv:2504.14232 , year=

work page arXiv
[30]

International Conference on Artificial Intelligence in Education , pages=

Towards automated multiple choice question generation and evaluation: aligning with Bloom’s taxonomy , author=. International Conference on Artificial Intelligence in Education , pages=. 2024 , organization=

work page 2024
[31]

IEEE Access , volume=

MCQGen: A large language model-driven MCQ generator for personalized learning , author=. IEEE Access , volume=. 2024 , publisher=

work page 2024
[32]

Information Processing & Management , volume=

Difficulty-controllable question generation over knowledge graphs: A counterfactual reasoning approach , author=. Information Processing & Management , volume=. 2024 , publisher=

work page 2024
[33]

International Conference on Artificial Intelligence in Education , pages=

Systematic Control of Multiple-Choice Item Difficulty Through LLM-Based Distractor Generation , author=. International Conference on Artificial Intelligence in Education , pages=. 2025 , organization=

work page 2025
[34]

International Journal of Artificial Intelligence in Education , volume=

Text-based question difficulty prediction: A systematic review of automatic approaches , author=. International Journal of Artificial Intelligence in Education , volume=. 2024 , publisher=

work page 2024
[35]

International Conference on Human-Computer Interaction , pages=

Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty , author=. International Conference on Human-Computer Interaction , pages=. 2025 , organization=

work page 2025
[36]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

TEEMIL: Towards Educational MCQ Difficulty Estimation in Indic Languages , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

work page
[37]

arXiv preprint arXiv:2404.13343 , year=

UnibucLLM: Harnessing LLMs for automated prediction of item difficulty and response time for multiple-choice questions , author=. arXiv preprint arXiv:2404.13343 , year=

work page arXiv
[38]

arXiv preprint arXiv:2404.10704 , year=

Question difficulty ranking for multiple-choice reading comprehension , author=. arXiv preprint arXiv:2404.10704 , year=

work page arXiv
[39]

Proceedings of the Eleventh ACM Conference on Learning@ Scale , pages=

Generative students: Using llm-simulated student profiles to support question item evaluation , author=. Proceedings of the Eleventh ACM Conference on Learning@ Scale , pages=

work page
[40]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

Large language models are students at various levels: Zero-shot question difficulty estimation , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

work page 2024
[41]

Proceedings of the 17th International Conference on Educational Data Mining , pages=

How hard can this question be? An exploratory analysis of features assessing question difficulty using LLMs , author=. Proceedings of the 17th International Conference on Educational Data Mining , pages=

work page
[42]

Applied Measurement in Education , volume=

Response demands of reading comprehension test items: A review of item difficulty modeling studies , author=. Applied Measurement in Education , volume=. 2022 , publisher=

work page 2022
[43]

2014 ieee international conference on software science, technology and engineering , pages=

Learning methods for rating the difficulty of reading comprehension questions , author=. 2014 ieee international conference on software science, technology and engineering , pages=. 2014 , organization=

work page 2014
[44]

Cogent Education , volume=

Analysis of IELTS and TOEFL reading and listening tests in terms of Revised Bloom’s Taxonomy , author=. Cogent Education , volume=. 2020 , publisher=

work page 2020
[45]

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items? , author=. arXiv preprint arXiv:2510.25064 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

European conference on information retrieval , pages=

A novel multi-stage prompting approach for language agnostic mcq generation using gpt , author=. European conference on information retrieval , pages=. 2024 , organization=

work page 2024
[47]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

work page 2019
[48]

and Jones, Ronald W

Hambleton, Ronald K. and Jones, Ronald W. , title =. Educational Measurement: Issues and Practice , volume =. 1993 , doi =

work page 1993
[49]

Lord , title =

Frederic M. Lord , title =. 1980 , publisher =

work page 1980
[50]

Studies in English Education , volume=

Prediction of item difficulty on a reading comprehension test , author=. Studies in English Education , volume=. 2012 , publisher=

work page 2012
[51]

2014 , publisher=

Automated evaluation of text and discourse with Coh-Metrix , author=. 2014 , publisher=

work page 2014
[52]

, author=

Children's comprehension of between-and within-sentence syntactic structures. , author=. Journal of educational psychology , volume=. 1970 , publisher=

work page 1970
[53]

Review of educational research , volume=

How to construct achievement tests to assess comprehension , author=. Review of educational research , volume=. 1972 , publisher=

work page 1972
[54]

Applied psychological measurement , volume=

Component latent trait models for paragraph comprehension tests , author=. Applied psychological measurement , volume=. 1987 , publisher=

work page 1987
[55]

ETS Research Report Series , volume=

The prediction of GRE reading comprehension item difficulty for expository prose passages for each of three item types: Main ideas, inferences and explicit statements , author=. ETS Research Report Series , volume=. 1991 , publisher=

work page 1991
[56]

Foreign Language Annals , volume=

Comparison of L2 listening and reading comprehension by university students learning English in Korea , author=. Foreign Language Annals , volume=. 2004 , publisher=

work page 2004
[57]

Asian-Pacific Journal of Second and Foreign Language Education , volume=

Predicting the difficulty of EFL reading comprehension tests based on linguistic indices , author=. Asian-Pacific Journal of Second and Foreign Language Education , volume=. 2023 , publisher=

work page 2023
[58]

Handbook 1: Cognitive domain , author=

Taxonomy of educational objectives: The classification of educational goals. Handbook 1: Cognitive domain , author=. 1956 , publisher=

work page 1956
[59]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

LLMs meet Bloom’s Taxonomy: A Cognitive View on Large Language Model Evaluations , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

work page
[60]

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pages=

RACE: Large-scale ReAding Comprehension Dataset From Examinations , author=. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2017
[61]

Asian Conference on Machine Learning , pages=

A new multi-choice reading comprehension dataset for curriculum learning , author=. Asian Conference on Machine Learning , pages=. 2019 , organization=

work page 2019
[62]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Utilization , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024
[63]

Language Assessment Quarterly , volume=

Predicting the difficulty of EFL tests based on corpus linguistic features and expert judgment , author=. Language Assessment Quarterly , volume=. 2020 , publisher=

work page 2020
[64]

Psychiatry research , volume=

Detecting formal thought disorder by deep contextualized word representations , author=. Psychiatry research , volume=. 2021 , publisher=

work page 2021
[65]

Efficient Estimation of Word Representations in Vector Space

Efficient estimation of word representations in vector space , author=. arXiv preprint arXiv:1301.3781 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[66]

International conference on artificial intelligence in education , pages=

Introducing a framework to assess newly created questions with natural language processing , author=. International conference on artificial intelligence in education , pages=. 2020 , organization=

work page 2020
[67]

International Journal of Artificial Intelligence in Education , volume=

Predicting the difficulty of exercise items for dynamic difficulty adaptation in adaptive language tutoring , author=. International Journal of Artificial Intelligence in Education , volume=. 2019 , publisher=

work page 2019
[68]

Information Processing & Management , volume=

Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques , author=. Information Processing & Management , volume=. 2018 , publisher=

work page 2018
[69]

2020 international conference on communications, information system and computer engineering (cisce) , pages=

Multi-task BERT for problem difficulty prediction , author=. 2020 international conference on communications, information system and computer engineering (cisce) , pages=. 2020 , organization=

work page 2020
[70]

Proceedings of the 16th workshop on innovative use of NLP for building educational applications , pages=

On the application of transformers for estimating the difficulty of multiple-choice questions from text , author=. Proceedings of the 16th workshop on innovative use of NLP for building educational applications , pages=

work page
[71]

arXiv preprint arXiv:2502.20663 , year=

Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository , author=. arXiv preprint arXiv:2502.20663 , year=

work page arXiv
[72]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[73]

arXiv preprint arXiv:2306.13047 , year=

Analysis of the cambridge multiple-choice questions reading dataset with a focus on candidate response distribution , author=. arXiv preprint arXiv:2306.13047 , year=

work page arXiv
[74]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Adaption-of-thought: Learning question difficulty improves large language models for reasoning , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024
[75]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Squad: 100,000+ questions for machine comprehension of text , author=. arXiv preprint arXiv:1606.05250 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[76]

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Ms marco: A human generated machine reading comprehension dataset , author=. arXiv preprint arXiv:1611.09268 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[77]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions , author=. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages=

work page 2019
[78]

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2018
[79]

Proceedings of the AAAI conference on artificial intelligence , volume=

Question Difficulty Prediction for READING Problems in Standard Tests , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page
[80]

2023 , publisher=

The cambridge multiple-choice questions reading dataset , author=. 2023 , publisher=

work page 2023

Showing first 80 references.

[1] [1]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

G-eval: NLG evaluation using gpt-4 with better human alignment , author=. arXiv preprint arXiv:2303.16634 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

arXiv preprint arXiv:2210.07197 , year=

Towards a unified multi-dimensional evaluator for text generation , author=. arXiv preprint arXiv:2210.07197 , year=

work page arXiv

[3] [3]

Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

Multi-agent collaboration: Harnessing the power of intelligent llm agents , author=. arXiv preprint arXiv:2306.03314 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Advances in Neural Information Processing Systems , volume=

Self-refine: Iterative refinement with self-feedback , author=. Advances in Neural Information Processing Systems , volume=

work page

[5] [5]

Advances in neural information processing systems , volume=

Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=

work page

[6] [6]

arXiv preprint arXiv:2409.14371 , year=

The ability of large language models to evaluate constraint-satisfaction in agent responses to open-ended requests , author=. arXiv preprint arXiv:2409.14371 , year=

work page arXiv

[7] [7]

arXiv preprint arXiv:2503.08669 , year=

Agentorca: A dual-system framework to evaluate language agents on operational routine and constraint adherence , author=. arXiv preprint arXiv:2503.08669 , year=

work page arXiv

[8] [8]

Metal: A multi- agent framework for chart generation with test-time scaling,

Metal: A multi-agent framework for chart generation with test-time scaling , author=. arXiv preprint arXiv:2502.17651 , year=

work page arXiv

[9] [9]

Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

Zero-Shot Strategies for Length-Controllable Summarization , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

work page 2025

[10] [10]

arXiv preprint arXiv:2411.12460 , year=

Exploring Iterative Controllable Summarization with Large Language Models , author=. arXiv preprint arXiv:2411.12460 , year=

work page arXiv

[11] [11]

The eleventh international conference on learning representations , year=

React: Synergizing reasoning and acting in language models , author=. The eleventh international conference on learning representations , year=

work page

[12] [12]

Advances in Neural Information Processing Systems , volume=

Reflexion: Language agents with verbal reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

work page

[13] [13]

Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

Evaluating reading comprehension exercises generated by LLMs: A showcase of ChatGPT in education applications , author=. Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

work page 2023

[14] [14]

Education and Information Technologies , volume=

Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in english education , author=. Education and Information Technologies , volume=. 2024 , publisher=

work page 2024

[15] [15]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

Automatic multiple-choice question generation and evaluation systems based on LLM: A study case with university resolutions , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

work page

[16] [16]

Educational Measurement: Issues and Practice , volume=

Using OpenAI GPT to generate reading comprehension items , author=. Educational Measurement: Issues and Practice , volume=. 2024 , publisher=

work page 2024

[17] [17]

Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI)@ LREC-COLING 2024 , pages=

Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models , author=. Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI)@ LREC-COLING 2024 , pages=

work page 2024

[18] [18]

Advances in Neural Information Processing Systems , volume=

Large language models are semi-parametric reinforcement learning agents , author=. Advances in Neural Information Processing Systems , volume=

work page

[19] [19]

International Conference on Artificial Intelligence in Education , pages=

How useful are educational questions generated by large language models? , author=. International Conference on Artificial Intelligence in Education , pages=. 2023 , organization=

work page 2023

[20] [20]

Difficulty Controllable Generation of Reading Comprehension Questions

Difficulty controllable generation of reading comprehension questions , author=. arXiv preprint arXiv:1807.03586 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Findings of the Association for Computational Linguistics ACL 2024 , pages=

Planning first, question second: An llm-guided method for controllable question generation , author=. Findings of the Association for Computational Linguistics ACL 2024 , pages=

work page 2024

[22] [22]

Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

Difficulty-controllable neural question generation for reading comprehension using item response theory , author=. Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

work page 2023

[23] [23]

IEEE Transactions on Learning Technologies , year=

Adaptive question--answer generation with difficulty control using Item Response Theory and pre-trained transformer models , author=. IEEE Transactions on Learning Technologies , year=

work page

[24] [24]

arXiv preprint arXiv:2510.19265 , year=

Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization , author=. arXiv preprint arXiv:2510.19265 , year=

work page arXiv

[25] [25]

Computers and Education: Artificial Intelligence , volume=

Automated reading passage generation with OpenAI's large language model , author=. Computers and Education: Artificial Intelligence , volume=. 2023 , publisher=

work page 2023

[26] [26]

International Conference on Artificial Intelligence in Education , pages=

Difficulty-controllable multiple-choice question generation for reading comprehension using item response theory , author=. International Conference on Artificial Intelligence in Education , pages=. 2024 , organization=

work page 2024

[27] [27]

International Conference on Computers in Education , year=

Difficulty-controllable reading comprehension question generation considering the difficulty of reading passages , author=. International Conference on Computers in Education , year=

work page

[28] [28]

arXiv preprint arXiv:2505.07618 , year=

KAQG: A Knowledge-Graph-Enhanced RAG for Difficulty-Controlled Question Generation , author=. arXiv preprint arXiv:2505.07618 , year=

work page arXiv

[29] [29]

arXiv preprint arXiv:2504.14232 , year=

Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment , author=. arXiv preprint arXiv:2504.14232 , year=

work page arXiv

[30] [30]

International Conference on Artificial Intelligence in Education , pages=

Towards automated multiple choice question generation and evaluation: aligning with Bloom’s taxonomy , author=. International Conference on Artificial Intelligence in Education , pages=. 2024 , organization=

work page 2024

[31] [31]

IEEE Access , volume=

MCQGen: A large language model-driven MCQ generator for personalized learning , author=. IEEE Access , volume=. 2024 , publisher=

work page 2024

[32] [32]

Information Processing & Management , volume=

Difficulty-controllable question generation over knowledge graphs: A counterfactual reasoning approach , author=. Information Processing & Management , volume=. 2024 , publisher=

work page 2024

[33] [33]

International Conference on Artificial Intelligence in Education , pages=

Systematic Control of Multiple-Choice Item Difficulty Through LLM-Based Distractor Generation , author=. International Conference on Artificial Intelligence in Education , pages=. 2025 , organization=

work page 2025

[34] [34]

International Journal of Artificial Intelligence in Education , volume=

Text-based question difficulty prediction: A systematic review of automatic approaches , author=. International Journal of Artificial Intelligence in Education , volume=. 2024 , publisher=

work page 2024

[35] [35]

International Conference on Human-Computer Interaction , pages=

Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty , author=. International Conference on Human-Computer Interaction , pages=. 2025 , organization=

work page 2025

[36] [36]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

TEEMIL: Towards Educational MCQ Difficulty Estimation in Indic Languages , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

work page

[37] [37]

arXiv preprint arXiv:2404.13343 , year=

UnibucLLM: Harnessing LLMs for automated prediction of item difficulty and response time for multiple-choice questions , author=. arXiv preprint arXiv:2404.13343 , year=

work page arXiv

[38] [38]

arXiv preprint arXiv:2404.10704 , year=

Question difficulty ranking for multiple-choice reading comprehension , author=. arXiv preprint arXiv:2404.10704 , year=

work page arXiv

[39] [39]

Proceedings of the Eleventh ACM Conference on Learning@ Scale , pages=

Generative students: Using llm-simulated student profiles to support question item evaluation , author=. Proceedings of the Eleventh ACM Conference on Learning@ Scale , pages=

work page

[40] [40]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

Large language models are students at various levels: Zero-shot question difficulty estimation , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

work page 2024

[41] [41]

Proceedings of the 17th International Conference on Educational Data Mining , pages=

How hard can this question be? An exploratory analysis of features assessing question difficulty using LLMs , author=. Proceedings of the 17th International Conference on Educational Data Mining , pages=

work page

[42] [42]

Applied Measurement in Education , volume=

Response demands of reading comprehension test items: A review of item difficulty modeling studies , author=. Applied Measurement in Education , volume=. 2022 , publisher=

work page 2022

[43] [43]

2014 ieee international conference on software science, technology and engineering , pages=

Learning methods for rating the difficulty of reading comprehension questions , author=. 2014 ieee international conference on software science, technology and engineering , pages=. 2014 , organization=

work page 2014

[44] [44]

Cogent Education , volume=

Analysis of IELTS and TOEFL reading and listening tests in terms of Revised Bloom’s Taxonomy , author=. Cogent Education , volume=. 2020 , publisher=

work page 2020

[45] [45]

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items? , author=. arXiv preprint arXiv:2510.25064 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

European conference on information retrieval , pages=

A novel multi-stage prompting approach for language agnostic mcq generation using gpt , author=. European conference on information retrieval , pages=. 2024 , organization=

work page 2024

[47] [47]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

work page 2019

[48] [48]

and Jones, Ronald W

Hambleton, Ronald K. and Jones, Ronald W. , title =. Educational Measurement: Issues and Practice , volume =. 1993 , doi =

work page 1993

[49] [49]

Lord , title =

Frederic M. Lord , title =. 1980 , publisher =

work page 1980

[50] [50]

Studies in English Education , volume=

Prediction of item difficulty on a reading comprehension test , author=. Studies in English Education , volume=. 2012 , publisher=

work page 2012

[51] [51]

2014 , publisher=

Automated evaluation of text and discourse with Coh-Metrix , author=. 2014 , publisher=

work page 2014

[52] [52]

, author=

Children's comprehension of between-and within-sentence syntactic structures. , author=. Journal of educational psychology , volume=. 1970 , publisher=

work page 1970

[53] [53]

Review of educational research , volume=

How to construct achievement tests to assess comprehension , author=. Review of educational research , volume=. 1972 , publisher=

work page 1972

[54] [54]

Applied psychological measurement , volume=

Component latent trait models for paragraph comprehension tests , author=. Applied psychological measurement , volume=. 1987 , publisher=

work page 1987

[55] [55]

ETS Research Report Series , volume=

The prediction of GRE reading comprehension item difficulty for expository prose passages for each of three item types: Main ideas, inferences and explicit statements , author=. ETS Research Report Series , volume=. 1991 , publisher=

work page 1991

[56] [56]

Foreign Language Annals , volume=

Comparison of L2 listening and reading comprehension by university students learning English in Korea , author=. Foreign Language Annals , volume=. 2004 , publisher=

work page 2004

[57] [57]

Asian-Pacific Journal of Second and Foreign Language Education , volume=

Predicting the difficulty of EFL reading comprehension tests based on linguistic indices , author=. Asian-Pacific Journal of Second and Foreign Language Education , volume=. 2023 , publisher=

work page 2023

[58] [58]

Handbook 1: Cognitive domain , author=

Taxonomy of educational objectives: The classification of educational goals. Handbook 1: Cognitive domain , author=. 1956 , publisher=

work page 1956

[59] [59]

Proceedings of the 31st International Conference on Computational Linguistics , pages=

LLMs meet Bloom’s Taxonomy: A Cognitive View on Large Language Model Evaluations , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

work page

[60] [60]

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pages=

RACE: Large-scale ReAding Comprehension Dataset From Examinations , author=. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2017

[61] [61]

Asian Conference on Machine Learning , pages=

A new multi-choice reading comprehension dataset for curriculum learning , author=. Asian Conference on Machine Learning , pages=. 2019 , organization=

work page 2019

[62] [62]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Utilization , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024

[63] [63]

Language Assessment Quarterly , volume=

Predicting the difficulty of EFL tests based on corpus linguistic features and expert judgment , author=. Language Assessment Quarterly , volume=. 2020 , publisher=

work page 2020

[64] [64]

Psychiatry research , volume=

Detecting formal thought disorder by deep contextualized word representations , author=. Psychiatry research , volume=. 2021 , publisher=

work page 2021

[65] [65]

Efficient Estimation of Word Representations in Vector Space

Efficient estimation of word representations in vector space , author=. arXiv preprint arXiv:1301.3781 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[66] [66]

International conference on artificial intelligence in education , pages=

Introducing a framework to assess newly created questions with natural language processing , author=. International conference on artificial intelligence in education , pages=. 2020 , organization=

work page 2020

[67] [67]

International Journal of Artificial Intelligence in Education , volume=

Predicting the difficulty of exercise items for dynamic difficulty adaptation in adaptive language tutoring , author=. International Journal of Artificial Intelligence in Education , volume=. 2019 , publisher=

work page 2019

[68] [68]

Information Processing & Management , volume=

Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques , author=. Information Processing & Management , volume=. 2018 , publisher=

work page 2018

[69] [69]

2020 international conference on communications, information system and computer engineering (cisce) , pages=

Multi-task BERT for problem difficulty prediction , author=. 2020 international conference on communications, information system and computer engineering (cisce) , pages=. 2020 , organization=

work page 2020

[70] [70]

Proceedings of the 16th workshop on innovative use of NLP for building educational applications , pages=

On the application of transformers for estimating the difficulty of multiple-choice questions from text , author=. Proceedings of the 16th workshop on innovative use of NLP for building educational applications , pages=

work page

[71] [71]

arXiv preprint arXiv:2502.20663 , year=

Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository , author=. arXiv preprint arXiv:2502.20663 , year=

work page arXiv

[72] [72]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[73] [73]

arXiv preprint arXiv:2306.13047 , year=

Analysis of the cambridge multiple-choice questions reading dataset with a focus on candidate response distribution , author=. arXiv preprint arXiv:2306.13047 , year=

work page arXiv

[74] [74]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Adaption-of-thought: Learning question difficulty improves large language models for reasoning , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024

[75] [75]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Squad: 100,000+ questions for machine comprehension of text , author=. arXiv preprint arXiv:1606.05250 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[76] [76]

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Ms marco: A human generated machine reading comprehension dataset , author=. arXiv preprint arXiv:1611.09268 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[77] [77]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions , author=. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages=

work page 2019

[78] [78]

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2018

[79] [79]

Proceedings of the AAAI conference on artificial intelligence , volume=

Question Difficulty Prediction for READING Problems in Standard Tests , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page

[80] [80]

2023 , publisher=

The cambridge multiple-choice questions reading dataset , author=. 2023 , publisher=

work page 2023