Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?
Pith reviewed 2026-05-21 20:30 UTC · model grok-4.3
The pith
Large language models can approximate the cognitive complexity of reading comprehension items by estimating evidence needs and reasoning transformations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models can approximate the cognitive complexity of reading comprehension items along two dimensions, Evidence Scope and Transformation Level. These dimensions reflect the cognitive burden of the reasoning needed to find the answer. The results show that LLM estimates align with human annotations, supporting their use for pre-administration difficulty analysis. The study also reveals a discrepancy: even when LLMs produce correct answers, they sometimes fail to correctly identify the reasoning features they employed.
What carries the argument
Evidence Scope and Transformation Level as measures of cognitive burden in answer reasoning.
If this is right
- Questions can be screened for likely difficulty levels without first giving them to students.
- Development of reading comprehension tests can incorporate automated cognitive analysis.
- Human experts can focus on other aspects while LLMs handle initial complexity estimates.
Where Pith is reading between the lines
- Similar methods could apply to estimating complexity in other educational domains like math problems or science questions.
- LLMs might be prompted to generate reading items with specific complexity profiles based on these dimensions.
- The metacognition gap points to a need for training or techniques that improve LLMs' ability to reflect on their reasoning processes.
Load-bearing premise
That the two dimensions adequately capture cognitive complexity and that LLM judgments correspond meaningfully to those of human annotators.
What would settle it
Running the LLM estimates on a fresh collection of reading comprehension items and checking whether those estimates predict actual student error rates or match new rounds of human ratings on difficulty.
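One way to run the check described above is a rank correlation between LLM complexity estimates and observed error rates. The sketch below is only an illustration: the 1-to-4 complexity scale, the item data, and the function names are assumptions made for the example, not values or methods from the paper.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one entry per item (not taken from the paper).
llm_complexity = [1, 2, 2, 3, 4]               # assumed ordinal LLM estimate
error_rate = [0.10, 0.25, 0.20, 0.40, 0.55]    # assumed share of wrong answers

rho = spearman(llm_complexity, error_rate)
print(f"Spearman rho = {rho:.3f}")
```

A rank correlation is used here because the complexity estimates are treated as ordinal; a value near zero on fresh items would count against the claim that the estimates track real difficulty.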
Original abstract
Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before it is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions, Evidence Scope and Transformation Level, that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates whether LLMs can estimate the cognitive complexity of reading comprehension items by annotating two author-defined dimensions—Evidence Scope and Transformation Level—that purportedly reflect reasoning burden. It claims that experimental results show LLMs can approximate human judgments on these dimensions, enabling prior difficulty analysis, while also identifying a gap between LLMs' correct answers and their ability to metacognitively identify the underlying reasoning features.
Significance. If the central claim holds after proper validation, the work could reduce dependence on costly human annotation for item difficulty estimation in educational assessment, with downstream uses in automated test construction and adaptive learning systems. The metacognitive discrepancy observation may also inform research on LLM reasoning transparency.
Major comments (2)
- The manuscript provides no inter-annotator agreement statistics (e.g., Cohen's kappa or Fleiss' kappa) for the human annotations of Evidence Scope and Transformation Level. This is load-bearing for the central claim, as low reliability in the human ground truth would render any LLM-human agreement uninterpretable as evidence of cognitive complexity estimation.
- No correlation or validation is reported between the annotated scores on the two dimensions and external empirical difficulty indicators such as learner error rates, response times, or item response theory parameters from actual test data. Without this, the dimensions may capture surface cues rather than genuine cognitive burden experienced by test-takers.
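The agreement statistic requested in the first major comment can be sketched as follows. This is a minimal, unweighted Cohen's kappa for two annotators; the label names ("single", "multi") and the annotation data are hypothetical and are not the paper's label scheme.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)


# Hypothetical Evidence Scope labels from two annotators (illustrative only).
annotator_1 = ["single", "single", "multi", "multi", "single", "multi", "multi", "single"]
annotator_2 = ["single", "multi", "multi", "multi", "single", "multi", "single", "single"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa = {kappa:.2f}")
```

If a dimension is ordinal, a weighted kappa may be more appropriate because it penalizes near-misses less than distant disagreements; the unweighted form is shown only because it is the simplest case.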
Minor comments (2)
- The abstract states that 'experimental results demonstrate' the claim but omits any mention of dataset size, sampling procedure, specific LLMs tested, evaluation metrics (e.g., accuracy, correlation coefficients), or statistical tests; these details should be summarized even at the abstract level for transparency.
- Clarify the rationale and potential limitations for selecting only Evidence Scope and Transformation Level; discuss whether these two dimensions are exhaustive or if additional cognitive features (e.g., inference type or working memory load) were considered.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments. We address each major comment point by point below, indicating planned revisions to the manuscript where appropriate.
Point-by-point responses
- Referee: The manuscript provides no inter-annotator agreement statistics (e.g., Cohen's kappa or Fleiss' kappa) for the human annotations of Evidence Scope and Transformation Level. This is load-bearing for the central claim, as low reliability in the human ground truth would render any LLM-human agreement uninterpretable as evidence of cognitive complexity estimation.
  Authors: We agree that inter-annotator agreement statistics are essential to establish the reliability of the human ground truth. The annotations were developed through collaborative guideline creation by the authors followed by independent annotation and consensus discussion. We will compute and report Cohen's kappa for both Evidence Scope and Transformation Level in the revised manuscript, along with details on how disagreements were resolved. This addition will directly address the concern and allow proper interpretation of LLM-human agreement results.
  Revision: yes
- Referee: No correlation or validation is reported between the annotated scores on the two dimensions and external empirical difficulty indicators such as learner error rates, response times, or item response theory parameters from actual test data. Without this, the dimensions may capture surface cues rather than genuine cognitive burden experienced by test-takers.
  Authors: We appreciate this observation on external validation. Our study centers on whether LLMs can approximate human judgments on the two theoretically motivated dimensions (Evidence Scope and Transformation Level), which are defined to capture reasoning burden in reading comprehension. We did not have access to accompanying learner performance data for the items, precluding direct correlation with error rates or IRT parameters. In the revision we will expand the Limitations and Future Work sections to explicitly discuss this gap, clarify that the current focus is on matching defined human annotations rather than proving ecological validity of the dimensions, and propose validation against real test-taker data as an important direction for follow-up research.
  Revision: partial
Circularity Check
No significant circularity; empirical comparison of LLM estimates to independent human annotations
Full rationale
The paper selects two author-defined dimensions (Evidence Scope and Transformation Level) to represent cognitive complexity, obtains human annotations on RC items, prompts LLMs to produce estimates on the same dimensions, and reports agreement as evidence that LLMs can approximate cognitive complexity. This is a standard empirical evaluation chain with no equations, fitted parameters, or derivations that reduce outputs to inputs by construction. No self-citation is invoked to establish uniqueness or forbid alternatives, and no ansatz or renaming of known results is presented as a derivation. The central claim rests on external comparison to human labels rather than self-referential fitting or definitional equivalence.
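The evaluation chain described above (human labels, LLM labels on the same dimensions, then agreement) can be sketched roughly as below. The record fields, label names, and data are hypothetical, and the "gap" measure is one plausible way to quantify correct answers paired with mismatched feature labels; the paper's actual metrics may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ItemRecord:
    human_scope: str            # human Evidence Scope label
    llm_scope: str              # LLM Evidence Scope label
    human_transformation: str   # human Transformation Level label
    llm_transformation: str     # LLM Transformation Level label
    llm_answer_correct: bool    # did the LLM answer the item correctly?


def agreement_rate(pairs: Iterable[tuple[str, str]]) -> float:
    """Share of (human, llm) label pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no items to compare")
    return sum(h == m for h, m in pairs) / len(pairs)


def summarize(records: list[ItemRecord]) -> dict[str, float]:
    correct = [r for r in records if r.llm_answer_correct]
    # Among correctly answered items, count those where at least one
    # LLM feature label differs from the human label.
    mislabeled = sum(
        1 for r in correct
        if r.human_scope != r.llm_scope
        or r.human_transformation != r.llm_transformation
    )
    return {
        "scope_agreement": agreement_rate((r.human_scope, r.llm_scope) for r in records),
        "transformation_agreement": agreement_rate(
            (r.human_transformation, r.llm_transformation) for r in records
        ),
        "correct_but_mislabeled": mislabeled / len(correct) if correct else float("nan"),
    }


# Hypothetical records (illustrative only, not data from the paper).
records = [
    ItemRecord("single", "single", "retrieve", "retrieve", True),
    ItemRecord("multi", "single", "paraphrase", "paraphrase", True),
    ItemRecord("multi", "multi", "infer", "paraphrase", True),
    ItemRecord("single", "single", "infer", "infer", False),
]

summary = summarize(records)
print(summary)
```

Raw agreement does not correct for chance, so a chance-corrected statistic would normally be reported alongside it.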
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
We examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions—Evidence Scope and Transformation Level
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
  MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.