Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

Gary Geunbae Lee; Hyounghun Kim; Seonjeong Hwang

arxiv: 2510.25064 · v2 · pith:W5C5QI32new · submitted 2025-10-29 · 💻 cs.CL

Pith reviewed 2026-05-21 20:30 UTC · model grok-4.3

keywords: LLM, cognitive complexity, reading comprehension, item difficulty estimation, evidence scope, transformation level

The pith

Large language models can approximate the cognitive complexity of reading comprehension items by estimating evidence needs and reasoning transformations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can judge the cognitive complexity of reading comprehension questions. It uses two specific dimensions to capture this complexity: the scope of evidence required from the text and the level of transformation needed to reach the answer. If LLMs can match human estimates on these dimensions, automated tools could analyze question difficulty before items are administered. The experiments indicate that LLMs can approximate these estimates. The work also finds that LLMs sometimes cannot identify the cognitive features behind their own correct answers.

Core claim

Large language models can approximate the cognitive complexity of reading comprehension items along the dimensions of Evidence Scope and Transformation Level. These dimensions reflect the cognitive burden of the reasoning needed to find the answer. The results show that LLM estimates align with human annotations, supporting their use for pre-administration difficulty analysis. The study further reveals a discrepancy: LLMs can produce correct answers yet sometimes fail to accurately identify the reasoning features they employed.

What carries the argument

Evidence Scope and Transformation Level as measures of cognitive burden in answer reasoning.

If this is right

Questions can be screened for likely difficulty levels without first giving them to students.
Development of reading comprehension tests can incorporate automated cognitive analysis.
Human experts can focus on other aspects while LLMs handle initial complexity estimates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar methods could apply to estimating complexity in other educational domains like math problems or science questions.
LLMs might be prompted to generate reading items with specific complexity profiles based on these dimensions.
The metacognition gap points to a need for training or techniques that improve LLMs' ability to reflect on their reasoning processes.

Load-bearing premise

That the two dimensions adequately capture cognitive complexity and that LLM judgments correspond meaningfully to those of human annotators.

What would settle it

Running the LLM estimates on a fresh collection of reading comprehension items and checking whether those estimates predict actual student error rates or match new rounds of human ratings on difficulty.
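One way to run the first half of this check is a rank correlation between LLM complexity estimates and observed student error rates. The sketch below is illustrative only: the data values, the 1 to 5 complexity scale, and the helper names `average_ranks` and `spearman` are assumptions, not taken from the paper, and Spearman's rho is one reasonable statistic rather than one the authors are known to use.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one LLM complexity estimate and one observed
# student error rate per item. Not from the paper.
llm_complexity = [1, 2, 2, 3, 4, 5]
error_rate = [0.10, 0.25, 0.20, 0.40, 0.35, 0.60]

rho = spearman(llm_complexity, error_rate)
print(f"Spearman rho = {rho:.3f}")
```

A rho near zero on fresh items would suggest the estimates do not track experienced difficulty, even if they agree with human annotators.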

Original abstract

Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before it is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions, Evidence Scope and Transformation Level, that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLMs approximate human ratings on Evidence Scope and Transformation Level for RC items, but the work provides no visible checks on annotation reliability or links to actual learner performance.

The paper checks whether LLMs can rate reading comprehension items on two author-defined dimensions that are meant to capture cognitive burden during answer reasoning. The main positive result is that the models produce scores close enough to human labels to suggest they could serve as a quick pre-screen for item difficulty. The secondary observation is that LLMs sometimes answer correctly yet still misidentify the underlying reasoning features, which is a clean point about limits in model self-reporting.

Referee Report

2 major / 2 minor

Summary. The paper investigates whether LLMs can estimate the cognitive complexity of reading comprehension items by annotating two author-defined dimensions—Evidence Scope and Transformation Level—that purportedly reflect reasoning burden. It claims that experimental results show LLMs can approximate human judgments on these dimensions, enabling prior difficulty analysis, while also identifying a gap between LLMs' correct answers and their ability to metacognitively identify the underlying reasoning features.

Significance. If the central claim holds after proper validation, the work could reduce dependence on costly human annotation for item difficulty estimation in educational assessment, with downstream uses in automated test construction and adaptive learning systems. The metacognitive discrepancy observation may also inform research on LLM reasoning transparency.

major comments (2)

The manuscript provides no inter-annotator agreement statistics (e.g., Cohen's kappa or Fleiss' kappa) for the human annotations of Evidence Scope and Transformation Level. This is load-bearing for the central claim, as low reliability in the human ground truth would render any LLM-human agreement uninterpretable as evidence of cognitive complexity estimation.
No correlation or validation is reported between the annotated scores on the two dimensions and external empirical difficulty indicators such as learner error rates, response times, or item response theory parameters from actual test data. Without this, the dimensions may capture surface cues rather than genuine cognitive burden experienced by test-takers.
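The agreement statistic requested in the first major comment can be sketched as follows. This is a minimal, assumption-laden illustration of Cohen's kappa for two annotators. The label names `single` and `multi` are hypothetical stand-ins for Evidence Scope categories, which this review does not list, and the annotations are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently
    # according to their own marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Hypothetical Evidence Scope annotations for six items. Not from the paper.
annotator_1 = ["single", "single", "multi", "multi", "single", "multi"]
annotator_2 = ["single", "multi", "multi", "multi", "single", "single"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa = {kappa:.3f}")
```

Here the annotators agree on four of six items, but half of that agreement is expected by chance, so kappa is well below the raw agreement rate. That difference is why the comment asks for kappa rather than percent agreement.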

minor comments (2)

The abstract states that 'experimental results demonstrate' the claim but omits any mention of dataset size, sampling procedure, specific LLMs tested, evaluation metrics (e.g., accuracy, correlation coefficients), or statistical tests; these details should be summarized even at the abstract level for transparency.
Clarify the rationale and potential limitations for selecting only Evidence Scope and Transformation Level; discuss whether these two dimensions are exhaustive or if additional cognitive features (e.g., inference type or working memory load) were considered.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment point by point below, indicating planned revisions to the manuscript where appropriate.

Referee: The manuscript provides no inter-annotator agreement statistics (e.g., Cohen's kappa or Fleiss' kappa) for the human annotations of Evidence Scope and Transformation Level. This is load-bearing for the central claim, as low reliability in the human ground truth would render any LLM-human agreement uninterpretable as evidence of cognitive complexity estimation.

Authors: We agree that inter-annotator agreement statistics are essential to establish the reliability of the human ground truth. The annotations were developed through collaborative guideline creation by the authors followed by independent annotation and consensus discussion. We will compute and report Cohen's kappa for both Evidence Scope and Transformation Level in the revised manuscript, along with details on how disagreements were resolved. This addition will directly address the concern and allow proper interpretation of LLM-human agreement results.

revision: yes

Referee: No correlation or validation is reported between the annotated scores on the two dimensions and external empirical difficulty indicators such as learner error rates, response times, or item response theory parameters from actual test data. Without this, the dimensions may capture surface cues rather than genuine cognitive burden experienced by test-takers.

Authors: We appreciate this observation on external validation. Our study centers on whether LLMs can approximate human judgments on the two theoretically motivated dimensions (Evidence Scope and Transformation Level), which are defined to capture reasoning burden in reading comprehension. We did not have access to accompanying learner performance data for the items, precluding direct correlation with error rates or IRT parameters. In the revision we will expand the Limitations and Future Work sections to explicitly discuss this gap, clarify that the current focus is on matching defined human annotations rather than proving ecological validity of the dimensions, and propose validation against real test-taker data as an important direction for follow-up research.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical comparison of LLM estimates to independent human annotations

The paper selects two author-defined dimensions (Evidence Scope and Transformation Level) to represent cognitive complexity, obtains human annotations on RC items, prompts LLMs to produce estimates on the same dimensions, and reports agreement as evidence that LLMs can approximate cognitive complexity. This is a standard empirical evaluation chain with no equations, fitted parameters, or derivations that reduce outputs to inputs by construction. No self-citation is invoked to establish uniqueness or forbid alternatives, and no ansatz or renaming of known results is presented as a derivation. The central claim rests on external comparison to human labels rather than self-referential fitting or definitional equivalence.
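The comparison step at the end of this chain can be sketched as a per-dimension agreement report. Everything concrete here is an assumption for illustration: the label vocabularies (`single`/`multi`, `literal`/`paraphrase`/`inference`), the data, and the function name `agreement_report` are hypothetical, and exact-match rate is only one of several metrics the paper might report.

```python
from collections import Counter


def agreement_report(human, model):
    """Return (exact-match rate, confusion counts) for two label lists."""
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and equally long")
    # Keys are (human_label, model_label) pairs.
    confusion = Counter(zip(human, model))
    matches = sum(n for (h, m), n in confusion.items() if h == m)
    return matches / len(human), confusion


# Hypothetical human annotations and LLM estimates for four items.
human = {
    "evidence_scope": ["single", "multi", "multi", "single"],
    "transformation_level": ["literal", "paraphrase", "inference", "inference"],
}
model = {
    "evidence_scope": ["single", "multi", "single", "single"],
    "transformation_level": ["literal", "paraphrase", "inference", "paraphrase"],
}

reports = {dim: agreement_report(human[dim], model[dim]) for dim in human}
for dim, (rate, confusion) in reports.items():
    print(dim, rate, dict(confusion))
```

Because the human labels are collected independently of the model outputs, nothing in this comparison is true by construction, which is the point the circularity check makes.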

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the work is presented as an empirical feasibility study relying on standard LLM capabilities and human-annotated cognitive dimensions.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: "We examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions, Evidence Scope and Transformation Level"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: "Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
cs.CL · 2026-05 · unverdicted · novelty 6.0

MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.