Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations

Christabel Acquaye; Marine Carpuat; Rachel Rudinger; Yi Ting Huang

arxiv: 2601.09953 · v2 · submitted 2026-01-15 · 💻 cs.CL

Pith reviewed 2026-05-16 14:42 UTC · model grok-4.3

keywords: LLM simulations, item difficulty, item response theory, NAEP assessments, math questions, student role-play, difficulty estimation, multiple-choice items

The pith

LLM role-play simulations of students recover real math test item difficulties with correlations up to 0.82

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether open-source large language models can stand in for expensive human pilot studies when ranking the difficulty of multiple-choice math questions. Instead of asking an LLM to judge difficulty outright, the authors prompt it to role-play entire classrooms of fourth-, eighth-, or twelfth-grade students who vary in proficiency and background. The simulated answers are then used to fit standard Item Response Theory models whose difficulty parameters are compared against official item statistics from the National Assessment of Educational Progress. Correlations reach 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively. The simulations also show that diverse student names stratified by gender and race improve the match, and that mathematically weaker models yield better real-world difficulty predictions than stronger ones.

Core claim

Prompting LLMs to role-play students of varying proficiency levels and diverse demographics generates response patterns that, once fit with IRT models, yield item difficulty estimates correlating as high as 0.82 with real NAEP correctness rates. Direct LLM difficulty judgments perform poorly, while simulation-based estimates improve with larger simulated classrooms, stratified names, and the use of comparatively weaker base models such as Gemma over Llama or Qwen.

What carries the argument

LLM student role-play simulations that generate correctness data for fitting Item Response Theory (IRT) models to recover item difficulty parameters
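
For orientation, one widely used IRT model is the two-parameter logistic (2PL). The abstract does not say which variant the authors fit, so the equation below is an illustrative assumption rather than the paper's stated model. Here $\theta_i$ is the ability of student $i$, $a_j$ is the discrimination of item $j$, and $b_j$ is its difficulty.

```latex
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)}
```

Fixing $a_j = 1$ for every item gives the one-parameter (Rasch) model. At a fixed ability, a larger $b_j$ lowers the probability of a correct response, which is why $b_j$ is read as item difficulty.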

Load-bearing premise

The response patterns produced by prompted LLM role-plays must closely track the actual distribution of abilities and common mistakes among real students at each grade level.

What would settle it

Repeating the exact simulation protocol on a fresh sample of NAEP or equivalent items and obtaining item-level correlations below 0.5 would falsify the claim that these LLM simulations reliably recover real-world difficulties.
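
A minimal sketch of that check, assuming Pearson correlation over per-item correctness rates (the summary does not say whether the paper uses Pearson or a rank correlation). The function names and the 0.5 default are illustrative and not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def replication_holds(simulated_rates, reference_rates, threshold=0.5):
    """True if the item-level correlation meets the (assumed) threshold."""
    return pearson(simulated_rates, reference_rates) >= threshold
```

Both arguments would be lists with one correctness rate per item, in the same item order: the first from the simulated classroom, the second from the reference statistics for the fresh item sample.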

Original abstract

Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a "classroom" of 4th, 8th, or 12th-grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively, on the item-level correctness rates. In our simulations, we experiment on math MCQs with different "classroom sizes," showing tradeoffs between computation size and accuracy. We find that role-plays with diverse-named students improve predictions (compared to student IDs), and stratifying names across gender and race further improves predictions. Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of these models for the task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes simulating classrooms of 4th-, 8th-, and 12th-grade students via LLM role-play prompts that vary proficiency levels and student names, generating response data to fit IRT models, and then comparing the resulting item difficulty estimates against real NAEP item statistics for math multiple-choice questions. It reports correlations of up to 0.75/0.76/0.82 on item-level correctness rates, examines trade-offs with classroom size, and finds benefits from name diversity and that weaker LLMs (Gemma) outperform stronger ones (Llama, Qwen).

Significance. If the IRT-based difficulty estimates prove robust and replicable, the method could substantially lower the cost of human pilot testing for item calibration in large-scale assessments, offering a scalable LLM-driven alternative for educational measurement.

major comments (2)

[Abstract] The central claim is that IRT models fitted to LLM-simulated responses recover difficulty parameters whose ordering matches NAEP values, yet the headline numbers (0.75/0.76/0.82) are correlations on raw item-level correctness rates rather than on the IRT b-parameters. The abstract states the comparison is on 'learned difficulty parameters' but supplies only correctness-rate correlations; the b-parameter correlation must be reported as the primary metric if the IRT step is load-bearing.
[Abstract / Methods] No information is supplied on the number of items, the exact prompt templates for proficiency stratification and name assignment, the IRT fitting procedure (model type, estimation method, software), or controls for simulation variance. These omissions make it impossible to determine whether the reported correlations are stable or sensitive to post-hoc choices.
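
The gap the first major comment points at can be made concrete with a small sketch. It assumes a Rasch-style simplification that is not the paper's stated procedure: if every student had ability zero, an item's difficulty would be exactly the negative logit of its correctness rate, b = -ln(p / (1 - p)), and with a spread of abilities this is only a rough proxy. Because that transform is monotone, a rank correlation is the same whether computed on rates or on the proxy difficulties, while a Pearson correlation generally is not. All function names are illustrative.

```python
from math import log, sqrt


def neg_logit(rate):
    """Map a correctness rate in (0, 1) to a Rasch-style difficulty proxy."""
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must lie strictly between 0 and 1")
    return -log(rate / (1.0 - rate))


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Under a 2PL fit, where each item also has its own discrimination, the fitted b-parameters need not be a monotone function of the correctness rates, so even the rank correlation could differ. That is one reason the two metrics should be reported separately.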

minor comments (1)

The observation that mathematically weaker models yield better difficulty predictions is noteworthy but would benefit from explicit discussion of possible mechanisms (e.g., over- or under-confidence patterns) rather than being left as an empirical finding.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the abstract requires clarification on the primary metrics and that additional methodological details are essential for reproducibility. We will revise the manuscript accordingly.

Point-by-point responses

Referee: [Abstract] The central claim is that IRT models fitted to LLM-simulated responses recover difficulty parameters whose ordering matches NAEP values, yet the headline numbers (0.75/0.76/0.82) are correlations on raw item-level correctness rates rather than on the IRT b-parameters. The abstract states the comparison is on 'learned difficulty parameters' but supplies only correctness-rate correlations; the b-parameter correlation must be reported as the primary metric if the IRT step is load-bearing.

Authors: We acknowledge the mismatch between the abstract wording and the reported numbers. The IRT b-parameters are the central output of our simulations, but the headline correlations were computed on item-level correctness rates (which are the direct simulation outputs used to estimate those parameters). We will revise the abstract to report the correlations on the estimated IRT b-parameters as the primary metric, while retaining the correctness-rate results as a supplementary analysis. This change will be made in the next version.
revision: yes

Referee: [Abstract / Methods] No information is supplied on the number of items, the exact prompt templates for proficiency stratification and name assignment, the IRT fitting procedure (model type, estimation method, software), or controls for simulation variance. These omissions make it impossible to determine whether the reported correlations are stable or sensitive to post-hoc choices.

Authors: We agree these details are necessary. The full manuscript will be updated to specify: the number of items (approximately 40–60 NAEP math MCQs per grade level), the complete prompt templates (including proficiency stratification and name assignment) in a new appendix, the IRT procedure (2PL model estimated via marginal maximum likelihood in the mirt R package), and variance controls (results averaged across 10 independent simulation runs with standard errors reported). These additions will appear in the Methods section and appendix.
revision: yes
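
The variance control described in this simulated response (averaging over independent runs and reporting a standard error) can be sketched as follows. This is a generic calculation under the assumption that each run yields one summary statistic, such as a correlation; the function name is hypothetical and no values from the paper are used.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_standard_error(run_values):
    """Return (mean, standard error of the mean) across independent runs.

    Uses the sample standard deviation (n - 1 denominator) divided by
    sqrt(n), so at least two runs are required.
    """
    n = len(run_values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return mean(run_values), stdev(run_values) / sqrt(n)
```

Calling it with a list of ten per-run correlations would give the averaged value and its standard error for one grade level.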

Circularity Check

0 steps flagged

No significant circularity; empirical validation against external NAEP data

Full rationale

The paper generates LLM-simulated student responses independently, fits standard IRT models to those simulated outcomes, and computes correlations between the resulting item statistics and real-world NAEP item-level correctness rates. No derivation step reduces the reported correlations or difficulty parameters to the simulation inputs by construction, self-definition, or self-citation chain. The comparison is to an external benchmark (NAEP statistics) that is not part of the model's fitted values.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that prompted LLMs can produce response distributions representative of real students; no free parameters are explicitly fitted to NAEP data, and no new entities are postulated.

free parameters (2)

classroom size: Number of simulated students per prompt; authors experiment with different sizes but do not report a single fitted value.
proficiency stratification: Choice of how many ability levels and how names are assigned across gender and race; these are design choices rather than data-fitted constants.
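
The two design choices listed above can be pictured as a roster builder. The sketch below is an assumption about how such a roster might be laid out, not the authors' procedure: it cycles through every combination of proficiency level, gender, and race group so that a classroom of any size is spread as evenly as possible across cells. The labels are placeholders, and a real pipeline would presumably map each cell to a pool of student names before prompting.

```python
from itertools import cycle, product


def build_classroom(size, proficiency_levels, genders, race_groups):
    """Assign simulated students round-robin to stratification cells.

    Each cell is one (proficiency, gender, race_group) combination, so
    cell counts never differ by more than one student.
    """
    if size < 0:
        raise ValueError("size must be non-negative")
    cells = list(product(proficiency_levels, genders, race_groups))
    if not cells:
        raise ValueError("every stratification list must be non-empty")
    students = []
    for student_id, (level, gender, race_group) in zip(range(size), cycle(cells)):
        students.append(
            {
                "student_id": student_id,
                "proficiency": level,
                "gender": gender,
                "race_group": race_group,
            }
        )
    return students
```

With three proficiency levels, two genders, and two race groups there are 12 cells, so a classroom of 24 places exactly two students in each cell.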

axioms (1)

domain assumption: LLM role-play with ability-level prompts produces correctness rates whose ordering reflects real student difficulty. Invoked throughout the simulation design and IRT fitting step.
