Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations
Pith reviewed 2026-05-16 14:42 UTC · model grok-4.3
The pith
LLM role-play simulations of students recover real math test item difficulties with correlations up to 0.82
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prompting LLMs to role-play students of varying proficiency levels and diverse demographics generates response patterns that, once fit with IRT models, yield item difficulty estimates correlating as high as 0.82 with real NAEP correctness rates. Direct LLM difficulty judgments perform poorly, while simulation-based estimates improve with larger simulated classrooms, stratified names, and the use of comparatively weaker base models such as Gemma over Llama or Qwen.
What carries the argument
LLM student role-play simulations that generate correctness data for fitting Item Response Theory (IRT) models to recover item difficulty parameters
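The IRT step can be illustrated with a minimal sketch. It assumes the role-play outputs have already been scored into a 0/1 matrix (rows are simulated students, columns are items). For brevity it fits a one-parameter Rasch model by joint maximum likelihood with plain gradient ascent, which is a simplification and not necessarily the procedure the paper uses. The function name `fit_rasch`, the step size, and the toy `responses` matrix are all hypothetical.

```python
import math

def fit_rasch(responses, steps=500, lr=0.05):
    """Fit abilities (theta) and item difficulties (b) of a Rasch model.

    responses[s][i] is 1 if simulated student s answered item i correctly.
    Simplified joint maximum likelihood via gradient ascent (an assumption,
    not the paper's stated estimator).
    """
    n_students, n_items = len(responses), len(responses[0])
    theta = [0.0] * n_students
    b = [0.0] * n_items
    for _ in range(steps):
        grad_theta = [0.0] * n_students
        grad_b = [0.0] * n_items
        for s in range(n_students):
            for i in range(n_items):
                # Rasch probability of a correct answer
                p = 1.0 / (1.0 + math.exp(-(theta[s] - b[i])))
                resid = responses[s][i] - p
                grad_theta[s] += resid   # d log-lik / d theta_s
                grad_b[i] -= resid       # d log-lik / d b_i
        theta = [t + lr * g for t, g in zip(theta, grad_theta)]
        b = [d + lr * g for d, g in zip(b, grad_b)]
        # Fix the scale's origin: centre difficulties at zero and shift
        # abilities by the same amount so theta - b is unchanged.
        mean_b = sum(b) / n_items
        b = [d - mean_b for d in b]
        theta = [t - mean_b for t in theta]
    return theta, b

# Hypothetical scored outputs for 6 simulated students on 4 items.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
]
theta_hat, b_hat = fit_rasch(responses)
```

Items answered correctly by fewer simulated students receive larger `b_hat` values. Joint maximum likelihood is unstable for students with all-correct or all-wrong patterns, which is one reason marginal estimators are usually preferred in practice.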
Load-bearing premise
The response patterns produced by prompted LLM role-plays must closely track the actual distribution of abilities and common mistakes among real students at each grade level.
What would settle it
Repeating the exact simulation protocol on a fresh sample of NAEP or equivalent items and obtaining item-level correlations below 0.5 would falsify the claim that these LLM simulations reliably recover real-world difficulties.
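The falsification test above comes down to a single comparison, sketched here under stated assumptions. The source does not say which correlation coefficient is used, so Pearson's r is an assumption; the helper names and the example rates are hypothetical and are not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def replication_holds(simulated, observed, threshold=0.5):
    """True if the item-level correlation reaches the threshold."""
    return pearson(simulated, observed) >= threshold

# Hypothetical per-item correctness rates on a fresh item sample.
simulated_rates = [0.9, 0.7, 0.5, 0.3]
observed_rates = [0.80, 0.75, 0.40, 0.35]
claim_survives = replication_holds(simulated_rates, observed_rates)
```

With only a few dozen items per grade, the sampling error of the correlation is large, so a confidence interval around the estimate would be more informative than a bare threshold check.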
Original abstract
Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a "classroom" of 4th, 8th, or 12th-grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively, on the item-level correctness rates. In our simulations, we experiment on math MCQs with different "classroom sizes," showing tradeoffs between computation size and accuracy. We find that role-plays with diverse-named students improve predictions (compared to student IDs), and stratifying names across gender and race further improves predictions. Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of these models for the task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes simulating classrooms of 4th-, 8th-, and 12th-grade students via LLM role-play prompts that vary proficiency levels and student names, generating response data to fit IRT models, and then comparing the resulting item difficulty estimates against real NAEP item statistics for math multiple-choice questions. It reports correlations of up to 0.75/0.76/0.82 on item-level correctness rates, examines trade-offs with classroom size, and finds benefits from name diversity and that weaker LLMs (Gemma) outperform stronger ones (Llama, Qwen).
Significance. If the IRT-based difficulty estimates prove robust and replicable, the method could substantially lower the cost of human pilot testing for item calibration in large-scale assessments, offering a scalable LLM-driven alternative for educational measurement.
major comments (2)
- [Abstract] Abstract: the central claim is that IRT models fitted to LLM-simulated responses recover difficulty parameters whose ordering matches NAEP values, yet the headline numbers (0.75/0.76/0.82) are correlations on raw item-level correctness rates rather than on the IRT b-parameters. The abstract states the comparison is on 'learned difficulty parameters' but supplies only correctness-rate correlations; the b-parameter correlation must be reported as the primary metric if the IRT step is load-bearing.
- [Abstract] Abstract / Methods: no information is supplied on the number of items, the exact prompt templates for proficiency stratification and name assignment, the IRT fitting procedure (model type, estimation method, software), or controls for simulation variance. These omissions make it impossible to determine whether the reported correlations are stable or sensitive to post-hoc choices.
minor comments (1)
- The observation that mathematically weaker models yield better difficulty predictions is noteworthy but would benefit from explicit discussion of possible mechanisms (e.g., over- or under-confidence patterns) rather than being left as an empirical finding.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We agree that the abstract requires clarification on the primary metrics and that additional methodological details are essential for reproducibility. We will revise the manuscript accordingly.
point-by-point responses
- Referee: [Abstract] Abstract: the central claim is that IRT models fitted to LLM-simulated responses recover difficulty parameters whose ordering matches NAEP values, yet the headline numbers (0.75/0.76/0.82) are correlations on raw item-level correctness rates rather than on the IRT b-parameters. The abstract states the comparison is on 'learned difficulty parameters' but supplies only correctness-rate correlations; the b-parameter correlation must be reported as the primary metric if the IRT step is load-bearing.
  Authors: We acknowledge the mismatch between the abstract wording and the reported numbers. The IRT b-parameters are the central output of our simulations, but the headline correlations were computed on item-level correctness rates (which are the direct simulation outputs used to estimate those parameters). We will revise the abstract to report the correlations on the estimated IRT b-parameters as the primary metric, while retaining the correctness-rate results as a supplementary analysis. This change will be made in the next version.
  Revision: yes
- Referee: [Abstract] Abstract / Methods: no information is supplied on the number of items, the exact prompt templates for proficiency stratification and name assignment, the IRT fitting procedure (model type, estimation method, software), or controls for simulation variance. These omissions make it impossible to determine whether the reported correlations are stable or sensitive to post-hoc choices.
  Authors: We agree these details are necessary. The full manuscript will be updated to specify: the number of items (approximately 40–60 NAEP math MCQs per grade level), the complete prompt templates (including proficiency stratification and name assignment) in a new appendix, the IRT procedure (2PL model estimated via marginal maximum likelihood in the mirt R package), and variance controls (results averaged across 10 independent simulation runs with standard errors reported). These additions will appear in the Methods section and appendix.
  Revision: yes
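For reference, the 2PL model named in the rebuttal has the standard textbook form below. The symbols are the conventional ones and are an assumption here, since the review does not define any notation: \theta_s is the ability of student s, a_j the discrimination of item j, and b_j its difficulty.

```latex
P(X_{sj} = 1 \mid \theta_s) = \frac{1}{1 + \exp\bigl(-a_j(\theta_s - b_j)\bigr)}
```

The b-parameter the referee asks about is b_j in this expression: the ability level at which a student has a 50% chance of answering item j correctly. Setting a_j = 1 for every item gives the simpler Rasch (1PL) model.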
Circularity Check
No significant circularity; empirical validation against external NAEP data
full rationale
The paper generates LLM-simulated student responses independently, fits standard IRT models to those simulated outcomes, and computes correlations between the resulting item statistics and real-world NAEP item-level correctness rates. No derivation step reduces the reported correlations or difficulty parameters to the simulation inputs by construction, self-definition, or self-citation chain. The comparison is to an external benchmark (NAEP statistics) that is not part of the model's fitted values.
Axiom & Free-Parameter Ledger
free parameters (2)
- classroom size
- proficiency stratification
axioms (1)
- domain assumption: LLM role-play with ability-level prompts produces correctness rates whose ordering reflects real student difficulty