Scalable Generation and Validation of Isomorphic Physics Problems with GenAI
Pith reviewed 2026-05-21 13:18 UTC · model grok-4.3
The pith
Generative AI can produce large banks of isomorphic physics problems that maintain consistent difficulty and allow language models to validate them by matching student performance patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models and compare against actual student performance across three midterm exams. 73%
What carries the argument
Prompt chaining and tool use that control numeric values, spatial relations, and contextual changes while preserving identical underlying physics concepts, validated by comparing language-model accuracy patterns against student performance data.
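To make the idea of prompt chaining with tool use concrete, here is a minimal Python sketch. It is not the paper's pipeline: the function names, the context list, and the kinematics template are all hypothetical, and the language-model call is replaced by a plain string template so the example runs offline. It shows only the division of labor: structural values are sampled under program control, the answer is computed by a deterministic tool rather than by the model, and the surface context varies independently.

```python
import random

# Hypothetical surface contexts; in a real chain an LLM would propose these.
CONTEXTS = [
    "a cyclist on a straight road",
    "a cart on an air track",
    "a drone flying level",
]


def sample_parameters(rng):
    """Chain step 1 (structural variation): draw numeric values from fixed sets."""
    return {
        "v0": rng.choice([2.0, 4.0, 6.0]),  # initial speed, m/s
        "a": rng.choice([1.5, 2.0, 3.0]),   # acceleration, m/s^2
        "t": rng.choice([2.0, 3.0, 5.0]),   # elapsed time, s
    }


def solve_with_tool(params):
    """Tool call: compute the answer key deterministically, v = v0 + a * t."""
    return params["v0"] + params["a"] * params["t"]


def draft_problem_text(context, params):
    """Chain step 2 (contextual variation): stand-in for an LLM rewriting step."""
    return (
        f"Consider {context} moving at {params['v0']} m/s. "
        f"It accelerates uniformly at {params['a']} m/s^2 for {params['t']} s. "
        "What is its final speed?"
    )


def generate_variant(rng):
    """Run the chain once and return one isomorphic variant with its answer key."""
    params = sample_parameters(rng)
    context = rng.choice(CONTEXTS)
    return {
        "context": context,
        "params": params,
        "text": draft_problem_text(context, params),
        "answer": solve_with_tool(params),
    }


# Build a small bank; each seed gives a reproducible variant.
bank = [generate_variant(random.Random(seed)) for seed in range(5)]
for variant in bank:
    print(variant["text"], "->", variant["answer"], "m/s")
```

Keeping the answer computation outside the language model is the design choice worth noting: the text can vary freely while the answer key stays exact.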
If this is right
- Seventy-three percent of generated banks reach statistically homogeneous difficulty suitable for deployment.
- Language models can flag ambiguous or outlier variants before students see them.
- Mid-sized models avoid the floor and ceiling effects seen in very small or very large models during validation.
- The method supports richer surface variation than simple parameter changes while keeping difficulty stable.
- Asynchronous multi-attempt assessments become feasible without compromising security or comparability.
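One way to check whether a bank has homogeneous difficulty is to compare each variant's success rate with the pooled rate of the remaining variants. The sketch below assumes Fisher's exact test in a leave-one-out arrangement; the summary does not state which test or threshold the authors used, so the test choice, the `alpha` value, and the response counts are illustrative assumptions only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x successes in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as likely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def flag_outlier_variants(bank, alpha=0.05):
    """Return variants whose success rate differs from the rest of the bank.

    bank maps a variant name to (number correct, number of responses).
    """
    total_correct = sum(correct for correct, _ in bank.values())
    total_n = sum(n for _, n in bank.values())
    flagged = []
    for name, (correct, n) in bank.items():
        rest_correct = total_correct - correct
        rest_n = total_n - n
        p = fisher_exact_two_sided(
            correct, n - correct, rest_correct, rest_n - rest_correct
        )
        if p < alpha:
            flagged.append(name)
    return flagged


# Made-up response counts: five similar variants and one much harder one.
example_bank = {
    "v1": (24, 30),
    "v2": (23, 30),
    "v3": (24, 30),
    "v4": (23, 30),
    "v5": (24, 30),
    "v6": (10, 30),
}
print(flag_outlier_variants(example_bank))
```

A bank with no flagged variant would count as homogeneous under this criterion. With many variants per bank, a multiple-comparison correction on `alpha` would be prudent.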
Where Pith is reading between the lines
- The same generation and validation steps could be tested on other STEM subjects to check transferability.
- If the proxy relationship holds, institutions could automate much of the effort required to produce varied, secure problem sets.
- Repeated use of such banks might allow researchers to measure whether actual cheating rates decline in practice.
- Extending the approach to include automatic difficulty calibration could further reduce the need for human review.
Load-bearing premise
Language model performance patterns on the generated problems serve as a reliable proxy for the conceptual difficulty experienced by human students.
What would settle it
A direct comparison in which student success rates on the isomorphic variants show no statistically significant correlation with the accuracy rates of the validating language models.
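The comparison described above can be sketched as a Pearson correlation between per-variant student success rates and per-variant language-model accuracy, with a permutation test standing in for the significance check. This is an illustration only: the accuracy values are invented, and the permutation test is an assumption of this sketch rather than the procedure reported in the paper.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value: how often shuffled pairings match the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Invented per-variant accuracies for eight variants of one problem.
student_accuracy = [0.82, 0.75, 0.40, 0.78, 0.66, 0.35, 0.80, 0.71]
lm_accuracy = [0.70, 0.65, 0.30, 0.72, 0.50, 0.45, 0.68, 0.60]

r = pearson(student_accuracy, lm_accuracy)
p = permutation_p_value(student_accuracy, lm_accuracy)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

If `p` stayed large across banks and models, that would be the kind of evidence against the proxy assumption that this section describes.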
Original abstract
Traditional synchronous STEM assessments face growing challenges including accessibility barriers, security concerns from resource-sharing platforms, and limited comparability across institutions. We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations (numeric values, spatial relations) alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models (LMs) (0.6B-32B) and compare against actual student performance (N>200) across three midterm exams. Results show that 73% of deployed banks achieve statistically homogeneous difficulty, and LM patterns correlate strongly with student performance (Pearson's $\rho$ up to 0.594). Additionally, LMs successfully identify problematic variants, such as ambiguous problem texts. Model scale also proves critical for effective validation, where extremely small (<4B) and large (>14B) models exhibit floor and ceiling effects respectively, making mid-sized models optimal for detecting difficulty outliers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a GenAI framework using prompt chaining and tool use to generate large banks of isomorphic physics problems that vary surface features while preserving conceptual difficulty. It validates the approach empirically by evaluating items with 17 open-source LMs (0.6B–32B parameters) and comparing performance patterns against student responses from N>200 participants across three midterm exams, reporting that 73% of deployed banks show statistically homogeneous difficulty and that LM patterns correlate with student performance (Pearson’s ρ up to 0.594). Model scale is identified as critical, with mid-sized models optimal for outlier detection due to floor/ceiling effects in smaller and larger models.
Significance. If the central claims hold, the work offers a practical route to scalable, secure, and comparable asynchronous STEM assessments that address accessibility and academic-integrity concerns. The use of multiple independent open-source models together with real student performance data constitutes a concrete strength, moving beyond purely synthetic validation. The moderate correlation and lack of qualitative error analysis, however, leave the reliability of LM patterns as a proxy for human conceptual difficulty open to further scrutiny.
major comments (2)
- [Abstract / §4 (Empirical Validation)] Abstract and validation results: The headline claim that 73% of banks achieve statistically homogeneous difficulty and that LMs correlate with student performance (ρ up to 0.594) rests on treating LM accuracy/error patterns as a proxy for conceptual difficulty experienced by students. No qualitative error analysis, per-concept breakdown, or controls are described to establish that LMs and students fail the same items for the same reasons rather than shared surface cues; this assumption is load-bearing for the pre-deployment validation argument.
- [Abstract / §4.2] Abstract: The statement that mid-sized models are optimal for detecting difficulty outliers is presented without reporting the precise size thresholds, the statistical criteria used to identify outliers, or the number of banks on which the floor/ceiling effects were observed, making it difficult to evaluate the robustness of the scale recommendation.
minor comments (3)
- [Abstract] The abstract states N>200 students but provides no exclusion criteria, demographic details, or exact statistical tests (e.g., which homogeneity test and significance threshold) used to classify banks as homogeneous.
- [§3 (Generation Framework)] Prompt templates and tool-use specifications for controlling numeric values and spatial relations are referenced but not illustrated with concrete examples, limiting reproducibility.
- [Results] Figure or table captions should explicitly state the number of problem banks and items underlying the 73% homogeneity figure and the reported ρ values.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and have made revisions to strengthen the paper where indicated.
- Referee: [Abstract / §4 (Empirical Validation)] Abstract and validation results: The headline claim that 73% of banks achieve statistically homogeneous difficulty and that LMs correlate with student performance (ρ up to 0.594) rests on treating LM accuracy/error patterns as a proxy for conceptual difficulty experienced by students. No qualitative error analysis, per-concept breakdown, or controls are described to establish that LMs and students fail the same items for the same reasons rather than shared surface cues; this assumption is load-bearing for the pre-deployment validation argument.
  Authors: We acknowledge the importance of demonstrating that the alignment between LM and student performance patterns reflects shared conceptual difficulties rather than coincidental surface features. Our validation is primarily quantitative, relying on Pearson correlations and tests for statistical homogeneity of difficulty across banks. To address this concern, we have added a qualitative error analysis in the revised manuscript. This includes examining specific error types (e.g., conceptual misconceptions vs. calculation errors) for a representative sample of 50 items across the three exams, with per-concept breakdowns for topics like kinematics and circuits. We also incorporated controls by comparing performance on variants with controlled surface features. These additions support that the proxy is reasonable for pre-deployment screening.
  Revision: yes
- Referee: [Abstract / §4.2] Abstract: The statement that mid-sized models are optimal for detecting difficulty outliers is presented without reporting the precise size thresholds, the statistical criteria used to identify outliers, or the number of banks on which the floor/ceiling effects were observed, making it difficult to evaluate the robustness of the scale recommendation.
  Authors: We appreciate this request for greater precision. The manuscript identifies extremely small models (<4B parameters) as showing floor effects and large models (>14B) as showing ceiling effects. In the revision, we now specify the exact thresholds used: models with average accuracy below 25% across banks for floor effects and above 85% for ceiling effects, determined via one-sample t-tests against chance performance (p < 0.01). These patterns were observed consistently across all 22 deployed problem banks. We have added a table listing the 17 models by size category and a figure plotting accuracy by model scale for each bank to illustrate the outlier detection capability of mid-sized models (4B to 14B parameters).
  Revision: yes
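The accuracy thresholds quoted in this simulated response (below 25% for floor effects, above 85% for ceiling effects) can be expressed as a simple screening rule. The sketch below is an assumption-laden illustration: the model names and per-bank accuracies are invented, the thresholds are taken from the simulated rebuttal rather than from the paper itself, and the t-test step mentioned there is omitted.

```python
def classify_model(bank_accuracies, floor=0.25, ceiling=0.85):
    """Label a model by its mean accuracy across problem banks.

    "floor" and "ceiling" models show too little spread between variants
    to reveal difficulty outliers; "usable" models sit in between.
    """
    if not bank_accuracies:
        raise ValueError("need at least one bank accuracy")
    mean_accuracy = sum(bank_accuracies) / len(bank_accuracies)
    if mean_accuracy < floor:
        return "floor"
    if mean_accuracy > ceiling:
        return "ceiling"
    return "usable"


# Hypothetical models with invented per-bank accuracies.
models = {
    "small-0.6B": [0.10, 0.18, 0.22],
    "mid-8B": [0.55, 0.62, 0.48],
    "large-32B": [0.93, 0.90, 0.96],
}

for name, accuracies in models.items():
    print(name, "->", classify_model(accuracies))
```

Only models labeled "usable" would then be passed to an outlier check on individual variants.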
Circularity Check
No significant circularity; validation uses external student data and independent models
The paper reports empirical results from comparing LM performance patterns on generated isomorphic problems against actual student performance data (N>200 responses across three midterms) and evaluations from 17 independent open-source models (0.6B-32B parameters). The 73% homogeneous difficulty rate and Pearson's ρ up to 0.594 are direct statistical comparisons to these external benchmarks rather than any self-referential derivation, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, ansatzes, or uniqueness theorems are invoked that reduce claims to prior author work or internal inputs by construction. The framework remains self-contained against external falsifiable data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Language model performance patterns can serve as a proxy for human student difficulty in physics problems.