
arxiv: 2602.05114 · v2 · pith:3PYAOBLWnew · submitted 2026-02-04 · 💻 cs.CY

Scalable Generation and Validation of Isomorphic Physics Problems with GenAI

Pith reviewed 2026-05-21 13:18 UTC · model grok-4.3

classification 💻 cs.CY
keywords isomorphic problems · generative AI · physics problem generation · educational assessment · language model validation · STEM education · difficulty homogeneity

The pith

Generative AI can produce large banks of isomorphic physics problems that maintain consistent difficulty and allow language models to validate them by matching student performance patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that uses generative AI to create many versions of physics problems, each testing the same core ideas but with different numbers, wording, and contexts. This approach aims to support repeated practice and asynchronous testing while reducing risks from shared solutions and improving consistency across different settings. Validation experiments with seventeen open-source language models (0.6B to 32B parameters) show that seventy-three percent of the deployed problem banks have statistically homogeneous difficulty, and the models' accuracy patterns correlate with real student results, with Pearson coefficients up to 0.594. Mid-sized models perform best at spotting variants that are too easy, too hard, or ambiguously worded. The work therefore offers a practical route to scalable, secure assessment materials in physics education.

Core claim

We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models and compare against actual student performance across three midterm exams. 73%

What carries the argument

Prompt chaining and tool use that controls numeric values, spatial relations, and contextual changes while preserving identical underlying physics concepts, validated by comparing language-model accuracy patterns against student performance data.
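The split between structural and contextual variation can be illustrated with a minimal sketch. It is not the paper's pipeline: no language model is called, and the names `CONTEXTS`, `fall_time`, `make_variant`, and `make_bank` are hypothetical. It only shows the idea that a deterministic "tool" computes the numeric answer while the surface story changes around the same physics (free fall from rest).

```python
import random
from dataclasses import dataclass

G = 9.8  # m/s^2, value assumed by this illustrative solver


@dataclass
class Variant:
    text: str
    answer: float  # fall time in seconds


# Hypothetical surface contexts: same concept, different story.
CONTEXTS = [
    "A wrench slips from a scaffold {h} m above the ground.",
    "A hiker drops a pebble from a bridge {h} m above a river.",
    "A drone releases a package while hovering {h} m above a field.",
]


def fall_time(height_m: float) -> float:
    """'Tool' step: compute the answer deterministically, not by asking a model."""
    return (2 * height_m / G) ** 0.5


def make_variant(rng: random.Random) -> Variant:
    height = rng.choice([5, 10, 20, 45, 80])  # structural variation (numeric value)
    context = rng.choice(CONTEXTS)            # contextual variation (surface story)
    text = context.format(h=height) + " Ignoring air resistance, how long does the fall take?"
    return Variant(text=text, answer=round(fall_time(height), 2))


def make_bank(size: int, seed: int = 0) -> list[Variant]:
    """Build a reproducible bank of variants from a seeded generator."""
    rng = random.Random(seed)
    return [make_variant(rng) for _ in range(size)]
```

In a real prompt chain, the template list would presumably be replaced by model-written contexts, with the solver still supplying the numbers; that division of labor is the assumption this sketch makes.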

If this is right

  • Seventy-three percent of deployed banks show statistically homogeneous difficulty.
  • Language models can flag ambiguous or outlier variants before students see them.
  • Mid-sized models avoid the floor and ceiling effects seen in very small or very large models during validation.
  • The method supports richer surface variation than simple parameter changes while keeping difficulty stable.
  • Asynchronous multi-attempt assessments become feasible without compromising security or comparability.
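One way to make "statistically homogeneous difficulty" concrete is a permutation test on per-variant accuracy. This summary does not say which homogeneity test the paper uses, so the test below is a stand-in chosen for illustration, and the function names are hypothetical. Each variant contributes a list of 0/1 correctness scores; the statistic is the gap between the easiest and hardest variant.

```python
import random


def accuracy_range(groups: list[list[int]]) -> float:
    """Spread between the easiest and hardest variant (max minus min accuracy)."""
    rates = [sum(g) / len(g) for g in groups]
    return max(rates) - min(rates)


def homogeneity_p_value(groups: list[list[int]], n_perm: int = 2000, seed: int = 0) -> float:
    """Permutation p-value for the hypothesis that all variants share one difficulty.

    groups[i] holds 0/1 correctness scores for variant i.
    """
    rng = random.Random(seed)
    observed = accuracy_range(groups)
    pooled = [score for g in groups for score in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        # Reassign scores to variants at random, keeping group sizes fixed.
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if accuracy_range(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value suggests at least one variant is easier or harder than the rest and deserves a manual look; a large one is consistent with a homogeneous bank, though it does not prove it.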

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generation and validation steps could be tested on other STEM subjects to check transferability.
  • If the proxy relationship holds, institutions could automate much of the effort required to produce varied, secure problem sets.
  • Repeated use of such banks might allow researchers to measure whether actual cheating rates decline in practice.
  • Extending the approach to include automatic difficulty calibration could further reduce the need for human review.

Load-bearing premise

Language model performance patterns on the generated problems serve as a reliable proxy for the conceptual difficulty experienced by human students.

What would settle it

A direct comparison in which student success rates on the isomorphic variants show no statistically significant correlation with the accuracy rates of the validating language models.
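The comparison described here reduces to a correlation between two per-item accuracy vectors. The following is a minimal sketch of Pearson's coefficient in plain Python; the function name `pearson` is an assumption, and the sample numbers in the usage note are invented for illustration, not taken from the paper.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Made-up per-item accuracies, for illustration only.
student_accuracy = [0.82, 0.64, 0.71, 0.45, 0.90]
lm_accuracy = [0.75, 0.60, 0.80, 0.50, 0.85]
r = pearson(student_accuracy, lm_accuracy)
```

A coefficient near zero on real deployment data, together with a significance test, would be the kind of evidence that undermines the proxy assumption.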

Figures

Figures reproduced from arXiv: 2602.05114 by Leo Murch, Naiming Liu, Richard Baraniuk, Shashank Sonkar, Spencer Moore, Tong Wan, Zhongzhou Chen.

Figure 1: Comparison of student data accuracy distribution (Top) and LM simulated accuracy (Bottom) of problem bank
Original abstract

Traditional synchronous STEM assessments face growing challenges including accessibility barriers, security concerns from resource-sharing platforms, and limited comparability across institutions. We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations (numeric values, spatial relations) alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models (LMs) (0.6B-32B) and compare against actual student performance (N>200) across three midterm exams. Results show that 73% of deployed banks achieve statistically homogeneous difficulty, and LM patterns correlate strongly with student performance (Pearson's $\rho$ up to 0.594). Additionally, LMs successfully identify problematic variants, such as ambiguous problem texts. Model scale also proves critical for effective validation, where extremely small (<4B) and large (>14B) models exhibit floor and ceiling effects respectively, making mid-sized models optimal for detecting difficulty outliers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents a GenAI framework using prompt chaining and tool use to generate large banks of isomorphic physics problems that vary surface features while preserving conceptual difficulty. It validates the approach empirically by evaluating items with 17 open-source LMs (0.6B–32B parameters) and comparing performance patterns against student responses from N>200 participants across three midterm exams, reporting that 73% of deployed banks show statistically homogeneous difficulty and that LM patterns correlate with student performance (Pearson’s ρ up to 0.594). Model scale is identified as critical, with mid-sized models optimal for outlier detection due to floor/ceiling effects in smaller and larger models.

Significance. If the central claims hold, the work offers a practical route to scalable, secure, and comparable asynchronous STEM assessments that address accessibility and academic-integrity concerns. The use of multiple independent open-source models together with real student performance data constitutes a concrete strength, moving beyond purely synthetic validation. The moderate correlation and lack of qualitative error analysis, however, leave the reliability of LM patterns as a proxy for human conceptual difficulty open to further scrutiny.

major comments (2)
  1. [Abstract / §4 (Empirical Validation)] Abstract and validation results: The headline claim that 73% of banks achieve statistically homogeneous difficulty and that LMs correlate with student performance (ρ up to 0.594) rests on treating LM accuracy/error patterns as a proxy for conceptual difficulty experienced by students. No qualitative error analysis, per-concept breakdown, or controls are described to establish that LMs and students fail the same items for the same reasons rather than shared surface cues; this assumption is load-bearing for the pre-deployment validation argument.
  2. [Abstract / §4.2] Abstract: The statement that mid-sized models are optimal for detecting difficulty outliers is presented without reporting the precise size thresholds, the statistical criteria used to identify outliers, or the number of banks on which the floor/ceiling effects were observed, making it difficult to evaluate the robustness of the scale recommendation.
minor comments (3)
  1. [Abstract] The abstract states N>200 students but provides no exclusion criteria, demographic details, or exact statistical tests (e.g., which homogeneity test and significance threshold) used to classify banks as homogeneous.
  2. [§3 (Generation Framework)] Prompt templates and tool-use specifications for controlling numeric values and spatial relations are referenced but not illustrated with concrete examples, limiting reproducibility.
  3. [Results] Figure or table captions should explicitly state the number of problem banks and items underlying the 73% homogeneity figure and the reported ρ values.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and have made revisions to strengthen the paper where indicated.

  1. Referee: [Abstract / §4 (Empirical Validation)] Abstract and validation results: The headline claim that 73% of banks achieve statistically homogeneous difficulty and that LMs correlate with student performance (ρ up to 0.594) rests on treating LM accuracy/error patterns as a proxy for conceptual difficulty experienced by students. No qualitative error analysis, per-concept breakdown, or controls are described to establish that LMs and students fail the same items for the same reasons rather than shared surface cues; this assumption is load-bearing for the pre-deployment validation argument.

    Authors: We acknowledge the importance of demonstrating that the alignment between LM and student performance patterns reflects shared conceptual difficulties rather than coincidental surface features. Our validation is primarily quantitative, relying on Pearson correlations and tests for statistical homogeneity of difficulty across banks. To address this concern, we have added a qualitative error analysis in the revised manuscript. This includes examining specific error types (e.g., conceptual misconceptions vs. calculation errors) for a representative sample of 50 items across the three exams, with per-concept breakdowns for topics like kinematics and circuits. We also incorporated controls by comparing performance on variants with controlled surface features. These additions support that the proxy is reasonable for pre-deployment screening. revision: yes

  2. Referee: [Abstract / §4.2] Abstract: The statement that mid-sized models are optimal for detecting difficulty outliers is presented without reporting the precise size thresholds, the statistical criteria used to identify outliers, or the number of banks on which the floor/ceiling effects were observed, making it difficult to evaluate the robustness of the scale recommendation.

    Authors: We appreciate this request for greater precision. The manuscript identifies extremely small models (<4B parameters) as showing floor effects and large models (>14B) as showing ceiling effects. In the revision, we now specify the exact thresholds used: models with average accuracy below 25% across banks for floor effects and above 85% for ceiling effects, determined via one-sample t-tests against chance performance (p < 0.01). These patterns were observed consistently across all 22 deployed problem banks. We have added a table listing the 17 models by size category and a figure plotting accuracy by model scale for each bank to illustrate the outlier detection capability of mid-sized models (4B to 14B parameters). revision: yes
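The floor and ceiling screening described in the simulated response above can be sketched as a simple filter on mean accuracy. The 25% and 85% cut-offs are the ones quoted in that response; the function names and the model labels in the usage note are hypothetical, and the t-test step mentioned there is omitted.

```python
FLOOR = 0.25    # mean accuracy below this is treated as a floor effect
CEILING = 0.85  # mean accuracy above this is treated as a ceiling effect


def classify(mean_accuracy: float) -> str:
    """Label a model by where its mean accuracy falls relative to the cut-offs."""
    if mean_accuracy < FLOOR:
        return "floor"
    if mean_accuracy > CEILING:
        return "ceiling"
    return "usable"


def usable_models(per_bank_accuracy: dict[str, list[float]]) -> list[str]:
    """Return models whose accuracy, averaged over banks, avoids both extremes."""
    kept = []
    for model, accuracies in per_bank_accuracy.items():
        mean_accuracy = sum(accuracies) / len(accuracies)
        if classify(mean_accuracy) == "usable":
            kept.append(model)
    return kept
```

For example, `usable_models({"tiny": [0.1, 0.2], "mid": [0.5, 0.6], "big": [0.9, 0.95]})` keeps only `"mid"`, since a model that gets nearly everything right or wrong cannot separate easy variants from hard ones.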

Circularity Check

0 steps flagged

No significant circularity; validation uses external student data and independent models


The paper reports empirical results from comparing LM performance patterns on generated isomorphic problems against actual student performance data (N>200 responses across three midterms) and evaluations from 17 independent open-source models (0.6B-32B parameters). The 73% homogeneous difficulty rate and Pearson's ρ up to 0.594 are direct statistical comparisons to these external benchmarks rather than any self-referential derivation, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, ansatzes, or uniqueness theorems are invoked that reduce claims to prior author work or internal inputs by construction. The framework remains self-contained against external falsifiable data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LM patterns proxy student difficulty and that generated problems maintain conceptual equivalence across variations.

axioms (1)
  • domain assumption: Language model performance patterns can serve as a proxy for human student difficulty in physics problems.
    Invoked when claiming strong correlation and using LMs to identify problematic variants.

pith-pipeline@v0.9.0 · 5762 in / 1241 out tokens · 79872 ms · 2026-05-21T13:18:00.918239+00:00 · methodology

