When AI Evaluates Its Own Work: Validating Learner-Initiated, AI-Generated Physics Practice Problems
Pith reviewed 2026-05-19 01:25 UTC · model grok-4.3
The pith
Only a curated subset of quality checks is needed to validate AI-generated physics practice problems for student appeal and soundness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that student preferences for AI-generated physics problems are reliably addressed by a small subset of structural and learner-visible quality attributes, as determined through expert labeling, LLM benchmarking, random-forest prediction of choices, and exit surveys. This shows that exhaustive scoring is unnecessary for scalable formative assessment.
What carries the argument
Random-forest models that identify the quality attributes predicting student preferences, benchmarked against expert labels via LLM judges.
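To make the attribute-ranking idea concrete, here is a minimal sketch, not the authors' pipeline. It assumes binary quality attributes and a binary `chosen` label per problem, and it uses a deliberately tiny stand-in for a random forest: depth-one trees (stumps) fit on bootstrap samples with random feature subsets, ranked by accumulated Gini impurity decrease. All names (`best_stump`, `stump_forest_importance`, the synthetic `rows` and `chosen`) are hypothetical, and the data are generated so that attribute 0 drives the choice by construction.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_stump(rows, labels, features):
    """Pick the binary feature (among `features`) with the largest impurity decrease."""
    parent = gini(labels)
    best_feature, best_gain = None, 0.0
    for f in features:
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        children = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if parent - children > best_gain:
            best_feature, best_gain = f, parent - children
    return best_feature, best_gain

def stump_forest_importance(rows, labels, n_trees=200, max_features=2, seed=0):
    """Normalized impurity decrease per feature, summed over bootstrapped stumps."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    totals = [0.0] * d
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of items
        feats = rng.sample(range(d), max_features)   # random feature subset
        f, gain = best_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        if f is not None:
            totals[f] += gain
    total = sum(totals)
    return [t / total for t in totals] if total > 0 else totals

# Synthetic toy data (an assumption, not study data): 4 binary attributes,
# and `chosen` follows attribute 0 about 90% of the time.
data_rng = random.Random(1)
rows = [[data_rng.randint(0, 1) for _ in range(4)] for _ in range(200)]
chosen = [r[0] if data_rng.random() < 0.9 else 1 - r[0] for r in rows]

importance = stump_forest_importance(rows, chosen)
print([round(v, 3) for v in importance])
```

A real analysis would more plausibly use a full library implementation with deeper trees and held-out evaluation; the sketch only shows how an importance ranking can single out a small set of predictive attributes.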
If this is right
- Scalable formative assessment in physics becomes feasible without exhaustive expert scoring.
- A practical blueprint supports deploying real-time AI-generated practice problems.
- Both technical soundness and user appeal are maintained by structural and learner-visible checks.
- The approach extends directly to other quantitative disciplines.
Where Pith is reading between the lines
- These minimal checks could be embedded in the chatbot to filter problems before students see them.
- The same reduced set of attributes might apply to AI-generated problems in mathematics or chemistry.
- Repeating the preference trials with students at different levels could test broader validity.
Load-bearing premise
Expert labels on the quality attributes are treated as ground truth without systematic biases or missing key pedagogical dimensions.
What would settle it
A new cohort of students whose preferences contradict the minimal attribute set selected by the random-forest models would show that the core checks are not sufficient.
Original abstract
Large language models (LLMs) can now generate physics practice problems in real time, yet the educational value of these items hinges on rapid, reliable post-generation vetting. In this exploratory study, we investigated which automated checks are both technically feasible and pedagogically meaningful when exercises are produced on demand within a chatbot interface. A cohort of 34 introductory-physics students generated and attempted 543 practice problems during exam preparation. Each item was labeled by an expert on a wide range of quality attributes and presented to the learners in pairs to record their preference. We then (i) benchmarked three commodity LLMs as "judges" against the expert labels, (ii) quantified which attributes predict student choice via random-forest models, and (iii) triangulated these results with free-form exit surveys. Only a small subset of the original metric items proved necessary to reliably address student preferences either directly or by proxy. The study demonstrates that scalable formative assessment does not require exhaustive scoring: a carefully curated core of structural and learner-visible checks is sufficient to ensure both technical soundness and user appeal. The findings provide a practical blueprint for deploying real-time, AI-generated practice in physics and other quantitative disciplines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an exploratory study in which 34 introductory-physics students generated and attempted 543 AI-produced practice problems. Each item received expert labels on a broad set of quality attributes, was presented to learners in pairs to elicit preference data, and was used to benchmark three commodity LLMs as judges, to train random-forest models identifying which attributes predict student choice, and to triangulate with free-form exit surveys. The central claim is that only a small curated subset of the original attributes is required to ensure both technical soundness and learner appeal, thereby showing that scalable formative assessment need not rely on exhaustive scoring.
Significance. If the central claim holds, the work supplies a practical blueprint for real-time vetting of learner-initiated, AI-generated physics exercises. Its strengths include the use of held-out preference data for random-forest feature selection, triangulation across expert labels, LLM judgments, and student surveys on a sizable corpus of 543 items, and an explicit demonstration that a minimal set of structural and learner-visible checks can substitute for exhaustive metrics. These elements could inform the design of adaptive tutoring systems in quantitative disciplines.
major comments (1)
- [Abstract and expert-labeling description] The study treats labels assigned by a single expert as ground truth for both LLM benchmarking and random-forest predictor selection, yet reports no inter-rater reliability statistics, intra-rater consistency checks, or multi-expert comparison. Any systematic bias in these labels directly determines which attributes are retained as the 'necessary' minimal set. Without reliability metrics, the sufficiency conclusion, which carries the central claim, cannot be generalized.
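The inter-rater statistic the referee asks for could be as simple as Cohen's kappa between two labelers on the same items. The following is a minimal sketch under that assumption; the function name `cohen_kappa` and the two example label lists are hypothetical, and the paper itself reports no second rater.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labeled independently
    # with their own marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels for one binary attribute on ten items.
expert_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
expert_2 = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(expert_1, expert_2)
print(round(kappa, 3))
```

Kappa is computed per attribute, so a skewed attribute with one dominant class can show high raw agreement and still a low kappa.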
minor comments (1)
- [Abstract] The abstract omits error bars on reported metrics, cross-validation statistics for the random-forest models, and any quantitative summary of inter-rater agreement.
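One way to supply the missing error bars on a reported proportion (for example, a judge's agreement rate with the expert) is a Wilson score interval. This is a sketch under that assumption, not the method used in the paper; the function name `wilson_interval` and the counts in the example are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Hypothetical example: 45 of 50 judge labels agree with the expert.
low, high = wilson_interval(45, 50)
print(round(low, 3), round(high, 3))
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small or skewed classes, which matters when some attribute classes are rare.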
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our exploratory study. Below we respond point-by-point to the major comment, acknowledging limitations while highlighting the supporting triangulation in our design.
- Referee: Abstract and expert-labeling description: the study treats labels assigned by a single expert as ground truth for both LLM benchmarking and random-forest predictor selection, yet reports no inter-rater reliability statistics, intra-rater consistency checks, or multi-expert comparison. Any systematic bias in these labels directly determines which attributes are retained as the 'necessary' minimal set. Without reliability metrics, the sufficiency conclusion, which carries the central claim, cannot be generalized.
Authors: We agree that single-expert labeling constitutes a limitation for generalizability. The 543 items were labeled by one experienced physics educator to ensure internal consistency within this pilot-scale study; resource constraints precluded multi-rater data collection or formal reliability statistics. Nevertheless, the retained minimal attribute set was validated through independent held-out student preference data in the random-forest models and cross-checked against free-form exit surveys, providing convergent evidence beyond the expert labels alone. We will add an explicit limitations subsection in the revised manuscript discussing single-rater bias and recommending multi-expert replication in future work.
Revision: partial
Circularity Check
Empirical study relies on independent expert labels, student preference data, and external LLM benchmarking with no self-referential derivation
full rationale
The paper describes an empirical workflow: students generate problems, an expert assigns quality labels, students record pairwise preferences, LLMs are benchmarked directly against the expert labels, random-forest models are trained on the preference data to identify predictive attributes, and results are triangulated with exit surveys. None of these steps constitutes a derivation that reduces to its own inputs by construction, a fitted parameter renamed as a prediction of the same quantity, or a load-bearing claim justified solely by self-citation. The central conclusion—that a curated subset of attributes suffices—emerges from cross-validation across distinct data sources (expert labels, student choices, survey responses) rather than from any internal redefinition or circular fitting. This is a standard self-contained empirical analysis against external benchmarks.
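The step of benchmarking an LLM judge directly against expert labels can be illustrated with per-class precision and recall. This is a minimal sketch with hypothetical names (`per_class_precision_recall`, `expert`, `judge`) and made-up labels; it is not taken from the paper's code.

```python
def per_class_precision_recall(expert, judge):
    """Per-class (precision, recall, support), treating expert labels as reference."""
    if len(expert) != len(judge):
        raise ValueError("label lists must have the same length")
    results = {}
    for c in sorted(set(expert) | set(judge)):
        tp = sum(e == c and j == c for e, j in zip(expert, judge))
        fp = sum(e != c and j == c for e, j in zip(expert, judge))
        fn = sum(e == c and j != c for e, j in zip(expert, judge))
        # None marks an undefined ratio (class never predicted or never present).
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        results[c] = (precision, recall, tp + fn)
    return results

# Hypothetical labels for one attribute on six items.
expert = ["yes", "yes", "no", "yes", "no", "yes"]
judge = ["yes", "no", "no", "yes", "yes", "yes"]
scores = per_class_precision_recall(expert, judge)
for label, (precision, recall, support) in scores.items():
    print(label, precision, recall, support)
```

Reporting the support next to each score makes it visible when a class is too rare for its precision or recall to be trusted.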
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Expert labels on quality attributes accurately capture pedagogically relevant dimensions of problem quality.
- domain assumption Student preferences recorded in paired comparisons reflect genuine differences in perceived usefulness for exam preparation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
Tag: unclear (relation between the paper passage and the cited Recognition theorem).
Passage: Each item was labeled by an expert on a wide range of quality attributes... bloom-level-of-exercise, task-is-solvable
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Skewedness. Although we collected a total of N = 543 generated exercises, the class distribution for some metrics was highly skewed; indeed, certain classes C_i were underrepresented or even absent. To ensure that per-class estimates (e.g., precision or recall for each C_i) have sufficiently low sampling variability before treating them as definitive...
- [2] Interrelatedness. As Fig. 5 shows, several metrics are correlated with each other [61]. This may allow us to later use a metric that is more reliable as a proxy for relevant, but less reliable, metrics. Added here is the variable chosen if the student chose to work on this problem in the forced choice shown in Fig. 1. Three highly significant (p < 0.001)...
- [3] includes-solution-strategy in its own right and as a proxy for depth measures,
- [4] llm-solution-is-correct,
- [5] task-is-specific-and-complete,
- [6] measurement-unit-is-clearly-stated. Based on our findings in Table V, the first two metrics are best handled with a reasoning model, while the remaining two tasks can be handled by a cheaper, lower-latency non-reasoning model. V. LIMITATIONS The evidence assembled here, while encouraging, must be interpreted with caution because several structural const...
- [7] F. Reif, J. H. Larkin, and G. C. Brackett, Teaching general learning and problem-solving skills, American Journal of Physics 44, 212 (1976)
- [8] A. B. Arons, Teaching introductory physics (John Wiley & Sons, Hoboken, NJ, 1996)
- [9] N. R. Council et al., How people learn: Brain, mind, experience, and school: Expanded edition (National Academies Press, Washington, DC, 2000)
- [10] R. J. Dufresne and W. J. Gerace, Assessing-to-learn: Formative assessment in physics instruction, The Physics Teacher 42, 428 (2004)
- [11] C. Wagner and A. Vaterlaus, Promoting formative assessment in high school teaching of physics, Lat. Am. J. Phys. Educ 1, 6 (2012)
- [12] D. A. Kashy, G. Albertelli, W. Bauer, E. Kashy, and M. Thoennessen, Influence of non-moderated and moderated discussion sites on student success, Journal of Asynchronous Learning Networks 7, 31 (2003)
- [13] G. Kortemeyer, E. Kashy, W. Benenson, and W. Bauer, Experiences using the open-source learning content management and assessment system LON-CAPA in introductory physics courses, Am. J. Phys 76, 438 (2008)
- [14] J. Risley, Motivating students to learn physics using an online homework system, Newsletter of the APS Forum on Education Fall, 3 (2001)
- [15] a. A. T. Guoqing Tang, Increasing student’s time on task in calculus and general physics courses through webassign, in ASEE Annual Conference and Exposition (2002) pp. 7.660.1 – 7.660.20
- [16] S. W. Bonham, D. L. Deardorff, and R. J. Beichner, Comparison of student performance using web and paper-based homework in college-level physics, Journal of research in science teaching 40, 1050 (2003)
- [17] B. Gutmann, G. Gladding, M. Lundsgaard, and T. Stelzer, Mastery-style homework exercises in introductory physics courses: Implementation matters, Phys. Rev. Phys. Educ. Res. 14, 010128 (2018)
- [18] A. Sperling and J. Lincoln, Artificial intelligence and high school physics, The Physics Teacher 62, 314 (2024)
- [19] P. Wattanakasiwich, K. Kaewkhong, and D. Katwibun, Physics instructors’ acceptance and implementation of generative AI, Physical Review Physics Education Research 21, 010155 (2025)
- [20] S. Küchemann, S. Steinert, N. Revenga, M. Schweinberger, Y. Dinc, K. E. Avila, and J. Kuhn, Can ChatGPT support prospective teachers in physics task development?, Phys. Rev. Phys. Educ. Res. 19, 020128 (2023)
- [21] J. Lademann, J. Henze, and S. Becker-Genschow, Augmenting learning environments using AI custom chatbots: Effects on learning performance, cognitive load, and affective variables, Physical Review Physics Education Research 21, 010147 (2025)
- [22] P. Bitzenbauer, ChatGPT in physics education: A pilot study on easy-to-implement activities, Contemporary Educational Technology 15, ep430 (2023)
- [23] G. Kortemeyer, Ethel: A virtual teaching assistant, Phys. Teach. 62, 698 (2024)
- [24] G. Kortemeyer and J. Nöhl, Assessing confidence in ai-assisted grading of physics exams through psychometrics: An exploratory study, Physical Review Physics Education Research 21, 010136 (2025)
- [25] Z. Chen and T. Wan, Grading explanations of problem-solving process and generating feedback using large language models at human-level accuracy, Physical Review Physics Education Research 21, 010126 (2025)
- [26] B. Gregorcic, G. Polverini, and A. Sarlah, ChatGPT as a tool for honing teachers’ Socratic dialogue skills, Physics Education 59, 045005 (2024)
- [27] M. A. R. Vasconcelos and R. P. Dos Santos, Enhancing STEM learning with ChatGPT and Bing chat as objects to think with: a case study, Eurasia Journal of Mathematics, Science and Technology Education 19, em2296 (2023)
- [28] L. Ding, T. Li, S. Jiang, and A. Gapud, Students’ perceptions of using ChatGPT in a physics class as a virtual tutor, International Journal of Educational Technology in Higher Education 20, 63 (2023)
- [29] F. Balabdaoui, N. Dittmann-Domenichini, H. Grosse, C. Schlienger, and G. Kortemeyer, A survey on students’ use of AI at a technical university, Discover Education 3, 51 (2024)
- [30] G. Kortemeyer, Could an artificial-intelligence agent pass an introductory physics course?, Phys. Rev. Phys. Educ. Res. 19, 010132 (2023)
- [31] K. A. Pimbblet and L. J. Morrell, Can ChatGPT pass a physics degree? making a case for reformation of assessment of undergraduate degrees, European Journal of Physics 46, 015702 (2024)
- [32] G. Polverini and B. Gregorcic, Performance of ChatGPT on the test of understanding graphs in kinematics, Phys. Rev. Phys. Educ. Res. 20, 010109 (2024)
- [33] G. Kortemeyer, M. Babayeva, G. Polverini, R. Widenhorn, and B. Gregorcic, Multilingual performance of a multimodal artificial intelligence system on multisubject physics concept inventories, Physical Review Physics Education Research 21, 020101 (2025)
- [34]
- [35]
- [36] P. A. Kirschner, J. Sweller, and R. E. Clark, Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching, Educational Psychologist 41, 75 (2006)
- [37] S. El-Adawy, I. Liao, V. Lad, M. Abdelhafez, and P. Dourmashkin, Streamlining physics problem generation to support physics teachers in using generative artificial intelligence, The Physics Teacher 62, 595 (2024)
- [38]
- [39]
- [40] W. Fakcharoenphol, E. Potter, and T. Stelzer, What students learn when studying physics practice exam problems, Physical Review Special Topics – Physics Education Research 7, 010107 (2011)
- [41] R. Gautreau and L. Novemsky, Concepts first – a small group approach to physics learning, American journal of Physics 65, 418 (1997)
- [42] W. Fakcharoenphol and T. Stelzer, Physics exam preparation: A comparison of three methods, Physical Review Special Topics – Physics Education Research 10, 010108 (2014)
- [43] M. Rodriguez and G. Potvin, Frequent small group interactions improve student learning gains in physics: Results from a nationally representative pre-post study of four-year colleges, Physical Review Physics Education Research 17, 020131 (2021)
- [44] M. J. Gierl, O. Bulut, Q. Guo, and X. Zhang, Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review, Review of educational research 87, 1082 (2017)
- [45] T. F. Scott and D. Schumayer, Central distractors in force concept inventory data, Physical review physics education research 14, 010106 (2018)
- [46]
- [47] A. R. Mota, N. Didiş Körhasan, K. Miller, and E. Mazur, Homework as a metacognitive tool in an undergraduate physics course, Physical Review Physics Education Research 15, 010136 (2019)
- [48] C. Walkington and M. L. Bernacki, Personalizing algebra to students’ individual interests in an intelligent tutoring system: Moderators of impact, International Journal of Artificial Intelligence in Education 29, 58 (2019)
- [49] G. Taasoobshirazi and M. Carr, A review and critique of context-based physics instruction and assessment, Educational Research Review 3, 155 (2008)
- [50] Z. Dulger and F. Ogan-Bekiroglu, Students’ metacognition knowledge and skills during physics problem-solving process, Physical Review Physics Education Research 21, 020106 (2025)
- [51] C. Harrison, C. P. Constantinou, C. F. Correia, M. Grangeat, M. Hähkiöniemi, M. Livitzis, P. Nieminen, N. Papadouris, E. Rached, N. Serret, et al., Assessment on-the-fly: Promoting and collecting evidence of learning through dialogue, Transforming assessment: Through an interplay between practice, research and policy, 83 (2018)
- [52] D. Sadigh, S. A. Seshia, and M. Gupta, Automating exercise generation: A step towards meeting the MOOC challenge for embedded systems, in Proceedings of the workshop on embedded and cyber-physical systems education (Association for Computing Machinery, New York, NY, 2012) pp. 1–8
- [53] M. N. Demaidi, M. M. Gaber, and N. Filer, Evaluating the quality of the ontology-based auto-generated questions, Smart Learning Environments 4, 10.1186/s40561-017-0046-6 (2017)
- [54] V. Nentwich, N. Fischer, A. C. Sonnenbichler, and A. Geyer-Schulz, Computer aided exercise generation - a framework for human interaction in the automated exercise generation process, in Proceedings of the 13th International Joint Conference on e-Business and Telecommunications (SCITEPRESS, Setúbal, Portugal, 2016) pp. 57–63
- [55] I. Aldabe, M. L. De Lacalle, M. Maritxalar, E. Martinez, and L. Uria, Arikiturri: an automatic question generator based on corpora and NLP techniques, in Intelligent Tutoring Systems: 8th International Conference, ITS 2006, Jhongli, Taiwan, June 26-30, 2006. Proceedings 8 (Springer, New York, NY, 2006) pp. 584–594
- [56] T. Freitas, Á. Neto, M. J. Pereira, and P. Henriques, NLP/AI based techniques for programming exercises generation, in 4th International Computer Programming Education Conference (ICPEC 2023), Vol. 112 (2023)
- [57] G. Chen, J. Yang, C. Hauff, and G.-J. Houben, LearningQ: a large-scale dataset for educational question generation, in Proceedings of the international AAAI conference on web and social media, Vol. 12 (Association for the Advancement of Artificial Intelligence, Washington, DC, 2018)
- [58] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021)
- [59] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, Measuring mathematical problem solving with the MATH dataset, arXiv preprint arXiv:2103.03874 (2021)
- [60]
- [61] J. Doughty, Z. Wan, A. Bompelli, J. Qayum, T. Wang, J. Zhang, Y. Zheng, A. Doyle, P. Sridhar, A. Agarwal, et al., A comparative study of AI-generated (gpt-4) and human-crafted MCQs in programming education, in Proceedings of the 26th Australasian Computing Education Conference (Association for Computing Machinery, New York, NY, 2024) pp. 114–123
- [62] G. Kortemeyer, J. Nöhl, and D. Onishchuk, Grading assistance for a handwritten thermodynamics exam using artificial intelligence: An exploratory study, Physical Review Physics Education Research 20, 020144 (2024)
- [63] M. X. Liu, F. Liu, A. J. Fiannaca, T. Koo, L. Dixon, M. Terry, and C. J. Cai, “We need structured output”: Towards user-centered constraints on large language model output, in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (Association for Computing Machinery, New York, NY,
- [64] R. V. Hogg, E. A. Tanis, and D. L. Zimmerman, Probability and Statistical Inference, 9th ed. (Pearson, Boston, MA, 2019) section 5.6, p. 202
- [65] L. A. Orawo, Confidence intervals for the binomial proportion: A comparison of four methods, Open Journal of Statistics 11, 806 (2021)
- [66] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, Survey of hallucination in natural language generation, ACM computing surveys 55, 1 (2023)
- [67] C. Spearman, The proof and measurement of association between two things, The American Journal of Psychology 15, 72 (1904)
- [68] T. M. Fruchterman and E. M. Reingold, Graph drawing by force-directed placement, Software: Practice and experience 21, 1129 (1991)
- [69] OpenAI, OpenAI, https://openai.com/ (accessed July 2025)
- [70] Microsoft, Azure AI Services, https://azure.microsoft.com/en-us/products/ai-services (accessed June 2024)
- [71] OpenAI, Hello GPT-4o, https://openai.com/index/hello-gpt-4o/ (accessed June 2024)
- [72] OpenAI, GPT-4o mini: advancing cost-efficient intelligence, https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (accessed February 2025)
- [73] OpenAI, OpenAI o3-mini, https://openai.com/index/openai-o3-mini/ (accessed February 2025)
- [74] L. Breiman, Random forests, Machine learning 45, 5 (2001)
- [75] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35, 24824 (2022)