When AI Evaluates Its Own Work: Validating Learner-Initiated, AI-Generated Physics Practice Problems

Gerd Kortemeyer; Tobias Geisler

arxiv: 2508.03085 · v2 · submitted 2025-08-05 · ⚛️ physics.ed-ph

Pith reviewed 2026-05-19 01:25 UTC · model grok-4.3

keywords: AI-generated problems · physics education · formative assessment · LLM evaluation · student preferences · quality metrics · random forest · scalable assessment

The pith

Only a curated subset of quality checks is needed to validate AI-generated physics practice problems for student appeal and soundness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can generate physics practice problems instantly, but ensuring they are useful requires vetting. This study had students create and try hundreds of such problems, with experts rating them on many attributes and students indicating preferences between pairs. By comparing how well different AI models match expert ratings and using machine learning to find which ratings predict student likes, the authors found that a small number of checks cover what matters. This means real-time generated exercises can be made reliable without checking everything, opening the door to more personalized practice in physics classes.

Core claim

The central discovery is that student preferences for AI-generated physics problems are reliably addressed by a small subset of structural and learner-visible quality attributes, as determined through expert labeling, LLM benchmarking, random-forest prediction of choices, and exit surveys. This shows that exhaustive scoring is unnecessary for scalable formative assessment.

What carries the argument

Random-forest models that identify the quality attributes predicting student preferences, benchmarked against expert labels via LLM judges.
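
As a rough illustration of the question the random-forest models answer (which attributes go along with a problem being chosen), here is a minimal sketch of a much simpler stand-in: the difference in choice rate between problems that have an attribute and problems that lack it. This is not the authors' analysis. The function name and the toy records are made up; only the attribute names appear elsewhere on this page.

```python
from typing import Dict, List


def choice_rate_gap(items: List[Dict[str, int]], attribute: str) -> float:
    """Return P(chosen | attribute present) - P(chosen | attribute absent)."""
    present = [it["chosen"] for it in items if it[attribute] == 1]
    absent = [it["chosen"] for it in items if it[attribute] == 0]
    if not present or not absent:
        raise ValueError(f"{attribute!r} does not vary in the data")
    return sum(present) / len(present) - sum(absent) / len(absent)


# Made-up toy records: 1 means the attribute is present / the problem was chosen.
toy_items = [
    {"includes-solution-strategy": 1, "task-is-solvable": 1, "chosen": 1},
    {"includes-solution-strategy": 1, "task-is-solvable": 1, "chosen": 1},
    {"includes-solution-strategy": 1, "task-is-solvable": 0, "chosen": 0},
    {"includes-solution-strategy": 0, "task-is-solvable": 1, "chosen": 0},
    {"includes-solution-strategy": 0, "task-is-solvable": 1, "chosen": 1},
    {"includes-solution-strategy": 0, "task-is-solvable": 0, "chosen": 0},
]

gaps = {
    name: choice_rate_gap(toy_items, name)
    for name in ("includes-solution-strategy", "task-is-solvable")
}
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f}")
```

Unlike a random forest, this per-attribute gap ignores interactions between attributes, so it is only useful as a first sanity check on preference data.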

If this is right

Scalable formative assessment in physics becomes feasible without exhaustive expert scoring.
A practical blueprint supports deploying real-time AI-generated practice problems.
Both technical soundness and user appeal are maintained by structural and learner-visible checks.
The approach extends directly to other quantitative disciplines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These minimal checks could be embedded in the chatbot to filter problems before students see them.
The same reduced set of attributes might apply to AI-generated problems in mathematics or chemistry.
Repeating the preference trials with students at different levels could test broader validity.

Load-bearing premise

Expert labels on the quality attributes are treated as ground truth without systematic biases or missing key pedagogical dimensions.

What would settle it

A new cohort of students whose preferences contradict the random-forest selected minimal attributes would show the core checks are not sufficient.

Figures

Figures reproduced from arXiv: 2508.03085 by Gerd Kortemeyer, Tobias Geisler.

**Figure 1.** Two interactive problems generated on-demand, showcasing how course-specific contextual information, in this case the ...

**Figure 2.** Attempting to solve the selected problem from Fig. 1.

**Figure 3.** Information flow in an enhanced chatbot that can generate verified problems on-the-fly and on-demand. The “exercise ...

**Figure 4.** Example of the definition for ...

**Figure 5.** Correlations between the metrics in a force-directed ...

Original abstract

Large language models (LLMs) can now generate physics practice problems in real time, yet the educational value of these items hinges on rapid, reliable post-generation vetting. In this exploratory study, we investigated which automated checks are both technically feasible and pedagogically meaningful when exercises are produced on demand within a chatbot interface. A cohort of 34 introductory-physics students generated and attempted 543 practice problems during exam preparation. Each item was labeled by an expert on a wide range of quality attributes and presented to the learners in pairs to record their preference. We then (i) benchmarked three commodity LLMs as “judges” against the expert labels, (ii) quantified which attributes predict student choice via random-forest models, and (iii) triangulated these results with free-form exit surveys. Only a small subset of the original metric items proved necessary to reliably address student preferences either directly or by proxy. The study demonstrates that scalable formative assessment does not require exhaustive scoring: a carefully curated core of structural and learner-visible checks is sufficient to ensure both technical soundness and user appeal. The findings provide a practical blueprint for deploying real-time, AI-generated practice in physics and other quantitative disciplines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A small curated set of checks can validate AI-generated physics problems for students, but single-expert labels without reliability checks weaken the claims.

Hi,

The punchline from this paper is that a carefully chosen small set of checks can handle validation for AI-generated physics problems without needing everything. It points to a practical way to scale formative assessment in physics.

The work applies LLM generation and judging to student-initiated problems in a real chatbot setting. They collected 543 items, had an expert label attributes, recorded student pair preferences, benchmarked LLMs against the labels, and ran random forests to find predictors. Triangulating with surveys is a plus. The new bit is identifying that minimal viable check set for this context, which isn't just repeating prior AI-ed stuff.

It does well by focusing on what actually matters for students and feasibility. The random forest on held-out data helps avoid overfitting in picking the subset. Where it could be stronger is the ground truth. Treating one expert's labels as fixed without reporting reliability or agreement stats is a limitation. Any bias there would affect the LLM comparisons and the attribute selection. Also missing are details like error bars on the models.

This paper suits people in physics education research or those building AI tutoring systems. If you're after actionable insights for deployment, it's worth a look. I think it should go to peer review. The exploratory nature and the empirical results justify referee time, particularly to address the labeling and stats questions.

Referee Report

1 major / 1 minor

Summary. The manuscript reports an exploratory study in which 34 introductory-physics students generated and attempted 543 AI-produced practice problems. Each item received expert labels on a broad set of quality attributes, was presented to learners in pairs to elicit preference data, and was used to benchmark three commodity LLMs as judges, to train random-forest models identifying which attributes predict student choice, and to triangulate with free-form exit surveys. The central claim is that only a small curated subset of the original attributes is required to ensure both technical soundness and learner appeal, thereby showing that scalable formative assessment need not rely on exhaustive scoring.

Significance. If the central claim holds, the work supplies a practical blueprint for real-time vetting of learner-initiated, AI-generated physics exercises. Its strengths include the use of held-out preference data for random-forest feature selection, triangulation across expert labels, LLM judgments, and student surveys on a sizable corpus of 543 items, and an explicit demonstration that a minimal set of structural and learner-visible checks can substitute for exhaustive metrics. These elements could inform the design of adaptive tutoring systems in quantitative disciplines.

major comments (1)

[Abstract and expert-labeling description] The study treats labels assigned by a single expert as ground truth for both LLM benchmarking and random-forest predictor selection, yet reports no inter-rater reliability statistics, intra-rater consistency checks, or multi-expert comparison. Because any systematic bias in these labels directly determines which attributes are retained as the 'necessary' minimal set, the absence of reliability metrics limits the generalizability of the sufficiency conclusion, which is load-bearing for the central claim.

minor comments (1)

[Abstract] The abstract omits error bars on reported metrics, cross-validation statistics for the random-forest models, and any quantitative summary of inter-rater agreement.
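
To make the request for error bars concrete, here is a minimal sketch of one common interval for a proportion such as a judge's accuracy, the Wilson score interval. It is a generic statistics example, not a method confirmed to be used in the paper; the function name and the counts are illustrative assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: 15 agreements out of 30 items, then the same rate at n = 300.
low, high = wilson_interval(15, 30)
low_big, high_big = wilson_interval(150, 300)
print(f"n=30:  [{low:.3f}, {high:.3f}]")
print(f"n=300: [{low_big:.3f}, {high_big:.3f}]")
```

The interval narrows as the sample grows, which is why the same observed rate is far less informative for a rarely occurring class than for a common one.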

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our exploratory study. Below we respond point-by-point to the major comment, acknowledging limitations while highlighting the supporting triangulation in our design.

Referee: Abstract and expert-labeling description: the study treats labels assigned by a single expert as ground truth for both LLM benchmarking and random-forest predictor selection, yet reports no inter-rater reliability statistics, intra-rater consistency checks, or multi-expert comparison. Because any systematic bias in these labels directly determines which attributes are retained as the 'necessary' minimal set, the absence of reliability metrics limits the generalizability of the sufficiency conclusion, which is load-bearing for the central claim.

Authors: We agree that single-expert labeling constitutes a limitation for generalizability. The 543 items were labeled by one experienced physics educator to ensure internal consistency within this pilot-scale study; resource constraints precluded multi-rater data collection or formal reliability statistics. Nevertheless, the retained minimal attribute set was validated through independent held-out student preference data in the random-forest models and cross-checked against free-form exit surveys, providing convergent evidence beyond the expert labels alone. We will add an explicit limitations subsection in the revised manuscript discussing single-rater bias and recommending multi-expert replication in future work. revision: partial
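
A multi-expert replication of the kind discussed above would typically report a chance-corrected agreement statistic. The following is a minimal sketch of Cohen's kappa for two raters, offered as an illustration only; the rater labels are invented, and nothing here is taken from the paper's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1 - expected)


# Invented labels from two hypothetical experts for one binary attribute.
expert_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
expert_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]
kappa = cohens_kappa(expert_1, expert_2)
print(f"kappa = {kappa:.3f}")
```

Raw agreement here is 6 of 8, but kappa is noticeably lower because both raters say "yes" most of the time, so much of that agreement is expected by chance.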

Circularity Check

0 steps flagged

Empirical study relies on independent expert labels, student preference data, and external LLM benchmarking with no self-referential derivation

The paper describes an empirical workflow: students generate problems, an expert assigns quality labels, students record pairwise preferences, LLMs are benchmarked directly against the expert labels, random-forest models are trained on the preference data to identify predictive attributes, and results are triangulated with exit surveys. None of these steps constitutes a derivation that reduces to its own inputs by construction, a fitted parameter renamed as a prediction of the same quantity, or a load-bearing claim justified solely by self-citation. The central conclusion—that a curated subset of attributes suffices—emerges from cross-validation across distinct data sources (expert labels, student choices, survey responses) rather than from any internal redefinition or circular fitting. This is a standard self-contained empirical analysis against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is an empirical user study whose central claim depends on the validity of expert annotations as ground truth and on the assumption that paired student preferences serve as a meaningful proxy for educational value.

axioms (2)

domain assumption Expert labels on quality attributes accurately capture pedagogically relevant dimensions of problem quality.
These labels are used both to benchmark the three commodity LLMs and to train the random-forest models that predict student choice.
domain assumption Student preferences recorded in paired comparisons reflect genuine differences in perceived usefulness for exam preparation.
The random-forest analysis and the claim that a small subset of checks addresses student preferences rest on this interpretation of the preference data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Each item was labeled by an expert on a wide range of quality attributes... bloom-level-of-exercise, task-is-solvable

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 2 internal anchors

[1] 30-sample rule

Skewedness. Although we collected a total of N = 543 generated exercises, the class distribution for some metrics was highly skewed; indeed, certain classes C_i were underrepresented or even absent. To ensure that per-class estimates (e.g., precision or recall for each C_i) have sufficiently low sampling variability before treating them as definitive...
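
The per-class estimates mentioned in this excerpt can be sketched as follows: precision and recall of an LLM judge against expert labels, with a flag for classes that have too few expert-labeled examples. The threshold of 30 is an assumption read off the entry's "30-sample rule" label, since the excerpt is truncated before it states the rule; the function name and the toy labels are also made up.

```python
from collections import Counter
from typing import Dict, List

MIN_SUPPORT = 30  # assumed threshold, inferred from the "30-sample rule" label


def per_class_report(expert: List[str], judge: List[str]) -> Dict[str, dict]:
    """Per-class precision and recall of judge labels against expert labels."""
    if len(expert) != len(judge):
        raise ValueError("label lists must have equal length")
    support = Counter(expert)    # how often the expert assigned each class
    predicted = Counter(judge)   # how often the judge assigned each class
    hits = Counter(e for e, j in zip(expert, judge) if e == j)
    report = {}
    for cls in sorted(set(expert) | set(judge)):
        report[cls] = {
            "precision": hits[cls] / predicted[cls] if predicted[cls] else None,
            "recall": hits[cls] / support[cls] if support[cls] else None,
            "support": support[cls],
            "reliable": support[cls] >= MIN_SUPPORT,
        }
    return report


# Toy, deliberately skewed data: 40 "ok" items and only 5 "flawed" ones.
expert = ["ok"] * 40 + ["flawed"] * 5
judge = ["ok"] * 38 + ["flawed"] * 2 + ["flawed"] * 3 + ["ok"] * 2
report = per_class_report(expert, judge)
for cls, row in report.items():
    print(cls, row)
```

In this toy run the "flawed" class gets a precision and recall of 0.6, but with only five expert-labeled examples the estimate is flagged as unreliable rather than treated as definitive.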
[2] problem description quality

Interrelatedness. As Fig. 5 shows, several metrics are correlated with each other [61]. This may allow us to later use a metric that is more reliable as a proxy for relevant, but less reliable, metrics. Added here is the variable chosen if the student chose to work on this problem in the forced-choice shown in Fig. 1. Three highly significant (p < 0.001)...
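
A correlation between two ordinal metrics, of the kind this excerpt describes, can be sketched with a rank correlation. The example below computes Spearman's rho with average ranks for ties. That the paper uses exactly this coefficient is an assumption (its reference list does include Spearman's 1904 article); the function names and the sample ratings are illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / (sx * sy)


# Invented ratings of five problems on two hypothetical ordinal metrics.
metric_a = [1, 2, 2, 3, 4]
metric_b = [1, 1, 2, 3, 3]
print(f"rho = {spearman_rho(metric_a, metric_b):.3f}")
```

A strong correlation between a hard-to-judge metric and an easy-to-judge one is what would justify using the latter as a proxy, as the excerpt suggests.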
[3] includes-solution-strategy in its own right and as a proxy for depth measures,

[4] llm-solution-is-correct,

[5] task-is-specific-and-complete,

[6] measurement-unit-is-clearly-stated. Based on our findings in Table V, the first two metrics are best handled with a reasoning model, while the remaining two tasks can be handled by a cheaper, lower-latency non-reasoning model. V. LIMITATIONS The evidence assembled here, while encouraging, must be interpreted with caution because several structural const...
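
The split described in the excerpt above, where two checks go to a reasoning model and two to a cheaper model, can be sketched as a small routing table. This is a hypothetical illustration: the tier labels, the function names, the stub judges and the "all checks must pass" policy are assumptions, and the mapping assumes the four metrics are listed in the order the excerpt refers to. Only the check names come from the text.

```python
from typing import Callable, Dict

# Check names come from the excerpt; the tier labels are hypothetical.
CHECK_TIERS: Dict[str, str] = {
    "includes-solution-strategy": "reasoning",
    "llm-solution-is-correct": "reasoning",
    "task-is-specific-and-complete": "fast",
    "measurement-unit-is-clearly-stated": "fast",
}

# A judge takes (check name, problem text) and returns True if the check passes.
Judge = Callable[[str, str], bool]


def vet_problem(problem: str, judges: Dict[str, Judge]) -> Dict[str, bool]:
    """Run every core check with the judge assigned to its tier."""
    return {check: judges[tier](check, problem) for check, tier in CHECK_TIERS.items()}


def is_releasable(results: Dict[str, bool]) -> bool:
    """Assumed policy: show the problem only if all core checks pass."""
    return all(results.values())


# Stub judges standing in for real model calls.
def stub_reasoning_judge(check: str, problem: str) -> bool:
    return True


def stub_fast_judge(check: str, problem: str) -> bool:
    # Toy rule: the unit check fails when the text names no unit.
    if check == "measurement-unit-is-clearly-stated":
        return " in m/s" in problem
    return True


judges = {"reasoning": stub_reasoning_judge, "fast": stub_fast_judge}
good = vet_problem("Find the final speed in m/s.", judges)
bad = vet_problem("Find the final speed.", judges)
print(is_releasable(good), is_releasable(bad))
```

Keeping the routing in one table makes it easy to move a check to a different tier if its measured reliability changes.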
[7] F. Reif, J. H. Larkin, and G. C. Brackett, Teaching general learning and problem-solving skills, American Journal of Physics 44, 212 (1976)

[8] A. B. Arons, Teaching introductory physics (John Wiley & Sons, Hoboken, NJ, 1996)

[9] N. R. Council et al., How people learn: Brain, mind, experience, and school: Expanded edition (National Academies Press, Washington, DC, 2000)

[10] R. J. Dufresne and W. J. Gerace, Assessing-to-learn: Formative assessment in physics instruction, The Physics Teacher 42, 428 (2004)

[11] C. Wagner and A. Vaterlaus, Promoting formative assessment in high school teaching of physics, Lat. Am. J. Phys. Educ 1, 6 (2012)

[12] D. A. Kashy, G. Albertelli, W. Bauer, E. Kashy, and M. Thoennessen, Influence of non-moderated and moderated discussion sites on student success, Journal of Asynchronous Learning Networks 7, 31 (2003)

[13] G. Kortemeyer, E. Kashy, W. Benenson, and W. Bauer, Experiences using the open-source learning content management and assessment system LON-CAPA in introductory physics courses, Am. J. Phys 76, 438 (2008)

[14] J. Risley, Motivating students to learn physics using an online homework system, Newsletter of the APS Forum on Education Fall, 3 (2001)

[15] a. A. T. Guoqing Tang, Increasing student’s time on task in calculus and general physics courses through webassign, in ASEE Annual Conference and Exposition (2002) pp. 7.660.1 – 7.660.20

[16] S. W. Bonham, D. L. Deardorff, and R. J. Beichner, Comparison of student performance using web and paper-based homework in college-level physics, Journal of research in science teaching 40, 1050 (2003)

[17] B. Gutmann, G. Gladding, M. Lundsgaard, and T. Stelzer, Mastery-style homework exercises in introductory physics courses: Implementation matters, Phys. Rev. Phys. Educ. Res. 14, 010128 (2018)

[18] A. Sperling and J. Lincoln, Artificial intelligence and high school physics, The Physics Teacher 62, 314 (2024)

[19] P. Wattanakasiwich, K. Kaewkhong, and D. Katwibun, Physics instructors’ acceptance and implementation of generative AI, Physical Review Physics Education Research 21, 010155 (2025)

[20] S. Küchemann, S. Steinert, N. Revenga, M. Schweinberger, Y. Dinc, K. E. Avila, and J. Kuhn, Can ChatGPT support prospective teachers in physics task development?, Phys. Rev. Phys. Educ. Res. 19, 020128 (2023)

[21] J. Lademann, J. Henze, and S. Becker-Genschow, Augmenting learning environments using AI custom chatbots: Effects on learning performance, cognitive load, and affective variables, Physical Review Physics Education Research 21, 010147 (2025)

[22] P. Bitzenbauer, ChatGPT in physics education: A pilot study on easy-to-implement activities, Contemporary Educational Technology 15, ep430 (2023)
[23] G. Kortemeyer, Ethel: A virtual teaching assistant, Phys. Teach. 62, 698 (2024)

[24] G. Kortemeyer and J. Nöhl, Assessing confidence in ai-assisted grading of physics exams through psychometrics: An exploratory study, Physical Review Physics Education Research 21, 010136 (2025)

[25] Z. Chen and T. Wan, Grading explanations of problem-solving process and generating feedback using large language models at human-level accuracy, Physical Review Physics Education Research 21, 010126 (2025)

[26] B. Gregorcic, G. Polverini, and A. Sarlah, ChatGPT as a tool for honing teachers’ Socratic dialogue skills, Physics Education 59, 045005 (2024)

[27] M. A. R. Vasconcelos and R. P. Dos Santos, Enhancing STEM learning with ChatGPT and Bing chat as objects to think with: a case study, Eurasia Journal of Mathematics, Science and Technology Education 19, em2296 (2023)

[28] L. Ding, T. Li, S. Jiang, and A. Gapud, Students’ perceptions of using ChatGPT in a physics class as a virtual tutor, International Journal of Educational Technology in Higher Education 20, 63 (2023)

[29] F. Balabdaoui, N. Dittmann-Domenichini, H. Grosse, C. Schlienger, and G. Kortemeyer, A survey on students’ use of AI at a technical university, Discover Education 3, 51 (2024)

[30] G. Kortemeyer, Could an artificial-intelligence agent pass an introductory physics course?, Phys. Rev. Phys. Educ. Res. 19, 010132 (2023)

[31] K. A. Pimbblet and L. J. Morrell, Can ChatGPT pass a physics degree? making a case for reformation of assessment of undergraduate degrees, European Journal of Physics 46, 015702 (2024)

[32] G. Polverini and B. Gregorcic, Performance of ChatGPT on the test of understanding graphs in kinematics, Phys. Rev. Phys. Educ. Res. 20, 010109 (2024)

[33] G. Kortemeyer, M. Babayeva, G. Polverini, R. Widenhorn, and B. Gregorcic, Multilingual performance of a multimodal artificial intelligence system on multisubject physics concept inventories, Physical Review Physics Education Research 21, 020101 (2025)

[34] Y. Niu and H. Xue, Exercise generation and student cognitive ability research based on ChatGPT and Rasch model, IEEE Access 11, 116695 (2023)

[35] S. Maity, A. Deroy, and S. Sarkar, How effective is GPT-4 Turbo in generating school-level questions from textbooks based on Bloom’s revised taxonomy?, arXiv preprint arXiv:2406.15211 (2024)

[36] P. A. Kirschner, J. Sweller, and R. E. Clark, Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching, Educational Psychologist 41, 75 (2006)

[37] S. El-Adawy, I. Liao, V. Lad, M. Abdelhafez, and P. Dourmashkin, Streamlining physics problem generation to support physics teachers in using generative artificial intelligence, The Physics Teacher 62, 595 (2024)

[38] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in neural information processing systems 33, 9459 (2020)

[39] M. Scott, T. Stelzer, and G. Gladding, Evaluating multiple-choice exams in large introductory physics courses, Physical Review Special Topics Physics Education Research 2, 020102 (2006)
[40] W. Fakcharoenphol, E. Potter, and T. Stelzer, What students learn when studying physics practice exam problems, Physical Review Special Topics – Physics Education Research 7, 010107 (2011)

[41] R. Gautreau and L. Novemsky, Concepts first – a small group approach to physics learning, American journal of Physics 65, 418 (1997)

[42] W. Fakcharoenphol and T. Stelzer, Physics exam preparation: A comparison of three methods, Physical Review Special Topics – Physics Education Research 10, 010108 (2014)

[43] M. Rodriguez and G. Potvin, Frequent small group interactions improve student learning gains in physics: Results from a nationally representative pre-post study of four-year colleges, Physical Review Physics Education Research 17, 020131 (2021)

[44] M. J. Gierl, O. Bulut, Q. Guo, and X. Zhang, Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review, Review of educational research 87, 1082 (2017)

[45] T. F. Scott and D. Schumayer, Central distractors in force concept inventory data, Physical review physics education research 14, 010106 (2018)

[46] P. Heller, R. Keith, and S. Anderson, Teaching problem solving through cooperative grouping, American journal of physics 60, 627 (1992)

[47] A. R. Mota, N. Didiş Körhasan, K. Miller, and E. Mazur, Homework as a metacognitive tool in an undergraduate physics course, Physical Review Physics Education Research 15, 010136 (2019)

[48] C. Walkington and M. L. Bernacki, Personalizing algebra to students’ individual interests in an intelligent tutoring system: Moderators of impact, International Journal of Artificial Intelligence in Education 29, 58 (2019)

[49] G. Taasoobshirazi and M. Carr, A review and critique of context-based physics instruction and assessment, Educational Research Review 3, 155 (2008)

[50] Z. Dulger and F. Ogan-Bekiroglu, Students’ metacognition knowledge and skills during physics problem-solving process, Physical Review Physics Education Research 21, 020106 (2025)

[51] C. Harrison, C. P. Constantinou, C. F. Correia, M. Grangeat, M. Hähkiöniemi, M. Livitzis, P. Nieminen, N. Papadouris, E. Rached, N. Serret, et al., Assessment on-the-fly: Promoting and collecting evidence of learning through dialogue, Transforming assessment: Through an interplay between practice, research and policy, 83 (2018)

[52] D. Sadigh, S. A. Seshia, and M. Gupta, Automating exercise generation: A step towards meeting the MOOC challenge for embedded systems, in Proceedings of the workshop on embedded and cyber-physical systems education (Association for Computing Machinery, New York, NY, 2012) pp. 1–8

[53] M. N. Demaidi, M. M. Gaber, and N. Filer, Evaluating the quality of the ontology-based auto-generated questions, Smart Learning Environments 4, 10.1186/s40561-017-0046-6 (2017)

[54] V. Nentwich, N. Fischer, A. C. Sonnenbichler, and A. Geyer-Schulz, Computer aided exercise generation – a framework for human interaction in the automated exercise generation process, in Proceedings of the 13th International Joint Conference on e-Business and Telecommunications (SCITEPRESS, Setúbal, Portugal, 2016) pp. 57–63

[55] I. Aldabe, M. L. De Lacalle, M. Maritxalar, E. Martinez, and L. Uria, Arikiturri: an automatic question generator based on corpora and NLP techniques, in Intelligent Tutoring Systems: 8th International Conference, ITS 2006, Jhongli, Taiwan, June 26-30, 2006. Proceedings 8 (Springer, New York, NY, 2006) pp. 584–594

[56] T. Freitas, Á. Neto, M. J. Pereira, and P. Henriques, NLP/AI based techniques for programming exercises generation, in 4th International Computer Programming Education Conference (ICPEC 2023), Vol. 112 (2023)

[57] G. Chen, J. Yang, C. Hauff, and G.-J. Houben, LearningQ: a large-scale dataset for educational question generation, in Proceedings of the international AAAI conference on web and social media, Vol. 12 (Association for the Advancement of Artificial Intelligence, Washington, DC, 2018)
[58] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021)

[59] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, Measuring mathematical problem solving with the MATH dataset, arXiv preprint arXiv:2103.03874 (2021)

[60] J. Macina, N. Daheim, S. P. Chowdhury, T. Sinha, M. Kapur, I. Gurevych, and M. Sachan, Mathdial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems, arXiv preprint arXiv:2305.14536 (2023)

[61] J. Doughty, Z. Wan, A. Bompelli, J. Qayum, T. Wang, J. Zhang, Y. Zheng, A. Doyle, P. Sridhar, A. Agarwal, et al., A comparative study of AI-generated (gpt-4) and human-crafted MCQs in programming education, in Proceedings of the 26th Australasian Computing Education Conference (Association for Computing Machinery, New York, NY, 2024) pp. 114–123

[62] G. Kortemeyer, J. Nöhl, and D. Onishchuk, Grading assistance for a handwritten thermodynamics exam using artificial intelligence: An exploratory study, Physical Review Physics Education Research 20, 020144 (2024)

[63] M. X. Liu, F. Liu, A. J. Fiannaca, T. Koo, L. Dixon, M. Terry, and C. J. Cai, “We need structured output”: Towards user-centered constraints on large language model output, in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (Association for Computing Machinery, New York, NY,

[64] R. V. Hogg, E. A. Tanis, and D. L. Zimmerman, Probability and Statistical Inference, 9th ed. (Pearson, Boston, MA, 2019) section 5.6, p. 202

[65] L. A. Orawo, Confidence intervals for the binomial proportion: A comparison of four methods, Open Journal of Statistics 11, 806 (2021)

[66] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, Survey of hallucination in natural language generation, ACM computing surveys 55, 1 (2023)

[67] C. Spearman, The proof and measurement of association between two things, The American Journal of Psychology 15, 72 (1904)

[68] T. M. Fruchterman and E. M. Reingold, Graph drawing by force-directed placement, Software: Practice and experience 21, 1129 (1991)

[69] OpenAI, OpenAI, https://openai.com/ (accessed July 2025)

[70] Microsoft, Azure AI Services, https://azure.microsoft.com/en-us/products/ai-services (accessed June 2024)

[71] OpenAI, Hello GPT-4o, https://openai.com/index/hello-gpt-4o/ (accessed June 2024)

[72] OpenAI, GPT-4o mini: advancing cost-efficient intelligence, https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (accessed February 2025)

[73] OpenAI, OpenAI o3-mini, https://openai.com/index/openai-o3-mini/ (accessed February 2025)

[74] L. Breiman, Random forests, Machine learning 45, 5 (2001)

[75] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35, 24824 (2022)

[1]

30-sample rule

Skewedness. Although we collected a total of N = 543 generated exercises, the class distribution for some metrics was highly skewed; indeed, certain classes C_i were underrepresented or even absent. To ensure that per-class estimates (e.g., precision or recall for each C_i) have sufficiently low sampling variability before treating them as definitive...

[2]

problem description quality

Interrelatedness. As Fig. 5 shows, several metrics are correlated with each other [61]. This may allow us to later use a more reliable metric as a proxy for relevant but less reliable metrics. Added here is the variable chosen, set if the student chose to work on this problem in the forced choice shown in Fig. 1. Three highly significant (p < 0.001)...

[3]

includes-solution-strategy in its own right and as a proxy for depth measures,

[4]

llm-solution-is-correct,

[5]

task-is-specific-and-complete,

[6]

measurement-unit-is-clearly-stated.

Based on our findings in Table V, the first two metrics are best handled with a reasoning model, while the remaining two tasks can be handled by a cheaper, lower-latency non-reasoning model.

V. LIMITATIONS

The evidence assembled here, while encouraging, must be interpreted with caution because several structural const...

[7]

F. Reif, J. H. Larkin, and G. C. Brackett, Teaching general learning and problem-solving skills, American Journal of Physics 44, 212 (1976)

[8]

A. B. Arons, Teaching introductory physics (John Wiley & Sons, Hoboken, NJ, 1996)

[9]

National Research Council et al., How people learn: Brain, mind, experience, and school: Expanded edition (National Academies Press, Washington, DC, 2000)

[10]

R. J. Dufresne and W. J. Gerace, Assessing-to-learn: Formative assessment in physics instruction, The Physics Teacher 42, 428 (2004)

[11]

C. Wagner and A. Vaterlaus, Promoting formative assessment in high school teaching of physics, Lat. Am. J. Phys. Educ 1, 6 (2012)

[12]

D. A. Kashy, G. Albertelli, W. Bauer, E. Kashy, and M. Thoennessen, Influence of non-moderated and moderated discussion sites on student success, Journal of Asynchronous Learning Networks 7, 31 (2003)

[13]

G. Kortemeyer, E. Kashy, W. Benenson, and W. Bauer, Experiences using the open-source learning content management and assessment system LON-CAPA in introductory physics courses, Am. J. Phys 76, 438 (2008)

[14]

J. Risley, Motivating students to learn physics using an online homework system, Newsletter of the APS Forum on Education Fall, 3 (2001)

[15]

a. A. T. Guoqing Tang, Increasing student’s time on task in calculus and general physics courses through webassign, in ASEE Annual Conference and Exposition (2002) pp. 7.660.1–7.660.20

[16]

S. W. Bonham, D. L. Deardorff, and R. J. Beichner, Comparison of student performance using web and paper-based homework in college-level physics, Journal of research in science teaching 40, 1050 (2003)

[17]

B. Gutmann, G. Gladding, M. Lundsgaard, and T. Stelzer, Mastery-style homework exercises in introductory physics courses: Implementation matters, Phys. Rev. Phys. Educ. Res. 14, 010128 (2018)

[18]

A. Sperling and J. Lincoln, Artificial intelligence and high school physics, The Physics Teacher 62, 314 (2024)

[19]

P. Wattanakasiwich, K. Kaewkhong, and D. Katwibun, Physics instructors’ acceptance and implementation of generative AI, Physical Review Physics Education Research 21, 010155 (2025)

[20]

S. Küchemann, S. Steinert, N. Revenga, M. Schweinberger, Y. Dinc, K. E. Avila, and J. Kuhn, Can ChatGPT support prospective teachers in physics task development?, Phys. Rev. Phys. Educ. Res. 19, 020128 (2023)

[21]

J. Lademann, J. Henze, and S. Becker-Genschow, Augmenting learning environments using AI custom chatbots: Effects on learning performance, cognitive load, and affective variables, Physical Review Physics Education Research 21, 010147 (2025)

[22]

P. Bitzenbauer, ChatGPT in physics education: A pilot study on easy-to-implement activities, Contemporary Educational Technology 15, ep430 (2023)

[23]

G. Kortemeyer, Ethel: A virtual teaching assistant, Phys. Teach. 62, 698 (2024)

[24]

G. Kortemeyer and J. Nöhl, Assessing confidence in AI-assisted grading of physics exams through psychometrics: An exploratory study, Physical Review Physics Education Research 21, 010136 (2025)

[25]

Z. Chen and T. Wan, Grading explanations of problem-solving process and generating feedback using large language models at human-level accuracy, Physical Review Physics Education Research 21, 010126 (2025)

[26]

B. Gregorcic, G. Polverini, and A. Sarlah, ChatGPT as a tool for honing teachers’ Socratic dialogue skills, Physics Education 59, 045005 (2024)

[27]

M. A. R. Vasconcelos and R. P. Dos Santos, Enhancing STEM learning with ChatGPT and Bing chat as objects to think with: a case study, Eurasia Journal of Mathematics, Science and Technology Education 19, em2296 (2023)

[28]

L. Ding, T. Li, S. Jiang, and A. Gapud, Students’ perceptions of using ChatGPT in a physics class as a virtual tutor, International Journal of Educational Technology in Higher Education 20, 63 (2023)

[29]

F. Balabdaoui, N. Dittmann-Domenichini, H. Grosse, C. Schlienger, and G. Kortemeyer, A survey on students’ use of AI at a technical university, Discover Education 3, 51 (2024)

[30]

G. Kortemeyer, Could an artificial-intelligence agent pass an introductory physics course?, Phys. Rev. Phys. Educ. Res. 19, 010132 (2023)

[31]

K. A. Pimbblet and L. J. Morrell, Can ChatGPT pass a physics degree? making a case for reformation of assessment of undergraduate degrees, European Journal of Physics 46, 015702 (2024)

[32]

G. Polverini and B. Gregorcic, Performance of ChatGPT on the test of understanding graphs in kinematics, Phys. Rev. Phys. Educ. Res. 20, 010109 (2024)

[33]

G. Kortemeyer, M. Babayeva, G. Polverini, R. Widenhorn, and B. Gregorcic, Multilingual performance of a multimodal artificial intelligence system on multisubject physics concept inventories, Physical Review Physics Education Research 21, 020101 (2025)

[34]

Y. Niu and H. Xue, Exercise generation and student cognitive ability research based on ChatGPT and Rasch model, IEEE Access 11, 116695 (2023)

[35]

S. Maity, A. Deroy, and S. Sarkar, How effective is GPT-4 Turbo in generating school-level questions from textbooks based on Bloom’s revised taxonomy?, arXiv preprint arXiv:2406.15211 (2024)

[36]

P. A. Kirschner, J. Sweller, and R. E. Clark, Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching, Educational Psychologist 41, 75 (2006)

[37]

S. El-Adawy, I. Liao, V. Lad, M. Abdelhafez, and P. Dourmashkin, Streamlining physics problem generation to support physics teachers in using generative artificial intelligence, The Physics Teacher 62, 595 (2024)

[38]

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in neural information processing systems 33, 9459 (2020)

[39]

M. Scott, T. Stelzer, and G. Gladding, Evaluating multiple-choice exams in large introductory physics courses, Physical Review Special Topics – Physics Education Research 2, 020102 (2006)

[40]

W. Fakcharoenphol, E. Potter, and T. Stelzer, What students learn when studying physics practice exam problems, Physical Review Special Topics – Physics Education Research 7, 010107 (2011)

[41]

R. Gautreau and L. Novemsky, Concepts first – a small group approach to physics learning, American Journal of Physics 65, 418 (1997)

[42]

W. Fakcharoenphol and T. Stelzer, Physics exam preparation: A comparison of three methods, Physical Review Special Topics – Physics Education Research 10, 010108 (2014)

[43]

M. Rodriguez and G. Potvin, Frequent small group interactions improve student learning gains in physics: Results from a nationally representative pre-post study of four-year colleges, Physical Review Physics Education Research 17, 020131 (2021)

[44]

M. J. Gierl, O. Bulut, Q. Guo, and X. Zhang, Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review, Review of educational research 87, 1082 (2017)

[45]

T. F. Scott and D. Schumayer, Central distractors in force concept inventory data, Physical Review Physics Education Research 14, 010106 (2018)

[46]

P. Heller, R. Keith, and S. Anderson, Teaching problem solving through cooperative grouping, American Journal of Physics 60, 627 (1992)

[47]

A. R. Mota, N. Didiş Körhasan, K. Miller, and E. Mazur, Homework as a metacognitive tool in an undergraduate physics course, Physical Review Physics Education Research 15, 010136 (2019)

[48]

C. Walkington and M. L. Bernacki, Personalizing algebra to students’ individual interests in an intelligent tutoring system: Moderators of impact, International Journal of Artificial Intelligence in Education 29, 58 (2019)

[49]

G. Taasoobshirazi and M. Carr, A review and critique of context-based physics instruction and assessment, Educational Research Review 3, 155 (2008)

[50]

Z. Dulger and F. Ogan-Bekiroglu, Students’ metacognition knowledge and skills during physics problem-solving process, Physical Review Physics Education Research 21, 020106 (2025)

[51]

C. Harrison, C. P. Constantinou, C. F. Correia, M. Grangeat, M. Hähkiöniemi, M. Livitzis, P. Nieminen, N. Papadouris, E. Rached, N. Serret, et al., Assessment on-the-fly: Promoting and collecting evidence of learning through dialogue, Transforming assessment: Through an interplay between practice, research and policy, 83 (2018)

[52]

D. Sadigh, S. A. Seshia, and M. Gupta, Automating exercise generation: A step towards meeting the MOOC challenge for embedded systems, in Proceedings of the workshop on embedded and cyber-physical systems education (Association for Computing Machinery, New York, NY, 2012) pp. 1–8

[53]

M. N. Demaidi, M. M. Gaber, and N. Filer, Evaluating the quality of the ontology-based auto-generated questions, Smart Learning Environments 4, 10.1186/s40561-017-0046-6 (2017)

[54]

V. Nentwich, N. Fischer, A. C. Sonnenbichler, and A. Geyer-Schulz, Computer aided exercise generation – a framework for human interaction in the automated exercise generation process, in Proceedings of the 13th International Joint Conference on e-Business and Telecommunications (SCITEPRESS, Setúbal, Portugal, 2016) pp. 57–63

[55]

I. Aldabe, M. L. De Lacalle, M. Maritxalar, E. Martinez, and L. Uria, Arikiturri: an automatic question generator based on corpora and NLP techniques, in Intelligent Tutoring Systems: 8th International Conference, ITS 2006, Jhongli, Taiwan, June 26-30, 2006. Proceedings 8 (Springer, New York, NY, 2006) pp. 584–594

[56]

T. Freitas, Á. Neto, M. J. Pereira, and P. Henriques, NLP/AI based techniques for programming exercises generation, in 4th International Computer Programming Education Conference (ICPEC 2023), Vol. 112 (2023)

[57]

G. Chen, J. Yang, C. Hauff, and G.-J. Houben, LearningQ: a large-scale dataset for educational question generation, in Proceedings of the international AAAI conference on web and social media, Vol. 12 (Association for the Advancement of Artificial Intelligence, Washington, DC, 2018)

[58]

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021)

[59]

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, Measuring mathematical problem solving with the MATH dataset, arXiv preprint arXiv:2103.03874 (2021)

[60]

J. Macina, N. Daheim, S. P. Chowdhury, T. Sinha, M. Kapur, I. Gurevych, and M. Sachan, Mathdial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems, arXiv preprint arXiv:2305.14536 (2023)

[61]

J. Doughty, Z. Wan, A. Bompelli, J. Qayum, T. Wang, J. Zhang, Y. Zheng, A. Doyle, P. Sridhar, A. Agarwal, et al., A comparative study of AI-generated (GPT-4) and human-crafted MCQs in programming education, in Proceedings of the 26th Australasian Computing Education Conference (Association for Computing Machinery, New York, NY, 2024) pp. 114–123

[62]

G. Kortemeyer, J. Nöhl, and D. Onishchuk, Grading assistance for a handwritten thermodynamics exam using artificial intelligence: An exploratory study, Physical Review Physics Education Research 20, 020144 (2024)

[63]

M. X. Liu, F. Liu, A. J. Fiannaca, T. Koo, L. Dixon, M. Terry, and C. J. Cai, “We need structured output”: Towards user-centered constraints on large language model output, in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (Association for Computing Machinery, New York, NY,

[64]

R. V. Hogg, E. A. Tanis, and D. L. Zimmerman, Probability and Statistical Inference, 9th ed. (Pearson, Boston, MA, 2019) section 5.6, p. 202

[65]

L. A. Orawo, Confidence intervals for the binomial proportion: A comparison of four methods, Open Journal of Statistics 11, 806 (2021)
