arxiv: 2508.03085 · v2 · submitted 2025-08-05 · ⚛️ physics.ed-ph

When AI Evaluates Its Own Work: Validating Learner-Initiated, AI-Generated Physics Practice Problems

Pith reviewed 2026-05-19 01:25 UTC · model grok-4.3

classification ⚛️ physics.ed-ph
keywords AI-generated problems · physics education · formative assessment · LLM evaluation · student preferences · quality metrics · random forest · scalable assessment

The pith

Only a curated subset of quality checks is needed to validate AI-generated physics practice problems for student appeal and soundness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models can generate physics practice problems instantly, but ensuring they are useful requires vetting. This study had students create and try hundreds of such problems, with experts rating them on many attributes and students indicating preferences between pairs. By comparing how well different AI models match expert ratings and using machine learning to find which ratings predict student likes, the authors found that a small number of checks cover what matters. This means real-time generated exercises can be made reliable without checking everything, opening the door to more personalized practice in physics classes.

Core claim

The central discovery is that student preferences for AI-generated physics problems are reliably addressed by a small subset of structural and learner-visible quality attributes, as determined through expert labeling, LLM benchmarking, random-forest prediction of choices, and exit surveys. This shows that exhaustive scoring is unnecessary for scalable formative assessment.

What carries the argument

Random-forest models identify which expert-labeled quality attributes predict student preferences, while LLM judges are benchmarked against the same expert labels to show which of those attributes can be checked automatically.

If this is right

  • Scalable formative assessment in physics becomes feasible without exhaustive expert scoring.
  • A practical blueprint supports deploying real-time AI-generated practice problems.
  • Both technical soundness and user appeal are maintained by structural and learner-visible checks.
  • The approach may carry over to other quantitative disciplines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These minimal checks could be embedded in the chatbot to filter problems before students see them.
  • The same reduced set of attributes might apply to AI-generated problems in mathematics or chemistry.
  • Repeating the preference trials with students at different levels could test broader validity.
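
One way such a pre-display filter might look is sketched here in Python. It is only an illustration: the four metric names are taken from the excerpts extracted in the reference graph, and the split between a reasoning judge and a cheaper non-reasoning judge follows the excerpt's recommendation. Everything else (the `Judge` callable, `vet_problem`, `is_acceptable`, and the choice to run the cheap checks first) is an assumption, not the authors' implementation.

```python
from typing import Callable, Dict

# Metric names as they appear in the extracted excerpts. The excerpt suggests
# a reasoning model for the first two and a cheaper model for the other two.
REASONING_METRICS = ("includes-solution-strategy", "llm-solution-is-correct")
FAST_METRICS = ("task-is-specific-and-complete", "measurement-unit-is-clearly-stated")

# Hypothetical judge interface: (metric name, problem text) -> pass/fail.
Judge = Callable[[str, str], bool]


def vet_problem(problem: str, reasoning_judge: Judge, fast_judge: Judge) -> Dict[str, bool]:
    """Run the cheap checks first and skip the costly ones if any cheap check fails."""
    results: Dict[str, bool] = {}
    for metric in FAST_METRICS:
        results[metric] = fast_judge(metric, problem)
    if not all(results.values()):
        return results  # assumption: no point paying for the reasoning model
    for metric in REASONING_METRICS:
        results[metric] = reasoning_judge(metric, problem)
    return results


def is_acceptable(results: Dict[str, bool]) -> bool:
    """A problem is shown only if all four checks ran and passed."""
    expected = set(FAST_METRICS) | set(REASONING_METRICS)
    return set(results) == expected and all(results.values())
```

In a real deployment each judge would wrap an LLM call. Here they are plain callables so the control flow can be tested without any network access.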

Load-bearing premise

Expert labels on the quality attributes are treated as ground truth without systematic biases or missing key pedagogical dimensions.

What would settle it

A new cohort of students whose preferences contradict the random-forest selected minimal attributes would show the core checks are not sufficient.
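
A replication of this kind could be scored with something as simple as the following sketch. It deliberately swaps the paper's random-forest models for a much cruder stand-in: predict that the student picks the problem passing more of the core checks, then count how often that matches the recorded choice. The `Pair` layout and both function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record: (checks for problem A, checks for problem B, chosen "a" or "b").
Pair = Tuple[Dict[str, bool], Dict[str, bool], str]


def predicted_choice(a: Dict[str, bool], b: Dict[str, bool]) -> Optional[str]:
    """Predict the item with more passed checks; None when the counts tie."""
    score_a, score_b = sum(a.values()), sum(b.values())
    if score_a == score_b:
        return None
    return "a" if score_a > score_b else "b"


def agreement_rate(pairs: List[Pair]) -> Tuple[float, int]:
    """Return (fraction of decided pairs matching the student, number of decided pairs)."""
    decided = 0
    hits = 0
    for a, b, chosen in pairs:
        guess = predicted_choice(a, b)
        if guess is None:
            continue  # tied pairs carry no signal for this simple rule
        decided += 1
        if guess == chosen:
            hits += 1
    rate = hits / decided if decided else float("nan")
    return rate, decided
```

An agreement rate near 0.5 on a fresh cohort would suggest the core checks carry little information about preference. The number of decided pairs should be reported alongside the rate, since ties are dropped.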

Figures

Figures reproduced from arXiv: 2508.03085 by Gerd Kortemeyer, Tobias Geisler.

Figure 1. Two interactive problems generated on-demand, showcasing how course-specific contextual information, in this case the …
Figure 2. Attempting to solve the selected problem from Fig. 1.
Figure 3. Information flow in an enhanced chatbot that can generate verified problems on-the-fly and on-demand. The “exercise …
Figure 4. Example of the definition for …
Figure 5. Correlations between the metrics in a force-directed …
Original abstract

Large language models (LLMs) can now generate physics practice problems in real time, yet the educational value of these items hinges on rapid, reliable post-generation vetting. In this exploratory study, we investigated which automated checks are both technically feasible and pedagogically meaningful when exercises are produced on demand within a chatbot interface. A cohort of 34 introductory-physics students generated and attempted 543 practice problems during exam preparation. Each item was labeled by an expert on a wide range of quality attributes and presented to the learners in pairs to record their preference. We then (i) benchmarked three commodity LLMs as “judges” against the expert labels, (ii) quantified which attributes predict student choice via random-forest models, and (iii) triangulated these results with free-form exit surveys. Only a small subset of the original metric items proved necessary to reliably address student preferences either directly or by proxy. The study demonstrates that scalable formative assessment does not require exhaustive scoring: a carefully curated core of structural and learner-visible checks is sufficient to ensure both technical soundness and user appeal. The findings provide a practical blueprint for deploying real-time, AI-generated practice in physics and other quantitative disciplines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript reports an exploratory study in which 34 introductory-physics students generated and attempted 543 AI-produced practice problems. Each item received expert labels on a broad set of quality attributes, was presented to learners in pairs to elicit preference data, and was used to benchmark three commodity LLMs as judges, to train random-forest models identifying which attributes predict student choice, and to triangulate with free-form exit surveys. The central claim is that only a small curated subset of the original attributes is required to ensure both technical soundness and learner appeal, thereby showing that scalable formative assessment need not rely on exhaustive scoring.

Significance. If the central claim holds, the work supplies a practical blueprint for real-time vetting of learner-initiated, AI-generated physics exercises. Its strengths include the use of held-out preference data for random-forest feature selection, triangulation across expert labels, LLM judgments, and student surveys on a sizable corpus of 543 items, and an explicit demonstration that a minimal set of structural and learner-visible checks can substitute for exhaustive metrics. These elements could inform the design of adaptive tutoring systems in quantitative disciplines.

major comments (1)
  1. [Abstract and expert-labeling description] The study treats labels assigned by a single expert as ground truth for both LLM benchmarking and random-forest predictor selection, yet reports no inter-rater reliability statistics, intra-rater consistency checks, or multi-expert comparison. Any systematic bias in these labels directly determines which attributes are retained as the 'necessary' minimal set, so the missing reliability metrics limit how far the sufficiency conclusion, which is load-bearing for the central claim, can be generalized.
minor comments (1)
  1. [Abstract] The abstract omits error bars on reported metrics, cross-validation statistics for the random-forest models, and any quantitative summary of inter-rater agreement.
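
To make the request for error bars concrete, the sketch below computes a Wilson score interval for a proportion such as a judge's agreement rate with the expert on one attribute. The Wilson interval is one common choice for binomial proportions; whether the paper uses it is not visible from the material here, and the function name and default `z` are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

For small or skewed classes the interval is wide, which is the situation in which a point estimate alone would be misleading.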

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our exploratory study. Below we respond point-by-point to the major comment, acknowledging limitations while highlighting the supporting triangulation in our design.

Point-by-point responses
  1. Referee: [Abstract and expert-labeling description] The study treats labels assigned by a single expert as ground truth for both LLM benchmarking and random-forest predictor selection, yet reports no inter-rater reliability statistics, intra-rater consistency checks, or multi-expert comparison. Any systematic bias in these labels directly determines which attributes are retained as the 'necessary' minimal set, so the missing reliability metrics limit how far the sufficiency conclusion, which is load-bearing for the central claim, can be generalized.

    Authors: We agree that single-expert labeling constitutes a limitation for generalizability. The 543 items were labeled by one experienced physics educator to ensure internal consistency within this pilot-scale study; resource constraints precluded multi-rater data collection or formal reliability statistics. Nevertheless, the retained minimal attribute set was validated through independent held-out student preference data in the random-forest models and cross-checked against free-form exit surveys, providing convergent evidence beyond the expert labels alone. We will add an explicit limitations subsection in the revised manuscript discussing single-rater bias and recommending multi-expert replication in future work.

    Revision: partial
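
If a second rater labeled even a subsample of the items, agreement on each attribute could be summarized with Cohen's kappa. The following is a minimal standard-library sketch of that statistic. The function name and input layout are illustrative assumptions, and nothing here is drawn from the paper's own analysis.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for chance, which matters for highly skewed attributes where two raters can agree often simply because almost every item falls into one class.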

Circularity Check

0 steps flagged

Empirical study relies on independent expert labels, student preference data, and external LLM benchmarking with no self-referential derivation

full rationale

The paper describes an empirical workflow: students generate problems, an expert assigns quality labels, students record pairwise preferences, LLMs are benchmarked directly against the expert labels, random-forest models are trained on the preference data to identify predictive attributes, and results are triangulated with exit surveys. None of these steps constitutes a derivation that reduces to its own inputs by construction, a fitted parameter renamed as a prediction of the same quantity, or a load-bearing claim justified solely by self-citation. The central conclusion—that a curated subset of attributes suffices—emerges from cross-validation across distinct data sources (expert labels, student choices, survey responses) rather than from any internal redefinition or circular fitting. This is a standard self-contained empirical analysis against external benchmarks.
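
The judge-benchmarking step in this workflow can be illustrated with a short sketch that compares one LLM judge's labels against the expert's, class by class. The `min_support` default of 30 echoes the "30-sample rule" heading in the extracted excerpts. How the paper applies that rule is not visible here, so the threshold logic, the function name, and the report layout are all assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def per_class_report(
    expert: Sequence[Hashable],
    judge: Sequence[Hashable],
    min_support: int = 30,
) -> Dict[Hashable, Dict[str, object]]:
    """Per-class precision and recall of judge labels against expert labels."""
    if len(expert) != len(judge):
        raise ValueError("expert and judge must label the same items")
    true_pos, predicted, gold = Counter(), Counter(), Counter()
    for e, j in zip(expert, judge):
        gold[e] += 1
        predicted[j] += 1
        if e == j:
            true_pos[e] += 1
    report: Dict[Hashable, Dict[str, object]] = {}
    for label in sorted(set(gold) | set(predicted), key=str):
        report[label] = {
            # None marks an undefined ratio (class never predicted or never present).
            "precision": true_pos[label] / predicted[label] if predicted[label] else None,
            "recall": true_pos[label] / gold[label] if gold[label] else None,
            "support": gold[label],
            # Assumption: estimates on fewer expert-labeled items are flagged, not dropped.
            "reliable": gold[label] >= min_support,
        }
    return report
```

Flagging rather than discarding low-support classes keeps the skew visible to the reader, which seems in the spirit of the excerpt's concern about underrepresented classes.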

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is an empirical user study whose central claim depends on the validity of expert annotations as ground truth and on the assumption that paired student preferences serve as a meaningful proxy for educational value.

axioms (2)
  • domain assumption: Expert labels on quality attributes accurately capture pedagogically relevant dimensions of problem quality.
    These labels are used both to benchmark the three commodity LLMs and to train the random-forest models that predict student choice.
  • domain assumption: Student preferences recorded in paired comparisons reflect genuine differences in perceived usefulness for exam preparation.
    The random-forest analysis and the claim that a small subset of checks addresses student preferences rest on this interpretation of the preference data.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 2 internal anchors

  1. [1]

    30-sample rule

    Skewedness. Although we collected a total of N = 543 generated exercises, the class distribution for some metrics was highly skewed; indeed, certain classes C_i were underrepresented or even absent. To ensure that per-class estimates (e.g., precision or recall for each C_i) have sufficiently low sampling variability before treating them as definitive...

  2. [2]

    problem description quality

    Interrelatedness. As Fig. 5 shows, several metrics are correlated with each other [61]. This may allow us to later use a metric that is more reliable as a proxy for relevant, but less reliable, metrics. Added here is the variable "chosen", set if the student chose to work on this problem in the forced choice shown in Fig. 1. Three highly significant (p < 0.001)...

  3. [3]

    includes-solution-strategy in its own right and as a proxy for depth measures,

  4. [4]

    llm-solution-is-correct,

  5. [5]

    task-is-specific-and-complete,

  6. [6]

    measurement-unit-is-clearly-stated. Based on our findings in Table V, the first two metrics are best handled with a reasoning model, while the remaining two tasks can be handled by a cheaper, lower-latency non-reasoning model. V. LIMITATIONS The evidence assembled here, while encouraging, must be interpreted with caution because several structural const...

  7. [7]

    F. Reif, J. H. Larkin, and G. C. Brackett, Teaching general learning and problem-solving skills, American Journal of Physics 44, 212 (1976)

  8. [8]

    A. B. Arons, Teaching introductory physics (John Wiley & Sons, Hoboken, NJ, 1996)

  9. [9]

    N. R. Council et al., How people learn: Brain, mind, experience, and school: Expanded edition (National Academies Press, Washington, DC, 2000)

  10. [10]

    R. J. Dufresne and W. J. Gerace, Assessing-to-learn: Formative assessment in physics instruction, The Physics Teacher 42, 428 (2004)

  11. [11]

    C. Wagner and A. Vaterlaus, Promoting formative assessment in high school teaching of physics, Lat. Am. J. Phys. Educ 1, 6 (2012)

  12. [12]

    D. A. Kashy, G. Albertelli, W. Bauer, E. Kashy, and M. Thoennessen, Influence of non-moderated and moderated discussion sites on student success, Journal of Asynchronous Learning Networks 7, 31 (2003)

  13. [13]

    G. Kortemeyer, E. Kashy, W. Benenson, and W. Bauer, Experiences using the open-source learning content management and assessment system LON-CAPA in introductory physics courses, Am. J. Phys 76, 438 (2008)

  14. [14]

    J. Risley, Motivating students to learn physics using an online homework system, Newsletter of the APS Forum on Education Fall, 3 (2001)

  15. [15]

    a. A. T. Guoqing Tang, Increasing student’s time on task in calculus and general physics courses through webassign, in ASEE Annual Conference and Exposition (2002) pp. 7.660.1 – 7.660.20

  16. [16]

    S. W. Bonham, D. L. Deardorff, and R. J. Beichner, Comparison of student performance using web and paper-based homework in college-level physics, Journal of research in science teaching 40, 1050 (2003)

  17. [17]

    B. Gutmann, G. Gladding, M. Lundsgaard, and T. Stelzer, Mastery-style homework exercises in introductory physics courses: Implementation matters, Phys. Rev. Phys. Educ. Res. 14, 010128 (2018)

  18. [18]

    A. Sperling and J. Lincoln, Artificial intelligence and high school physics, The Physics Teacher 62, 314 (2024)

  19. [19]

    P. Wattanakasiwich, K. Kaewkhong, and D. Katwibun, Physics instructors’ acceptance and implementation of generative AI, Physical Review Physics Education Research 21, 010155 (2025)

  20. [20]

    S. Küchemann, S. Steinert, N. Revenga, M. Schweinberger, Y. Dinc, K. E. Avila, and J. Kuhn, Can ChatGPT support prospective teachers in physics task development?, Phys. Rev. Phys. Educ. Res. 19, 020128 (2023)

  21. [21]

    J. Lademann, J. Henze, and S. Becker-Genschow, Augmenting learning environments using AI custom chatbots: Effects on learning performance, cognitive load, and affective variables, Physical Review Physics Education Research 21, 010147 (2025)

  22. [22]

    P. Bitzenbauer, ChatGPT in physics education: A pilot study on easy-to-implement activities, Contemporary Educational Technology 15, ep430 (2023)

  23. [23]

    G. Kortemeyer, Ethel: A virtual teaching assistant, Phys. Teach. 62, 698 (2024)

  24. [24]

    G. Kortemeyer and J. Nöhl, Assessing confidence in AI-assisted grading of physics exams through psychometrics: An exploratory study, Physical Review Physics Education Research 21, 010136 (2025)

  25. [25]

    Z. Chen and T. Wan, Grading explanations of problem-solving process and generating feedback using large language models at human-level accuracy, Physical Review Physics Education Research 21, 010126 (2025)

  26. [26]

    B. Gregorcic, G. Polverini, and A. Sarlah, ChatGPT as a tool for honing teachers’ Socratic dialogue skills, Physics Education 59, 045005 (2024)

  27. [27]

    M. A. R. Vasconcelos and R. P. Dos Santos, Enhancing STEM learning with ChatGPT and Bing chat as objects to think with: a case study, Eurasia Journal of Mathematics, Science and Technology Education 19, em2296 (2023)

  28. [28]

    L. Ding, T. Li, S. Jiang, and A. Gapud, Students’ perceptions of using ChatGPT in a physics class as a virtual tutor, International Journal of Educational Technology in Higher Education 20, 63 (2023)

  29. [29]

    F. Balabdaoui, N. Dittmann-Domenichini, H. Grosse, C. Schlienger, and G. Kortemeyer, A survey on students’ use of AI at a technical university, Discover Education 3, 51 (2024)

  30. [30]

    G. Kortemeyer, Could an artificial-intelligence agent pass an introductory physics course?, Phys. Rev. Phys. Educ. Res. 19, 010132 (2023)

  31. [31]

    K. A. Pimbblet and L. J. Morrell, Can ChatGPT pass a physics degree? making a case for reformation of assessment of undergraduate degrees, European Journal of Physics 46, 015702 (2024)

  32. [32]

    G. Polverini and B. Gregorcic, Performance of ChatGPT on the test of understanding graphs in kinematics, Phys. Rev. Phys. Educ. Res. 20, 010109 (2024)

  33. [33]

    G. Kortemeyer, M. Babayeva, G. Polverini, R. Widenhorn, and B. Gregorcic, Multilingual performance of a multimodal artificial intelligence system on multisubject physics concept inventories, Physical Review Physics Education Research 21, 020101 (2025)

  34. [34]

    Y. Niu and H. Xue, Exercise generation and student cognitive ability research based on ChatGPT and Rasch model, IEEE Access 11, 116695 (2023)

  35. [35]

    S. Maity, A. Deroy, and S. Sarkar, How effective is GPT-4 Turbo in generating school-level questions from textbooks based on Bloom’s revised taxonomy?, arXiv preprint arXiv:2406.15211 (2024)

  36. [36]

    P. A. Kirschner, J. Sweller, and R. E. Clark, Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching, Educational Psychologist 41, 75 (2006)

  37. [37]

    S. El-Adawy, I. Liao, V. Lad, M. Abdelhafez, and P. Dourmashkin, Streamlining physics problem generation to support physics teachers in using generative artificial intelligence, The Physics Teacher 62, 595 (2024)

  38. [38]

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in neural information processing systems 33, 9459 (2020)

  39. [39]

    M. Scott, T. Stelzer, and G. Gladding, Evaluating multiple-choice exams in large introductory physics courses, Physical Review Special Topics Physics Education Research 2, 020102 (2006)

  40. [40]

    W. Fakcharoenphol, E. Potter, and T. Stelzer, What students learn when studying physics practice exam problems, Physical Review Special Topics – Physics Education Research 7, 010107 (2011)

  41. [41]

    R. Gautreau and L. Novemsky, Concepts first – a small group approach to physics learning, American journal of Physics 65, 418 (1997)

  42. [42]

    W. Fakcharoenphol and T. Stelzer, Physics exam preparation: A comparison of three methods, Physical Review Special Topics – Physics Education Research 10, 010108 (2014)

  43. [43]

    M. Rodriguez and G. Potvin, Frequent small group interactions improve student learning gains in physics: Results from a nationally representative pre-post study of four-year colleges, Physical Review Physics Education Research 17, 020131 (2021)

  44. [44]

    M. J. Gierl, O. Bulut, Q. Guo, and X. Zhang, Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review, Review of educational research 87, 1082 (2017)

  45. [45]

    T. F. Scott and D. Schumayer, Central distractors in force concept inventory data, Physical review physics education research 14, 010106 (2018)

  46. [46]

    P. Heller, R. Keith, and S. Anderson, Teaching problem solving through cooperative grouping, American journal of physics 60, 627 (1992)

  47. [47]

    A. R. Mota, N. Didiş Körhasan, K. Miller, and E. Mazur, Homework as a metacognitive tool in an undergraduate physics course, Physical Review Physics Education Research 15, 010136 (2019)

  48. [48]

    C. Walkington and M. L. Bernacki, Personalizing algebra to students’ individual interests in an intelligent tutoring system: Moderators of impact, International Journal of Artificial Intelligence in Education 29, 58 (2019)

  49. [49]

    G. Taasoobshirazi and M. Carr, A review and critique of context-based physics instruction and assessment, Educational Research Review 3, 155 (2008)

  50. [50]

    Z. Dulger and F. Ogan-Bekiroglu, Students’ metacognition knowledge and skills during physics problem-solving process, Physical Review Physics Education Research 21, 020106 (2025)

  51. [51]

    C. Harrison, C. P. Constantinou, C. F. Correia, M. Grangeat, M. Hähkiöniemi, M. Livitzis, P. Nieminen, N. Papadouris, E. Rached, N. Serret, et al., Assessment on-the-fly: Promoting and collecting evidence of learning through dialogue, Transforming assessment: Through an interplay between practice, research and policy, 83 (2018)

  52. [52]

    D. Sadigh, S. A. Seshia, and M. Gupta, Automating exercise generation: A step towards meeting the MOOC challenge for embedded systems, in Proceedings of the workshop on embedded and cyber-physical systems education (Association for Computing Machinery, New York, NY, 2012) pp. 1–8

  53. [53]

    M. N. Demaidi, M. M. Gaber, and N. Filer, Evaluating the quality of the ontology-based auto-generated questions, Smart Learning Environments 4, 10.1186/s40561-017-0046-6 (2017)

  54. [54]

    V. Nentwich, N. Fischer, A. C. Sonnenbichler, and A. Geyer-Schulz, Computer aided exercise generation: a framework for human interaction in the automated exercise generation process, in Proceedings of the 13th International Joint Conference on e-Business and Telecommunications (SCITEPRESS, Setúbal, Portugal, 2016) pp. 57–63

  55. [55]

    I. Aldabe, M. L. De Lacalle, M. Maritxalar, E. Martinez, and L. Uria, Arikiturri: an automatic question generator based on corpora and NLP techniques, in Intelligent Tutoring Systems: 8th International Conference, ITS 2006, Jhongli, Taiwan, June 26-30, 2006. Proceedings 8 (Springer, New York, NY, 2006) pp. 584–594

  56. [56]

    T. Freitas, Á. Neto, M. J. Pereira, and P. Henriques, NLP/AI based techniques for programming exercises generation, in 4th International Computer Programming Education Conference (ICPEC 2023), Vol. 112 (2023)

  57. [57]

    G. Chen, J. Yang, C. Hauff, and G.-J. Houben, LearningQ: a large-scale dataset for educational question generation, in Proceedings of the international AAAI conference on web and social media, Vol. 12 (Association for the Advancement of Artificial Intelligence, Washington, DC, 2018)

  58. [58]

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021)

  59. [59]

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, Measuring mathematical problem solving with the MATH dataset, arXiv preprint arXiv:2103.03874 (2021)

  60. [60]

    J. Macina, N. Daheim, S. P. Chowdhury, T. Sinha, M. Kapur, I. Gurevych, and M. Sachan, Mathdial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems, arXiv preprint arXiv:2305.14536 (2023)

  61. [61]

    J. Doughty, Z. Wan, A. Bompelli, J. Qayum, T. Wang, J. Zhang, Y. Zheng, A. Doyle, P. Sridhar, A. Agarwal, et al., A comparative study of AI-generated (gpt-4) and human-crafted MCQs in programming education, in Proceedings of the 26th Australasian Computing Education Conference (Association for Computing Machinery, New York, NY, 2024) pp. 114–123

  62. [62]

    G. Kortemeyer, J. Nöhl, and D. Onishchuk, Grading assistance for a handwritten thermodynamics exam using artificial intelligence: An exploratory study, Physical Review Physics Education Research 20, 020144 (2024)

  63. [63]

    M. X. Liu, F. Liu, A. J. Fiannaca, T. Koo, L. Dixon, M. Terry, and C. J. Cai, “We need structured output”: Towards user-centered constraints on large language model output, in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (Association for Computing Machinery, New York, NY,

  64. [64]

    R. V. Hogg, E. A. Tanis, and D. L. Zimmerman, Probability and Statistical Inference, 9th ed. (Pearson, Boston, MA, 2019) section 5.6, p. 202

  65. [65]

    L. A. Orawo, Confidence intervals for the binomial proportion: A comparison of four methods, Open Journal of Statistics 11, 806 (2021)

  66. [66]

    Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, Survey of hallucination in natural language generation, ACM computing surveys 55, 1 (2023)

  67. [67]

    C. Spearman, The proof and measurement of association between two things, The American Journal of Psychology 15, 72 (1904)

  68. [68]

    T. M. Fruchterman and E. M. Reingold, Graph drawing by force-directed placement, Software: Practice and experience 21, 1129 (1991)

  69. [69]

    OpenAI, OpenAI, https://openai.com/ (accessed July 2025)

  70. [70]

    Microsoft, Azure AI Services, https://azure.microsoft.com/en-us/products/ai-services (accessed June 2024)

  71. [71]

    OpenAI, Hello GPT-4o, https://openai.com/index/hello-gpt-4o/ (accessed June 2024)

  72. [72]

    OpenAI, GPT-4o mini: advancing cost-efficient intelligence, https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (accessed February 2025)

  73. [73]

    OpenAI, OpenAI o3-mini, https://openai.com/index/openai-o3-mini/ (accessed February 2025)

  74. [74]

    L. Breiman, Random forests, Machine learning 45, 5 (2001)

  75. [75]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35, 24824 (2022)