Using an LLM to Investigate Students' Explanations on Conceptual Physics Questions

N. Sanjay Rebello; Sean Savage

arxiv: 2508.14823 · v1 · submitted 2025-08-20 · ⚛️ physics.ed-ph

Using an LLM to Investigate Students' Explanations on Conceptual Physics Questions

Sean Savage , N. Sanjay Rebello This is my paper

Pith reviewed 2026-05-18 22:00 UTC · model grok-4.3

classification ⚛️ physics.ed-ph

keywords LLM assessmentphysics education researchstudent explanationsconceptual surveysenergy and momentummisconceptionsGPT-4oopen-ended responses

0 comments

The pith

An LLM can grade students' written physics explanations as accurately as humans and surface misconceptions that multiple-choice tests miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether large language models can evaluate open-ended student responses to conceptual physics questions at scale. It demonstrates that GPT-4o classifies explanations as correct or incorrect with close agreement to human graders, showing only 0-3 percent discrepancy. The model also sorts incorrect explanations into categories that differ from the wrong choices offered in multiple-choice versions of the same questions. This difference matters because multiple-choice formats are common in large college classes but can hide the actual ways students reason about energy and momentum. If the finding holds, instructors could move beyond easy-to-grade tests while still gaining insight into student thinking.

Core claim

GPT-4o was used to assess written explanations on three questions from the Energy and Momentum Conceptual Survey, first classifying them as correct or incorrect and then grouping incorrect responses into emergent categories. The LLM's classifications matched those of human graders within 0-3 percent. The resulting incorrect-explanation categories were distinct from the distractors on the corresponding multiple-choice items, indicating that written responses make different and deeper student conceptions available to educators.

What carries the argument

Prompting GPT-4o to both judge explanation correctness against a rubric and derive emergent categories from incorrect responses, with human grading as validation.

If this is right

Physics instructors could analyze written work from large classes without the usual grading burden and still identify misconceptions not captured by multiple-choice tests.
Conceptual surveys could shift from multiple-choice to open-response formats while remaining practical to score.
Physics education researchers would gain a scalable method for studying student reasoning that goes beyond predefined answer choices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same LLM approach might be tested on other conceptual surveys in physics or in related fields such as chemistry to see if deeper conceptions emerge consistently.
Longitudinal tracking could check whether the new categories predict how students respond to targeted teaching interventions.
Refining the prompts or combining LLM output with small human samples could strengthen reliability for routine classroom use.

Load-bearing premise

The emergent categories of incorrect explanations produced by the LLM reflect genuine patterns in student thinking rather than artifacts of the model's training data or the way the prompt was worded.

What would settle it

A follow-up study that interviews a sample of students about the reasoning behind their written explanations and checks whether those reasons align with the categories the LLM generated would test the claim.

Figures

Figures reproduced from arXiv: 2508.14823 by N. Sanjay Rebello, Sean Savage.

**Figure 2.** Figure 2: FIG. 2: Question 16 from the EMCS (correct choice, C) [ [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: FIG. 3: Question 23 from the EMCS (correct choice, B) [ [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

read the original abstract

Analyzing students' written solutions to physics questions is a major area in PER. However, gauging student understanding in college courses is bottlenecked by large class sizes, which limits assessments to a multiple-choice (MC) format for ease of grading. Although sufficient in quantifying scientifically correct conceptions, MC assessments do not uncover students' deeper ways of understanding physics. Large language models (LLMs) offer a promising approach for assessing students' written responses at scale. Our study used an LLM, validated by human graders, to classify students' written explanations to three questions on the Energy and Momentum Conceptual Survey as correct or incorrect, and organized students' incorrect explanations into emergent categories. We found that the LLM (GPT-4o) can fairly assess students' explanations, comparable to human graders (0-3% discrepancy). Furthermore, the categories of incorrect explanations were different from corresponding MC distractors, allowing for different and deeper conceptions to become accessible to educators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GPT-4o matches humans within 0-3% on classifying explanations but the validation details are too thin to judge the result firmly.

read the letter

The central point is that GPT-4o classifies student explanations on three questions from the Energy and Momentum Conceptual Survey with a 0-3% discrepancy from human graders, and the categories of incorrect explanations it produces differ from the survey's multiple-choice distractors. This specific pairing of LLM output with a published conceptual survey and the direct check against existing distractors is new. Earlier LLM work in physics education research has looked at grading, but the comparison here gives a clearer picture of what extra information open-ended responses can supply beyond forced choices. The paper does a solid job laying out the practical bottleneck: large classes push instructors toward multiple-choice tests that miss nuanced student thinking, and the LLM approach offers one route to scale written-response analysis. The empirical grounding in human comparison and in the mismatch with distractors supplies an external check rather than pure self-reference. The soft spots sit in the methods reporting. The abstract states the discrepancy figure but does not give the number of responses scored, the prompts used, or any measure of agreement between the human graders themselves. Without those numbers the 0-3% claim is difficult to weigh; it could sit inside ordinary human variation. The categorization step also lacks detail on how the emergent groups were stabilized or validated. The stress-test note is accurate on this point. Readers working in physics education research on assessment or conceptual surveys would get the most from it. The work is a practical demonstration rather than a broad theoretical shift, but it shows clear engagement with the literature and a reproducible task. It deserves a serious referee to examine the full methods and data once they are supplied. I would send it out for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript describes the use of GPT-4o to classify students' written explanations to three questions from the Energy and Momentum Conceptual Survey as correct or incorrect, and to organize the incorrect explanations into emergent categories. The central claims are that the LLM assessments show only 0-3% discrepancy with human graders and that the resulting categories differ from the distractors in the corresponding multiple-choice items, thereby surfacing deeper student conceptions.

Significance. If the validation is made rigorous, the work could be significant for physics education research by demonstrating a scalable method for analyzing open-ended responses in large classes. This addresses a longstanding limitation of multiple-choice formats and could enable instructors to access qualitative insights that are currently impractical to obtain at scale. The approach has clear potential to influence both assessment design and the study of student reasoning in PER.

major comments (2)

[Results] The 0-3% discrepancy figure between LLM and human graders is reported without stating the total number of student responses scored, the size of the validation subsample, or any inter-rater agreement statistics among the human graders. Without these quantities the discrepancy cannot be meaningfully interpreted relative to normal human variation.
[Methods] The Methods section provides no information on prompt construction, temperature settings, or few-shot examples used for either the binary classification or the emergent categorization tasks. It is therefore impossible to assess whether the reported categories reflect stable student conceptions or are sensitive to prompt phrasing.

minor comments (2)

[Abstract] The abstract states that three questions were used but does not identify them; adding the specific item numbers or brief descriptions would improve reproducibility.
[Methods] A short table summarizing the exact prompt templates and the number of responses per question would clarify the experimental setup without lengthening the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which identify key areas where additional detail will improve the transparency and interpretability of our work. We address each major comment below and have prepared revisions accordingly.

read point-by-point responses

Referee: [Results] The 0-3% discrepancy figure between LLM and human graders is reported without stating the total number of student responses scored, the size of the validation subsample, or any inter-rater agreement statistics among the human graders. Without these quantities the discrepancy cannot be meaningfully interpreted relative to normal human variation.

Authors: We agree that these quantitative details are necessary to place the reported discrepancy in proper context relative to typical human grading variation. In the revised manuscript we will explicitly state the total number of student responses scored, the size of the validation subsample graded by humans, and the inter-rater agreement statistics (e.g., percentage agreement or Cohen’s kappa) among the human graders. These additions will allow readers to evaluate the 0–3 % figure more rigorously. revision: yes
Referee: [Methods] The Methods section provides no information on prompt construction, temperature settings, or few-shot examples used for either the binary classification or the emergent categorization tasks. It is therefore impossible to assess whether the reported categories reflect stable student conceptions or are sensitive to prompt phrasing.

Authors: We acknowledge that the current Methods section lacks the necessary detail on prompting procedures. In the revision we will expand this section to describe how the prompts were constructed, report the temperature setting used with GPT-4o, and include any few-shot examples provided for the binary classification and emergent categorization tasks. These additions will enable readers to assess the stability of the resulting categories with respect to prompt design. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical validation against human graders and MC distractors

full rationale

The paper is an empirical study that applies GPT-4o to classify student written explanations on conceptual physics items, reports a 0-3% discrepancy with human graders, and compares emergent incorrect-explanation categories to MC distractors. No equations, fitted parameters, or first-principles derivations appear; the central claims rest on direct comparison to external human scoring and existing MC instruments rather than on self-referential definitions or self-citation chains. The work is therefore self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on the assumption that LLM outputs can be treated as proxies for human judgment of student understanding and that emergent categories capture real conceptual differences; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption LLM classifications of student explanations can be validated against human graders with low discrepancy
Invoked when claiming 0-3% discrepancy and using LLM to organize incorrect explanations.

pith-pipeline@v0.9.0 · 5689 in / 1262 out tokens · 27911 ms · 2026-05-18T22:00:40.817896+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We found that the LLM (GPT-4o) can fairly assess students' explanations, comparable to human graders (0-3% discrepancy). Furthermore, the categories of incorrect explanations were different from corresponding MC distractors

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

[1]

Use claim, evidence, and reasoning to com- pare how fast the blocks are moving concerning each other after the collision. Use complete sentences

that students activate in the context of questions in the inventory. Furthermore, prior studies have demonstrated that repeated students’ exposure to distractors (incorrect MC op- tions) strengthens incorrect conceptual associations[12]. MC inventories offer a rich repertoire of questions de- signed to assess students’ conceptual understanding of physics ...

work page
[2]

Hestenes, M

D. Hestenes, M. Wells, and G. Swackhamer, Force concept in- ventory, The Physics Teacher30, 141 (1992)

work page 1992
[3]

R. K. Thornton and D. R. Sokoloff, Assessing student learning of newton’s laws: The force and motion conceptual evaluation and the evaluation of active learning laboratory and lecture cur- ricula, American Journal of Physics 66, 338 (1998)

work page 1998
[4]

Multiple-choice test of energy and momentum concepts

C. Singh and D. Rosengrant, Multiple-choice test of energy and momentum concepts, arXiv preprint arXiv:1602.06497 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Nieswandt and K

M. Nieswandt and K. Bellomo, Written extended-response questions and the assessment of science learning: What do stu- dents’ answers tell us?, International Journal of Science Edu- cation 31, 2117 (2009)

work page 2009
[6]

W. L. Kuechler and M. G. Simkin, How well do multiple choice tests evaluate student understanding in computer programming classes? (2003)

work page 2003
[7]

Petersen, M

A. Petersen, M. Craig, and P. Denny, Employing multiple- answer multiple choice questions, in Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, ITiCSE ’16 (ACM, 2016)

work page 2016
[8]

L. A. Shepard, The role of assessment in a learning culture, Educational Researcher 29, 4 (2000)

work page 2000
[9]

C. Wong, P. Denny, A. Luxton-Reilly, and J. Whalley, The im- pact of multiple choice question design on predictions of per- formance, in Proceedings of the 23rd Australasian Computing Education Conference, ACE ’21 (ACM, 2021)

work page 2021
[10]

E. Wood, N. Klausz, and S. MacNeil, Examining the influence of multiple-choice test formats on student performance, Inno- vative Higher Education 47, 515–531 (2021)

work page 2021
[11]

N. S. Rebello and D. A. Zollman, The effect of distracters on student performance on the force concept inventory, American Journal of Physics 72, 116 (2004)

work page 2004
[12]

Hammer, Student resources for learning introductory physics, American Journal of Physics 68, S52 (2000)

D. Hammer, Student resources for learning introductory physics, American Journal of Physics 68, S52 (2000)

work page 2000
[13]

H. L. Roediger and E. J. Marsh, The positive and negative con- sequences of multiple-choice testing., Journal of Experimental Psychology: Learning, Memory, and Cognition31, 1155–1159 (2005)

work page 2005
[14]

M. Good, E. Marshman, E. Yerushalmi, and C. Singh, Physics teaching assistants’ views of different types of introductory problems: Challenge of perceiving the instructional benefits of context-rich and multiple-choice problems, Physical Review Physics Education Research 15, 020130 (2019)

work page 2019
[15]

Bao and E

L. Bao and E. F. Redish, Model analysis: Representing and assessing the dynamics of student learning, Physical Review Special Topics-Physics Education Research 2, 010103 (2006)

work page 2006
[16]

Aleven, E

V . Aleven, E. A. McLaughlin, and M. Glassman, Ai in educa- tion: A critical review and conceptual framework, Educational Psychologist 57, 145 (2022)

work page 2022
[17]

Munsell, N

J. Munsell, N. S. Rebello, and C. M. Rebello, Using natural language processing to predict student problem solving perfor- mance, in 2021 Physics Education Research Conference Pro- ceedings (2021)

work page 2021
[18]

Casalino, B

G. Casalino, B. Cafarelli, E. del Gobbo, L. Fontanella, L. Grilli, A. Guarino, P. Limone, D. Schicchi, and D. Taibi, Framing au- tomatic grading techniques for open-ended questionnaires re- sponses. a short survey (2021)

work page 2021
[19]

Kortemeyer, Toward ai grading of student problem solutions in introductory physics: A feasibility study, Physical Review Physics Education Research 19, 020163 (2023)

G. Kortemeyer, Toward ai grading of student problem solutions in introductory physics: A feasibility study, Physical Review Physics Education Research 19, 020163 (2023)

work page 2023
[20]

Department of Education, Office of Educational Tech- nology, Artificial Intelligence and the Future of Teaching and Learning: Insights and Recommendations , Tech

U.S. Department of Education, Office of Educational Tech- nology, Artificial Intelligence and the Future of Teaching and Learning: Insights and Recommendations , Tech. Rep. (U.S. Department of Education, 2023)

work page 2023
[21]

Weijers, W

S. Weijers, W. Westera, and M. Wiering, From intuition to un- derstanding: Using ai peers to overcome physics misconcep- tions, arXiv preprint arXiv:2504.00408 (2025)

work page arXiv 2025
[22]

Wang, Physical Review B94, 10.1103/phys- revb.94.195105 (2016)

T. Wan and Z. Chen, Exploring generative ai assisted feedback writing for students’ written responses to a physics conceptual question with prompt engineering and few-shot learning, Phys- ical Review Physics Education Research 20, 10.1103/phys- revphyseducres.20.010152 (2024)

work page doi:10.1103/phys- 2024
[23]

Khan, The amazing ai super tutor for students and teachers, Video

S. Khan, The amazing ai super tutor for students and teachers, Video. TED Conference (2023)

work page 2023
[24]

P. G. Butcher and S. E. Jordan, A comparison of human and computer marking of short free-text student responses, Com- puters & Education 55, 489 (2010)

work page 2010
[25]

H. R. Salim, C. De, N. D. Pratamaputra, and D. Suhartono, Indonesian automatic short answer grading system, Bulletin of Electrical Engineering and Informatics 11, 1586–1603 (2022)

work page 2022
[26]

K. L. McNeill and J. S. Krajcik, Supporting Grade 5-8 Stu- dents in Constructing Explanations in Science: The Claim, Ev- idence, and Reasoning Framework for Talk and Writing(Pear- son, 2011)

work page 2011
[27]

N. F. Afif, M. G. Nugraha, and A. Samsudin, Developing en- ergy and momentum conceptual survey (emcs) with four-tier diagnostic test items, in AIP Conference Proceedings (Au- thor(s), 2017)

work page 2017
[28]

D2L Inc., Brightspace learning management system (2025), accessed May 18, 2025

work page 2025
[29]

OpenAI, Chatgpt, https://chat.openai.com/chat (2025), [Ac- cessed May 2025]

work page 2025
[30]

B. Chen, Z. Zhang, N. Langrené, and S. Zhu, Unleashing the potential of prompt engineering in large language models: a comprehensive review (2023), arXiv:2310.14735

work page arXiv 2023
[31]

K. L. Sainani, Reliability statistics, PM&R 9, 622–628 (2017)

work page 2017
[32]

Latif and X

E. Latif and X. Zhai, Integrating generative ai into stem educa- tion: Enhancing conceptual understanding, addressing miscon- ceptions, and assessing student acceptance, Disciplinary and Interdisciplinary Science Education Research 7, 11 (2025)

work page 2025
[33]

Zhou, S.-M

L. Zhou, S.-M. Kim, and N. Ahmed, Artificial intelligence ap- plications in education: Natural language processing in detect- ing misconceptions, Education and Information Technologies 10.1007/s10639-024-12919-1 (2024). 5

work page doi:10.1007/s10639-024-12919-1 2024

[1] [1]

Use claim, evidence, and reasoning to com- pare how fast the blocks are moving concerning each other after the collision. Use complete sentences

that students activate in the context of questions in the inventory. Furthermore, prior studies have demonstrated that repeated students’ exposure to distractors (incorrect MC op- tions) strengthens incorrect conceptual associations[12]. MC inventories offer a rich repertoire of questions de- signed to assess students’ conceptual understanding of physics ...

work page

[2] [2]

Hestenes, M

D. Hestenes, M. Wells, and G. Swackhamer, Force concept in- ventory, The Physics Teacher30, 141 (1992)

work page 1992

[3] [3]

R. K. Thornton and D. R. Sokoloff, Assessing student learning of newton’s laws: The force and motion conceptual evaluation and the evaluation of active learning laboratory and lecture cur- ricula, American Journal of Physics 66, 338 (1998)

work page 1998

[4] [4]

Multiple-choice test of energy and momentum concepts

C. Singh and D. Rosengrant, Multiple-choice test of energy and momentum concepts, arXiv preprint arXiv:1602.06497 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[5] [5]

Nieswandt and K

M. Nieswandt and K. Bellomo, Written extended-response questions and the assessment of science learning: What do stu- dents’ answers tell us?, International Journal of Science Edu- cation 31, 2117 (2009)

work page 2009

[6] [6]

W. L. Kuechler and M. G. Simkin, How well do multiple choice tests evaluate student understanding in computer programming classes? (2003)

work page 2003

[7] [7]

Petersen, M

A. Petersen, M. Craig, and P. Denny, Employing multiple- answer multiple choice questions, in Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, ITiCSE ’16 (ACM, 2016)

work page 2016

[8] [8]

L. A. Shepard, The role of assessment in a learning culture, Educational Researcher 29, 4 (2000)

work page 2000

[9] [9]

C. Wong, P. Denny, A. Luxton-Reilly, and J. Whalley, The im- pact of multiple choice question design on predictions of per- formance, in Proceedings of the 23rd Australasian Computing Education Conference, ACE ’21 (ACM, 2021)

work page 2021

[10] [10]

E. Wood, N. Klausz, and S. MacNeil, Examining the influence of multiple-choice test formats on student performance, Inno- vative Higher Education 47, 515–531 (2021)

work page 2021

[11] [11]

N. S. Rebello and D. A. Zollman, The effect of distracters on student performance on the force concept inventory, American Journal of Physics 72, 116 (2004)

work page 2004

[12] [12]

Hammer, Student resources for learning introductory physics, American Journal of Physics 68, S52 (2000)

D. Hammer, Student resources for learning introductory physics, American Journal of Physics 68, S52 (2000)

work page 2000

[13] [13]

H. L. Roediger and E. J. Marsh, The positive and negative con- sequences of multiple-choice testing., Journal of Experimental Psychology: Learning, Memory, and Cognition31, 1155–1159 (2005)

work page 2005

[14] [14]

M. Good, E. Marshman, E. Yerushalmi, and C. Singh, Physics teaching assistants’ views of different types of introductory problems: Challenge of perceiving the instructional benefits of context-rich and multiple-choice problems, Physical Review Physics Education Research 15, 020130 (2019)

work page 2019

[15] [15]

Bao and E

L. Bao and E. F. Redish, Model analysis: Representing and assessing the dynamics of student learning, Physical Review Special Topics-Physics Education Research 2, 010103 (2006)

work page 2006

[16] [16]

Aleven, E

V . Aleven, E. A. McLaughlin, and M. Glassman, Ai in educa- tion: A critical review and conceptual framework, Educational Psychologist 57, 145 (2022)

work page 2022

[17] [17]

Munsell, N

J. Munsell, N. S. Rebello, and C. M. Rebello, Using natural language processing to predict student problem solving perfor- mance, in 2021 Physics Education Research Conference Pro- ceedings (2021)

work page 2021

[18] [18]

Casalino, B

G. Casalino, B. Cafarelli, E. del Gobbo, L. Fontanella, L. Grilli, A. Guarino, P. Limone, D. Schicchi, and D. Taibi, Framing au- tomatic grading techniques for open-ended questionnaires re- sponses. a short survey (2021)

work page 2021

[19] [19]

Kortemeyer, Toward ai grading of student problem solutions in introductory physics: A feasibility study, Physical Review Physics Education Research 19, 020163 (2023)

G. Kortemeyer, Toward ai grading of student problem solutions in introductory physics: A feasibility study, Physical Review Physics Education Research 19, 020163 (2023)

work page 2023

[20] [20]

Department of Education, Office of Educational Tech- nology, Artificial Intelligence and the Future of Teaching and Learning: Insights and Recommendations , Tech

U.S. Department of Education, Office of Educational Tech- nology, Artificial Intelligence and the Future of Teaching and Learning: Insights and Recommendations , Tech. Rep. (U.S. Department of Education, 2023)

work page 2023

[21] [21]

Weijers, W

S. Weijers, W. Westera, and M. Wiering, From intuition to un- derstanding: Using ai peers to overcome physics misconcep- tions, arXiv preprint arXiv:2504.00408 (2025)

work page arXiv 2025

[22] [22]

Wang, Physical Review B94, 10.1103/phys- revb.94.195105 (2016)

T. Wan and Z. Chen, Exploring generative ai assisted feedback writing for students’ written responses to a physics conceptual question with prompt engineering and few-shot learning, Phys- ical Review Physics Education Research 20, 10.1103/phys- revphyseducres.20.010152 (2024)

work page doi:10.1103/phys- 2024

[23] [23]

Khan, The amazing ai super tutor for students and teachers, Video

S. Khan, The amazing ai super tutor for students and teachers, Video. TED Conference (2023)

work page 2023

[24] [24]

P. G. Butcher and S. E. Jordan, A comparison of human and computer marking of short free-text student responses, Com- puters & Education 55, 489 (2010)

work page 2010

[25] [25]

H. R. Salim, C. De, N. D. Pratamaputra, and D. Suhartono, Indonesian automatic short answer grading system, Bulletin of Electrical Engineering and Informatics 11, 1586–1603 (2022)

work page 2022

[26] [26]

K. L. McNeill and J. S. Krajcik, Supporting Grade 5-8 Stu- dents in Constructing Explanations in Science: The Claim, Ev- idence, and Reasoning Framework for Talk and Writing(Pear- son, 2011)

work page 2011

[27] [27]

N. F. Afif, M. G. Nugraha, and A. Samsudin, Developing en- ergy and momentum conceptual survey (emcs) with four-tier diagnostic test items, in AIP Conference Proceedings (Au- thor(s), 2017)

work page 2017

[28] [28]

D2L Inc., Brightspace learning management system (2025), accessed May 18, 2025

work page 2025

[29] [29]

OpenAI, Chatgpt, https://chat.openai.com/chat (2025), [Ac- cessed May 2025]

work page 2025

[30] [30]

B. Chen, Z. Zhang, N. Langrené, and S. Zhu, Unleashing the potential of prompt engineering in large language models: a comprehensive review (2023), arXiv:2310.14735

work page arXiv 2023

[31] [31]

K. L. Sainani, Reliability statistics, PM&R 9, 622–628 (2017)

work page 2017

[32] [32]

Latif and X

E. Latif and X. Zhai, Integrating generative ai into stem educa- tion: Enhancing conceptual understanding, addressing miscon- ceptions, and assessing student acceptance, Disciplinary and Interdisciplinary Science Education Research 7, 11 (2025)

work page 2025

[33] [33]

Zhou, S.-M

L. Zhou, S.-M. Kim, and N. Ahmed, Artificial intelligence ap- plications in education: Natural language processing in detect- ing misconceptions, Education and Information Technologies 10.1007/s10639-024-12919-1 (2024). 5

work page doi:10.1007/s10639-024-12919-1 2024