Using an LLM to Investigate Students' Explanations on Conceptual Physics Questions
Pith reviewed 2026-05-18 22:00 UTC · model grok-4.3
The pith
An LLM can classify students' written physics explanations as correct or incorrect about as reliably as human graders (0-3% discrepancy), and it can surface misconceptions that multiple-choice tests miss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GPT-4o was used to assess written explanations on three questions from the Energy and Momentum Conceptual Survey, first classifying them as correct or incorrect and then grouping incorrect responses into emergent categories. The LLM's classifications matched those of human graders within 0-3 percent. The resulting incorrect-explanation categories were distinct from the distractors on the corresponding multiple-choice items, indicating that written responses make different and deeper student conceptions available to educators.
What carries the argument
Prompting GPT-4o to both judge explanation correctness against a rubric and derive emergent categories from incorrect responses, with human grading as validation.
If this is right
- Physics instructors could analyze written work from large classes without the usual grading burden and still identify misconceptions not captured by multiple-choice tests.
- Conceptual surveys could shift from multiple-choice to open-response formats while remaining practical to score.
- Physics education researchers would gain a scalable method for studying student reasoning that goes beyond predefined answer choices.
Where Pith is reading between the lines
- The same LLM approach might be tested on other conceptual surveys in physics or in related fields such as chemistry to see if deeper conceptions emerge consistently.
- Longitudinal tracking could check whether the new categories predict how students respond to targeted teaching interventions.
- Refining the prompts or combining LLM output with small human samples could strengthen reliability for routine classroom use.
Load-bearing premise
The emergent categories of incorrect explanations produced by the LLM reflect genuine patterns in student thinking rather than artifacts of the model's training data or the way the prompt was worded.
What would settle it
A follow-up study that interviews a sample of students about the reasoning behind their written explanations and checks whether those reasons align with the categories the LLM generated would test the claim.
Original abstract
Analyzing students' written solutions to physics questions is a major area in PER. However, gauging student understanding in college courses is bottlenecked by large class sizes, which limits assessments to a multiple-choice (MC) format for ease of grading. Although sufficient in quantifying scientifically correct conceptions, MC assessments do not uncover students' deeper ways of understanding physics. Large language models (LLMs) offer a promising approach for assessing students' written responses at scale. Our study used an LLM, validated by human graders, to classify students' written explanations to three questions on the Energy and Momentum Conceptual Survey as correct or incorrect, and organized students' incorrect explanations into emergent categories. We found that the LLM (GPT-4o) can fairly assess students' explanations, comparable to human graders (0-3% discrepancy). Furthermore, the categories of incorrect explanations were different from corresponding MC distractors, allowing for different and deeper conceptions to become accessible to educators.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the use of GPT-4o to classify students' written explanations to three questions from the Energy and Momentum Conceptual Survey as correct or incorrect, and to organize the incorrect explanations into emergent categories. The central claims are that the LLM assessments show only 0-3% discrepancy with human graders and that the resulting categories differ from the distractors in the corresponding multiple-choice items, thereby surfacing deeper student conceptions.
Significance. If the validation is made rigorous, the work could be significant for physics education research by demonstrating a scalable method for analyzing open-ended responses in large classes. This addresses a longstanding limitation of multiple-choice formats and could enable instructors to access qualitative insights that are currently impractical to obtain at scale. The approach has clear potential to influence both assessment design and the study of student reasoning in PER.
major comments (2)
- [Results] The 0-3% discrepancy figure between LLM and human graders is reported without stating the total number of student responses scored, the size of the validation subsample, or any inter-rater agreement statistics among the human graders. Without these quantities the discrepancy cannot be meaningfully interpreted relative to normal human variation.
- [Methods] The Methods section provides no information on prompt construction, temperature settings, or few-shot examples used for either the binary classification or the emergent categorization tasks. It is therefore impossible to assess whether the reported categories reflect stable student conceptions or are sensitive to prompt phrasing.
minor comments (2)
- [Abstract] The abstract states that three questions were used but does not identify them; adding the specific item numbers or brief descriptions would improve reproducibility.
- [Methods] A short table summarizing the exact prompt templates and the number of responses per question would clarify the experimental setup without lengthening the text.
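The first major comment turns on sample size. As a rough illustration, the sketch below uses a Wilson score interval to show how the uncertainty around an observed 3% disagreement rate depends on how many responses were checked. The function name `wilson_interval` and the sample sizes are assumptions introduced for this example, not values from the paper, and the sketch treats the discrepancy as a per-response disagreement rate, which the abstract does not confirm.

```python
import math


def wilson_interval(disagreements: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a disagreement proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = disagreements / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical validation-sample sizes (assumed, not reported in the paper).
for n in (100, 1000):
    low, high = wilson_interval(round(0.03 * n), n)
    print(f"n={n}: observed 3% disagreement, 95% interval {low:.3f} to {high:.3f}")
```

The interval narrows as the validation subsample grows, which is why the same headline percentage can be weak or strong evidence depending on the count behind it.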
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which identify key areas where additional detail will improve the transparency and interpretability of our work. We address each major comment below and have prepared revisions accordingly.
Point-by-point responses
- Referee: [Results] The 0-3% discrepancy figure between LLM and human graders is reported without stating the total number of student responses scored, the size of the validation subsample, or any inter-rater agreement statistics among the human graders. Without these quantities the discrepancy cannot be meaningfully interpreted relative to normal human variation.
  Authors: We agree that these quantitative details are necessary to place the reported discrepancy in proper context relative to typical human grading variation. In the revised manuscript we will explicitly state the total number of student responses scored, the size of the validation subsample graded by humans, and the inter-rater agreement statistics (e.g., percentage agreement or Cohen’s kappa) among the human graders. These additions will allow readers to evaluate the 0-3% figure more rigorously.
  Revision: yes
- Referee: [Methods] The Methods section provides no information on prompt construction, temperature settings, or few-shot examples used for either the binary classification or the emergent categorization tasks. It is therefore impossible to assess whether the reported categories reflect stable student conceptions or are sensitive to prompt phrasing.
  Authors: We acknowledge that the current Methods section lacks the necessary detail on prompting procedures. In the revision we will expand this section to describe how the prompts were constructed, report the temperature setting used with GPT-4o, and include any few-shot examples provided for the binary classification and emergent categorization tasks. These additions will enable readers to assess the stability of the resulting categories with respect to prompt design.
  Revision: yes
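The agreement statistics named in the first response (percentage agreement and Cohen's kappa) can be sketched as follows. This is a minimal illustration, not the authors' analysis: the function names and the ten example labels are hypothetical, and the paper does not report which statistic it will ultimately use.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / n**2
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels for ten responses (made up for illustration only).
human = ["correct"] * 6 + ["incorrect"] * 4
llm = ["correct"] * 5 + ["incorrect"] * 5

print(f"percent agreement: {percent_agreement(human, llm):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(human, llm):.2f}")
```

Kappa is lower than raw agreement whenever some of the matching labels could have arisen by chance, which is why the referee asks for it alongside the headline discrepancy.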
Circularity Check
No circularity: empirical validation against human graders and MC distractors
The paper is an empirical study that applies GPT-4o to classify student written explanations on conceptual physics items, reports a 0-3% discrepancy with human graders, and compares emergent incorrect-explanation categories to MC distractors. No equations, fitted parameters, or first-principles derivations appear; the central claims rest on direct comparison to external human scoring and existing MC instruments rather than on self-referential definitions or self-citation chains. The work is therefore self-contained against external benchmarks and receives a score of 0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM classifications of student explanations can be validated against human graders with low discrepancy
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We found that the LLM (GPT-4o) can fairly assess students' explanations, comparable to human graders (0-3% discrepancy). Furthermore, the categories of incorrect explanations were different from corresponding MC distractors"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.