pith. sign in

arxiv: 2508.14823 · v1 · submitted 2025-08-20 · ⚛️ physics.ed-ph

Using an LLM to Investigate Students' Explanations on Conceptual Physics Questions

Pith reviewed 2026-05-18 22:00 UTC · model grok-4.3

classification ⚛️ physics.ed-ph
keywords LLM assessmentphysics education researchstudent explanationsconceptual surveysenergy and momentummisconceptionsGPT-4oopen-ended responses
0
0 comments X

The pith

An LLM can grade students' written physics explanations as accurately as humans and surface misconceptions that multiple-choice tests miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether large language models can evaluate open-ended student responses to conceptual physics questions at scale. It demonstrates that GPT-4o classifies explanations as correct or incorrect with close agreement to human graders, showing only 0-3 percent discrepancy. The model also sorts incorrect explanations into categories that differ from the wrong choices offered in multiple-choice versions of the same questions. This difference matters because multiple-choice formats are common in large college classes but can hide the actual ways students reason about energy and momentum. If the finding holds, instructors could move beyond easy-to-grade tests while still gaining insight into student thinking.

Core claim

GPT-4o was used to assess written explanations on three questions from the Energy and Momentum Conceptual Survey, first classifying them as correct or incorrect and then grouping incorrect responses into emergent categories. The LLM's classifications matched those of human graders within 0-3 percent. The resulting incorrect-explanation categories were distinct from the distractors on the corresponding multiple-choice items, indicating that written responses make different and deeper student conceptions available to educators.

What carries the argument

Prompting GPT-4o to both judge explanation correctness against a rubric and derive emergent categories from incorrect responses, with human grading as validation.

If this is right

  • Physics instructors could analyze written work from large classes without the usual grading burden and still identify misconceptions not captured by multiple-choice tests.
  • Conceptual surveys could shift from multiple-choice to open-response formats while remaining practical to score.
  • Physics education researchers would gain a scalable method for studying student reasoning that goes beyond predefined answer choices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LLM approach might be tested on other conceptual surveys in physics or in related fields such as chemistry to see if deeper conceptions emerge consistently.
  • Longitudinal tracking could check whether the new categories predict how students respond to targeted teaching interventions.
  • Refining the prompts or combining LLM output with small human samples could strengthen reliability for routine classroom use.

Load-bearing premise

The emergent categories of incorrect explanations produced by the LLM reflect genuine patterns in student thinking rather than artifacts of the model's training data or the way the prompt was worded.

What would settle it

A follow-up study that interviews a sample of students about the reasoning behind their written explanations and checks whether those reasons align with the categories the LLM generated would test the claim.

Figures

Figures reproduced from arXiv: 2508.14823 by N. Sanjay Rebello, Sean Savage.

Figure 1
Figure 1. Figure 1: FIG. 1: Question 5 from the EMCS (correct choice, D) [ [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: FIG. 2: Question 16 from the EMCS (correct choice, C) [ [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: FIG. 3: Question 23 from the EMCS (correct choice, B) [ [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
read the original abstract

Analyzing students' written solutions to physics questions is a major area in PER. However, gauging student understanding in college courses is bottlenecked by large class sizes, which limits assessments to a multiple-choice (MC) format for ease of grading. Although sufficient in quantifying scientifically correct conceptions, MC assessments do not uncover students' deeper ways of understanding physics. Large language models (LLMs) offer a promising approach for assessing students' written responses at scale. Our study used an LLM, validated by human graders, to classify students' written explanations to three questions on the Energy and Momentum Conceptual Survey as correct or incorrect, and organized students' incorrect explanations into emergent categories. We found that the LLM (GPT-4o) can fairly assess students' explanations, comparable to human graders (0-3% discrepancy). Furthermore, the categories of incorrect explanations were different from corresponding MC distractors, allowing for different and deeper conceptions to become accessible to educators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes the use of GPT-4o to classify students' written explanations to three questions from the Energy and Momentum Conceptual Survey as correct or incorrect, and to organize the incorrect explanations into emergent categories. The central claims are that the LLM assessments show only 0-3% discrepancy with human graders and that the resulting categories differ from the distractors in the corresponding multiple-choice items, thereby surfacing deeper student conceptions.

Significance. If the validation is made rigorous, the work could be significant for physics education research by demonstrating a scalable method for analyzing open-ended responses in large classes. This addresses a longstanding limitation of multiple-choice formats and could enable instructors to access qualitative insights that are currently impractical to obtain at scale. The approach has clear potential to influence both assessment design and the study of student reasoning in PER.

major comments (2)
  1. [Results] The 0-3% discrepancy figure between LLM and human graders is reported without stating the total number of student responses scored, the size of the validation subsample, or any inter-rater agreement statistics among the human graders. Without these quantities the discrepancy cannot be meaningfully interpreted relative to normal human variation.
  2. [Methods] The Methods section provides no information on prompt construction, temperature settings, or few-shot examples used for either the binary classification or the emergent categorization tasks. It is therefore impossible to assess whether the reported categories reflect stable student conceptions or are sensitive to prompt phrasing.
minor comments (2)
  1. [Abstract] The abstract states that three questions were used but does not identify them; adding the specific item numbers or brief descriptions would improve reproducibility.
  2. [Methods] A short table summarizing the exact prompt templates and the number of responses per question would clarify the experimental setup without lengthening the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which identify key areas where additional detail will improve the transparency and interpretability of our work. We address each major comment below and have prepared revisions accordingly.

read point-by-point responses
  1. Referee: [Results] The 0-3% discrepancy figure between LLM and human graders is reported without stating the total number of student responses scored, the size of the validation subsample, or any inter-rater agreement statistics among the human graders. Without these quantities the discrepancy cannot be meaningfully interpreted relative to normal human variation.

    Authors: We agree that these quantitative details are necessary to place the reported discrepancy in proper context relative to typical human grading variation. In the revised manuscript we will explicitly state the total number of student responses scored, the size of the validation subsample graded by humans, and the inter-rater agreement statistics (e.g., percentage agreement or Cohen’s kappa) among the human graders. These additions will allow readers to evaluate the 0–3 % figure more rigorously. revision: yes

  2. Referee: [Methods] The Methods section provides no information on prompt construction, temperature settings, or few-shot examples used for either the binary classification or the emergent categorization tasks. It is therefore impossible to assess whether the reported categories reflect stable student conceptions or are sensitive to prompt phrasing.

    Authors: We acknowledge that the current Methods section lacks the necessary detail on prompting procedures. In the revision we will expand this section to describe how the prompts were constructed, report the temperature setting used with GPT-4o, and include any few-shot examples provided for the binary classification and emergent categorization tasks. These additions will enable readers to assess the stability of the resulting categories with respect to prompt design. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical validation against human graders and MC distractors

full rationale

The paper is an empirical study that applies GPT-4o to classify student written explanations on conceptual physics items, reports a 0-3% discrepancy with human graders, and compares emergent incorrect-explanation categories to MC distractors. No equations, fitted parameters, or first-principles derivations appear; the central claims rest on direct comparison to external human scoring and existing MC instruments rather than on self-referential definitions or self-citation chains. The work is therefore self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on the assumption that LLM outputs can be treated as proxies for human judgment of student understanding and that emergent categories capture real conceptual differences; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption LLM classifications of student explanations can be validated against human graders with low discrepancy
    Invoked when claiming 0-3% discrepancy and using LLM to organize incorrect explanations.

pith-pipeline@v0.9.0 · 5689 in / 1262 out tokens · 27911 ms · 2026-05-18T22:00:40.817896+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We found that the LLM (GPT-4o) can fairly assess students' explanations, comparable to human graders (0-3% discrepancy). Furthermore, the categories of incorrect explanations were different from corresponding MC distractors

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

  1. [1]

    Use claim, evidence, and reasoning to com- pare how fast the blocks are moving concerning each other after the collision. Use complete sentences

    that students activate in the context of questions in the inventory. Furthermore, prior studies have demonstrated that repeated students’ exposure to distractors (incorrect MC op- tions) strengthens incorrect conceptual associations[12]. MC inventories offer a rich repertoire of questions de- signed to assess students’ conceptual understanding of physics ...

  2. [2]

    Hestenes, M

    D. Hestenes, M. Wells, and G. Swackhamer, Force concept in- ventory, The Physics Teacher30, 141 (1992)

  3. [3]

    R. K. Thornton and D. R. Sokoloff, Assessing student learning of newton’s laws: The force and motion conceptual evaluation and the evaluation of active learning laboratory and lecture cur- ricula, American Journal of Physics 66, 338 (1998)

  4. [4]

    Multiple-choice test of energy and momentum concepts

    C. Singh and D. Rosengrant, Multiple-choice test of energy and momentum concepts, arXiv preprint arXiv:1602.06497 (2016)

  5. [5]

    Nieswandt and K

    M. Nieswandt and K. Bellomo, Written extended-response questions and the assessment of science learning: What do stu- dents’ answers tell us?, International Journal of Science Edu- cation 31, 2117 (2009)

  6. [6]

    W. L. Kuechler and M. G. Simkin, How well do multiple choice tests evaluate student understanding in computer programming classes? (2003)

  7. [7]

    Petersen, M

    A. Petersen, M. Craig, and P. Denny, Employing multiple- answer multiple choice questions, in Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education, ITiCSE ’16 (ACM, 2016)

  8. [8]

    L. A. Shepard, The role of assessment in a learning culture, Educational Researcher 29, 4 (2000)

  9. [9]

    C. Wong, P. Denny, A. Luxton-Reilly, and J. Whalley, The im- pact of multiple choice question design on predictions of per- formance, in Proceedings of the 23rd Australasian Computing Education Conference, ACE ’21 (ACM, 2021)

  10. [10]

    E. Wood, N. Klausz, and S. MacNeil, Examining the influence of multiple-choice test formats on student performance, Inno- vative Higher Education 47, 515–531 (2021)

  11. [11]

    N. S. Rebello and D. A. Zollman, The effect of distracters on student performance on the force concept inventory, American Journal of Physics 72, 116 (2004)

  12. [12]

    Hammer, Student resources for learning introductory physics, American Journal of Physics 68, S52 (2000)

    D. Hammer, Student resources for learning introductory physics, American Journal of Physics 68, S52 (2000)

  13. [13]

    H. L. Roediger and E. J. Marsh, The positive and negative con- sequences of multiple-choice testing., Journal of Experimental Psychology: Learning, Memory, and Cognition31, 1155–1159 (2005)

  14. [14]

    M. Good, E. Marshman, E. Yerushalmi, and C. Singh, Physics teaching assistants’ views of different types of introductory problems: Challenge of perceiving the instructional benefits of context-rich and multiple-choice problems, Physical Review Physics Education Research 15, 020130 (2019)

  15. [15]

    Bao and E

    L. Bao and E. F. Redish, Model analysis: Representing and assessing the dynamics of student learning, Physical Review Special Topics-Physics Education Research 2, 010103 (2006)

  16. [16]

    Aleven, E

    V . Aleven, E. A. McLaughlin, and M. Glassman, Ai in educa- tion: A critical review and conceptual framework, Educational Psychologist 57, 145 (2022)

  17. [17]

    Munsell, N

    J. Munsell, N. S. Rebello, and C. M. Rebello, Using natural language processing to predict student problem solving perfor- mance, in 2021 Physics Education Research Conference Pro- ceedings (2021)

  18. [18]

    Casalino, B

    G. Casalino, B. Cafarelli, E. del Gobbo, L. Fontanella, L. Grilli, A. Guarino, P. Limone, D. Schicchi, and D. Taibi, Framing au- tomatic grading techniques for open-ended questionnaires re- sponses. a short survey (2021)

  19. [19]

    Kortemeyer, Toward ai grading of student problem solutions in introductory physics: A feasibility study, Physical Review Physics Education Research 19, 020163 (2023)

    G. Kortemeyer, Toward ai grading of student problem solutions in introductory physics: A feasibility study, Physical Review Physics Education Research 19, 020163 (2023)

  20. [20]

    Department of Education, Office of Educational Tech- nology, Artificial Intelligence and the Future of Teaching and Learning: Insights and Recommendations , Tech

    U.S. Department of Education, Office of Educational Tech- nology, Artificial Intelligence and the Future of Teaching and Learning: Insights and Recommendations , Tech. Rep. (U.S. Department of Education, 2023)

  21. [21]

    Weijers, W

    S. Weijers, W. Westera, and M. Wiering, From intuition to un- derstanding: Using ai peers to overcome physics misconcep- tions, arXiv preprint arXiv:2504.00408 (2025)

  22. [22]

    Wang, Physical Review B94, 10.1103/phys- revb.94.195105 (2016)

    T. Wan and Z. Chen, Exploring generative ai assisted feedback writing for students’ written responses to a physics conceptual question with prompt engineering and few-shot learning, Phys- ical Review Physics Education Research 20, 10.1103/phys- revphyseducres.20.010152 (2024)

  23. [23]

    Khan, The amazing ai super tutor for students and teachers, Video

    S. Khan, The amazing ai super tutor for students and teachers, Video. TED Conference (2023)

  24. [24]

    P. G. Butcher and S. E. Jordan, A comparison of human and computer marking of short free-text student responses, Com- puters & Education 55, 489 (2010)

  25. [25]

    H. R. Salim, C. De, N. D. Pratamaputra, and D. Suhartono, Indonesian automatic short answer grading system, Bulletin of Electrical Engineering and Informatics 11, 1586–1603 (2022)

  26. [26]

    K. L. McNeill and J. S. Krajcik, Supporting Grade 5-8 Stu- dents in Constructing Explanations in Science: The Claim, Ev- idence, and Reasoning Framework for Talk and Writing(Pear- son, 2011)

  27. [27]

    N. F. Afif, M. G. Nugraha, and A. Samsudin, Developing en- ergy and momentum conceptual survey (emcs) with four-tier diagnostic test items, in AIP Conference Proceedings (Au- thor(s), 2017)

  28. [28]

    D2L Inc., Brightspace learning management system (2025), accessed May 18, 2025

  29. [29]

    OpenAI, Chatgpt, https://chat.openai.com/chat (2025), [Ac- cessed May 2025]

  30. [30]

    B. Chen, Z. Zhang, N. Langrené, and S. Zhu, Unleashing the potential of prompt engineering in large language models: a comprehensive review (2023), arXiv:2310.14735

  31. [31]

    K. L. Sainani, Reliability statistics, PM&R 9, 622–628 (2017)

  32. [32]

    Latif and X

    E. Latif and X. Zhai, Integrating generative ai into stem educa- tion: Enhancing conceptual understanding, addressing miscon- ceptions, and assessing student acceptance, Disciplinary and Interdisciplinary Science Education Research 7, 11 (2025)

  33. [33]

    Zhou, S.-M

    L. Zhou, S.-M. Kim, and N. Ahmed, Artificial intelligence ap- plications in education: Natural language processing in detect- ing misconceptions, Education and Information Technologies 10.1007/s10639-024-12919-1 (2024). 5