Using Large Language Models in Physics Education
Pith reviewed 2026-05-25 02:22 UTC · model grok-4.3
The pith
Frontier large language models achieve near-perfect scores on university physics problems and show improved alignment with human grading.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Recent architectures such as ChatGPT-5.1 and Gemini 3.0 Pro achieve near-perfect scores on text-based reasoning and demonstrate significant improvements in alignment with human grading, heavily mitigating the systemic over-marking observed in earlier iterations, while native multimodal integration resolves previous limitations in spatial geometry and topological interpretation.
What carries the argument
Three complementary studies that test LLMs first on generating accurate solutions to physics problems and then on reliability as automated graders against formal mark schemes.
If this is right
- LLMs can provide viable support for independent student learning in physics courses.
- Instructional automation for grading becomes more feasible with newer model versions.
- Limitations in assigning partial credit to ambiguous or incomplete reasoning must be actively managed.
- Multimodal capabilities now allow reliable interpretation of diagrams accompanying physics problems.
Where Pith is reading between the lines
- Physics departments could pilot LLM-assisted homework systems that flag cases needing human review for partial credit.
- Model training focused on partial-credit logic could further reduce the remaining grading gaps.
- Parallel evaluations in other STEM subjects would test whether the observed trajectory holds beyond physics.
Load-bearing premise
The university-level problems, mark schemes, and student solutions used in the three studies are representative of typical physics coursework and generalize to other problems and cohorts.
What would settle it
A follow-up evaluation in which the same models produce substantially lower scores or show renewed over-marking on a new collection of physics problems and student solutions not included in the original studies.
Figures
read the original abstract
The rapid advancement of Large Language Models (LLMs) has introduced new possibilities and challenges in physics education, necessitating rigorous evaluation of their capabilities as both problem solvers and automated assessors. This paper presents the results of three complementary studies that evaluated frontier models released between mid-2024 and late-2025. Models were assessed on their ability to generate accurate, step-by-step solutions to university-level physics problems in Classical Mechanics, Electromagnetism, and Quantum Mechanics, and subsequently on their reliability in grading student solutions against a formal mark scheme. The results indicate a clear trajectory toward benchmark saturation in text-based reasoning, with recent architectures (such as ChatGPT-5.1 and Gemini 3.0 Pro) achieving near-perfect scores. Furthermore, recent advances in native multimodal integration have resolved previous limitations in spatial geometry and topological interpretation, enabling models to accurately process accompanying diagrams. As automated assessors, newer models demonstrated significant improvements in alignment with human grading, heavily mitigating the systemic over-marking observed in earlier iterations. However, while models reliably evaluate fully correct handwritten work, assigning partial credit to flawed or incomplete reasoning remains a persistent challenge. These findings suggest that as of late 2025, LLMs offer viable support for both independent student learning and instructional automation, provided their limitations in evaluating ambiguous reasoning are actively managed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports results from three complementary empirical studies evaluating frontier LLMs (mid-2024 to late-2025 releases) as problem solvers and automated graders on university-level physics problems drawn from Classical Mechanics, Electromagnetism, and Quantum Mechanics. It claims a clear trajectory toward benchmark saturation, with models such as ChatGPT-5.1 and Gemini 3.0 Pro achieving near-perfect scores on text-based reasoning, resolution of prior multimodal limitations via native diagram processing, and substantially improved alignment with human grading that mitigates earlier over-marking tendencies. Partial credit for flawed reasoning remains challenging, but the work concludes that LLMs now offer viable support for independent student learning and instructional automation when limitations are managed.
Significance. If the performance and alignment claims hold with adequate documentation, the work would document measurable progress in LLM capabilities relevant to physics education research, providing concrete evidence that recent architectures can support both problem-solving practice and assessment tasks at the university level. This could inform the design of hybrid instructional tools while underscoring the need for human oversight on ambiguous cases.
major comments (2)
- [Abstract / Methods] Abstract and Methods: The abstract states that recent models achieve near-perfect scores and significant grading alignment improvements, yet supplies no sample problems, mark schemes, inter-rater statistics, error analysis, or quantitative results. This absence is load-bearing for the central empirical claims and prevents verification of the reported trajectory toward benchmark saturation.
- [Abstract] Abstract: The conclusion that LLMs offer viable support for typical physics coursework rests on the assumption that the chosen problems, mark schemes, and student solutions are representative; however, no information is given on selection criteria, difficulty distribution, presence of ambiguity or open-ended elements, or diagram versus text balance, undermining the generalizability asserted in the final paragraph.
minor comments (1)
- [Abstract] The abstract refers to 'three complementary studies' without indicating their individual scopes or how they complement one another; a brief overview sentence would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their careful review and constructive feedback. We address each major comment below and have revised the manuscript to improve the documentation and transparency of our empirical claims.
read point-by-point responses
-
Referee: [Abstract / Methods] Abstract and Methods: The abstract states that recent models achieve near-perfect scores and significant grading alignment improvements, yet supplies no sample problems, mark schemes, inter-rater statistics, error analysis, or quantitative results. This absence is load-bearing for the central empirical claims and prevents verification of the reported trajectory toward benchmark saturation.
Authors: The abstract provides a high-level summary of the three studies, while the full manuscript details sample problems, mark schemes, inter-rater agreement statistics, error categorizations, and quantitative scores (including near-perfect performance metrics for models such as ChatGPT-5.1) in the Methods and Results sections. To facilitate verification without relying solely on the body text, we have revised the abstract to incorporate key quantitative results and added explicit cross-references to the supporting tables and analyses in the main text. The Methods section has also been expanded with additional documentation of these elements. revision: yes
-
Referee: [Abstract] Abstract: The conclusion that LLMs offer viable support for typical physics coursework rests on the assumption that the chosen problems, mark schemes, and student solutions are representative; however, no information is given on selection criteria, difficulty distribution, presence of ambiguity or open-ended elements, or diagram versus text balance, undermining the generalizability asserted in the final paragraph.
Authors: The Methods section outlines the sourcing of problems from standard university curricula across Classical Mechanics, Electromagnetism, and Quantum Mechanics, including a combination of text-based and diagram-accompanied items. However, we acknowledge that explicit details on selection criteria, difficulty distribution, and the presence of open-ended or ambiguous elements were not sufficiently foregrounded. We have revised the Methods section to include this information and added a clarifying statement to the abstract on the representativeness of the problem set and the text-diagram balance. revision: yes
Circularity Check
No circularity: empirical benchmarking study with no derivations or self-referential claims
full rationale
This is a purely empirical paper reporting results from three studies that benchmark LLMs on university physics problems and grading tasks. It contains no equations, no fitted parameters, no predictions derived from inputs, no uniqueness theorems, and no self-citations used to justify core premises. All claims rest on direct performance measurements against mark schemes rather than any chain that reduces to the paper's own definitions or prior outputs by construction. The generalizability concern raised in the skeptic note is a question of external validity, not circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Stage I: Solution Generation To maintain structural and formatting consistency across all trials, each question was transcribed into LaTeX before entry into the model. This conversion process leveraged the native multimodal capabilities of ChatGPT-4o to extract text and mathematical notation directly from the source PDFs. To simulate real-world usage and ...
-
[2]
Conversely, for the MM study (Set C), a bespoke marking scheme was developed
Stage II: Human Evaluation For the human evaluation baseline (Stage II) of the PB studies, the established marking rubric from the UCL dataset [30] was utilised. Conversely, for the MM study (Set C), a bespoke marking scheme was developed. This rubric was constructed by delineating generalised solu- tion pathways and apportioning marks across fundamen- ta...
-
[3]
The models were constrained to assign integer scores and pro- vide generalised feedback
Stage III: AI Evaluation The evaluation prompt adopted a revised expert per- sona, positioning the models as ”physics professors” to enforce a rigorous, pedagogical evaluative standard. The models were constrained to assign integer scores and pro- vide generalised feedback. Furthermore, the input con- text strictly delimited the problem statements, candid...
-
[4]
PB1 Given the highly competitive nature of the LLM land- scape during the study period, frontier models were fre- quently released in clustered cycles to maintain market parity. Consequently, the evaluated models have been categorised into three chronological generations: •Generation 1(May 2024): ChatGPT-4o and Gemini 1.5 Pro. •Generation 2(December 2024)...
work page 2024
-
[5]
All subsequent non-multimodal evaluations utilised the condensed Set B dataset
PB2 Before detailing the results for recent architectures, a methodological shift must be noted. All subsequent non-multimodal evaluations utilised the condensed Set B dataset. Again, to maintain cross-study continuity, all questions retain their original Set A numerical designa- tions (as mapped in Table III). Furthermore, to account for how this dataset...
work page 2025
-
[6]
MM The preceding phases established that while Gen- eration 5 models possess robust text-based reasoning, earlier generations suffered from a disconnect between syntactic processing and visual grounding. To test whether modern architectures have bridged this gap, the models were evaluated on Set C: a dedicated, na- tively multimodal problem set comprised ...
work page 2025
-
[7]
PB1 After having probed the LLM’s ability to solve prob- lems, this chapter evaluates the reliability of using LLMs as markers. Figures 20 and 21 present the relation be- tween grades awarded by the six LLMs and humans across the three core topics, comparing the cases where the LLMs were either provided with, or deprived of, the mark schemes. To mirror th...
work page 1902
-
[8]
PB2 Having established that Gen 4 and Gen 5 models pos- sess the foundational reasoning to solve the benchmark entirely, the final stage of analysis evaluates their utility as automated assessors. To conduct this, the method- ology was refined: rather than batching the three gen- erated ChatGPT-4o solutions together within a single prompt—as was done in P...
-
[9]
MM Handwritten Grading Across all models, grading accuracy was consistently higher for perfect handwritten solutions than for im- perfect ones. Mean absolute error (MAE) analysis shows near-zero deviation from human grading for per- fect scripts for all models, indicating reliable recognition of canonical solution structures and correct reasoning patterns...
-
[10]
T. Brownet al., Language models are few-shot learners, inAdvances in Neural Information Processing Systems, edited by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Curran Associates, Inc., 2020) pp. 1877–1901
work page 2020
-
[11]
J. W. Raeet al., Scaling language models: Meth- ods, analysis & insights from training gopher (2021), arXiv:2112.11446 [cs.CL]
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[12]
L. Ouyanget al., Training language models to follow instructions with human feedback, inAdvances in Neu- ral Information Processing Systems, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Curran Associates, Inc., 2022) pp. 27730–27744
work page 2022
-
[13]
O. Zawacki-Richter, V. I. Mar´ ın, M. Bond, and F. Gou- verneur, Systematic review of research on artificial intelli- gence applications in higher education – where are the ed- ucators?, International Journal of Educational Technol- ogy in Higher Education16, 10.1186/s41239-019-0171-0 (2019)
-
[14]
S. Steenbergen-Hu and H. Cooper, A meta-analysis of the effectiveness of intelligent tutoring systems on col- lege students’ academic learning, Journal of Educational Psychology106, 331 (2014)
work page 2014
-
[15]
Training Verifiers to Solve Math Word Problems
K. Cobbeet al., Training verifiers to solve math word problems (2021), arXiv:2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[16]
Survey of Hallucination in Natural Language Generation
Z. Jiet al., Survey of hallucination in natural language generation, ACM Computing Surveys 10.1145/3571730 (2022)
-
[17]
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
P. Lewiset al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Neural Information Pro- cessing Systemsabs/2005.11401, 9459 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[18]
S. Wanget al., Large language models for education: A survey and outlook, IEEE Signal Processing Magazine 42, 51 (2025)
work page 2025
-
[19]
Emergent Abilities of Large Language Models
J. Weiet al., Emergent abilities of large language models (2022), arXiv:2206.07682 [cs.CL]
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[20]
T. Webb, K. J. Holyoak, and H. Lu, Emergent analog- ical reasoning in large language models, Nature Human Behaviour7, 1526 (2023)
work page 2023
-
[21]
W. Yeadon, O. Inyang, A. Mizouri, A. Peach, and C. P. Testrow, The death of the short-form physics es- say in the coming AI revolution, Physics Education58, 10.1088/1361-6552/acc5cf (2022)
- [22]
- [23]
-
[24]
Jonsson, Rapportsl¨ app: Back2school 2023 (2025), ac- cessed: Mar
S. Jonsson, Rapportsl¨ app: Back2school 2023 (2025), ac- cessed: Mar. 25, 2025
work page 2023
-
[25]
B. Gregorcic and A.-M. Pendrill, ChatGPT and the frus- trated socrates, Physics Education58, 10.1088/1361- 6552/acc299 (2023)
-
[26]
G. Polverini and B. Gregorcic, How understanding large language models can inform the use of ChatGPT in physics education, European Journal of Physics45, 10.1088/1361-6404/ad1420 (2023)
-
[27]
The Russell Group, Russell group, ‘new principles on use of AI in education’
-
[28]
H. Deli´ c and S. Be´ cirovi´ c, Socratic method as an ap- proach to teaching, European Researcher. Series A (2016)
work page 2016
-
[29]
Halpern,Social Capital(Polity Press, Oxford, Eng- land, 2005)
D. Halpern,Social Capital(Polity Press, Oxford, Eng- land, 2005)
work page 2005
- [30]
-
[31]
B. Paris, Instructors’ perspectives of challenges and bar- riers to providing effective feedback, Teaching & Learning Inquiry10, 10.20343/teachlearninqu.10.3 (2022)
-
[32]
Z. Chuet al., LLM agents for education: Advances and applications, inFindings of the Association for Computational Linguistics: EMNLP 2025, edited by C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Association for Computational Linguistics, Stroudsburg, PA, USA, 2025) pp. 13782–13810. 19
work page 2025
-
[33]
Shearing, Teachers can use AI to save time on mark- ing, new guidance says (2025), accessed: Oct
H. Shearing, Teachers can use AI to save time on mark- ing, new guidance says (2025), accessed: Oct. 27, 2025
work page 2025
- [34]
- [35]
- [36]
- [37]
- [38]
-
[39]
R. Moket al., Using large language models for grading in education: an applied test for physics, Physics Education 60, 035006 (2025)
work page 2025
-
[40]
J. L. Donaldson, A. Nawaz, D. Constantinos, and A. Lim, Using llms for physics education: Datasets and evalua- tion figures (2026)
work page 2026
-
[41]
B. Xu, A. Yang, J. Lin, Q. Wang, C. Zhou, Y. Zhang, and Z. Mao, Expertprompting: Instructing large language models to be distinguished experts, arXiv preprint arXiv:2305.14688 10.48550/arXiv.2305.14688 (2023), arXiv:2305.14688 [cs.CL]
-
[42]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in Proceedings of the 36th International Conference on Neu- ral Information Processing Systems, NIPS ’22 No. 1800 (Curran Associates Inc., Red Hook, NY, USA, 2022) pp. 24824–24837
work page 2022
-
[43]
X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, Self- consistency improves chain of thought reasoning in lan- guage models, arXiv preprint arXiv:2203.11171 (2022), arXiv:2203.11171 [cs.CL]
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[44]
L. Zhenget al., Judging LLM-as-a-judge with MT-bench and chatbot arena, inProceedings of the 37th Interna- tional Conference on Neural Information Processing Sys- tems, NIPS ’23 (Curran Associates Inc., Red Hook, NY, USA, 2023) pp. 46595–46623
work page 2023
-
[45]
N. F. Liuet al., Lost in the middle: How language mod- els use long contexts, Transactions of the Association for Computational Linguistics12, 157 (2024)
work page 2024
- [46]
-
[47]
I. Khalid, A. M. Nourollah, and S. Schockaert, Large lan- guage and reasoning models are shallow disjunctive rea- soners, inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Association for Computational Lin- guistics, Stroudsburg, ...
work page 2025
-
[48]
Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y. Liu, D. Xiang, G. Wetzstein, and T.-Y. Lin, Cot-vla: Visual chain-of- thought reasoning for vision-language-action models, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)(2025). 20 Appendix A: Referenced Questions FIG. A1. Classical M...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.