Using Large Language Models in Physics Education

Aliya Navaz; Alysta Lim; Jonah R. Donaldson; Konstantinos Doran; Mario Campanelli

arxiv: 2605.23660 · v1 · pith:XOLQAKSCnew · submitted 2026-05-22 · ⚛️ physics.ed-ph

Pith reviewed 2026-05-25 02:22 UTC · model grok-4.3

classification ⚛️ physics.ed-ph

keywords: large language models, physics education, automated assessment, problem solving, grading alignment, multimodal models, classical mechanics, quantum mechanics

The pith

Frontier large language models achieve near-perfect scores on university physics problems and show improved alignment with human grading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports results from three studies testing large language models released between mid-2024 and late-2025 on university-level physics tasks. The models generate accurate step-by-step solutions in classical mechanics, electromagnetism, and quantum mechanics, and they grade student work against formal mark schemes. Recent versions reach near-perfect text-based reasoning scores, handle diagrams through multimodal advances, and align more closely with human markers than earlier models. Partial credit for flawed or incomplete reasoning remains difficult. The findings indicate that these models can support student learning and instructional automation when their remaining limitations are addressed.

Core claim

Recent architectures such as ChatGPT-5.1 and Gemini 3.0 Pro achieve near-perfect scores on text-based reasoning and demonstrate significant improvements in alignment with human grading, heavily mitigating the systemic over-marking observed in earlier iterations, while native multimodal integration resolves previous limitations in spatial geometry and topological interpretation.

What carries the argument

Three complementary studies that test LLMs first on generating accurate solutions to physics problems and then on reliability as automated graders against formal mark schemes.

If this is right

LLMs can provide viable support for independent student learning in physics courses.
Instructional automation for grading becomes more feasible with newer model versions.
Limitations in assigning partial credit to ambiguous or incomplete reasoning must be actively managed.
Multimodal capabilities now allow reliable interpretation of diagrams accompanying physics problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Physics departments could pilot LLM-assisted homework systems that flag cases needing human review for partial credit.
Model training focused on partial-credit logic could further reduce the remaining grading gaps.
Parallel evaluations in other STEM subjects would test whether the observed trajectory holds beyond physics.
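
The first extension above, flagging scripts that need human review, can be sketched as a small triage rule. This is an illustrative sketch rather than anything the paper describes: the function name `triage_script`, the idea of grading each script several times, and the `spread_tolerance` parameter are all assumptions introduced here. The only element taken from the paper is the observation that fully correct work is graded reliably while partial credit is not.

```python
def triage_script(llm_marks, max_marks, spread_tolerance=0):
    """Decide whether repeated LLM marks for one script can be auto-accepted.

    llm_marks: integer marks from repeated LLM gradings of the same script
               (repeating the grading is an assumption of this sketch).
    max_marks: full marks available for the question.
    spread_tolerance: largest max-min spread still treated as consistent.
    """
    if not llm_marks:
        raise ValueError("at least one mark is required")
    if any(m < 0 or m > max_marks for m in llm_marks):
        raise ValueError("marks must lie between 0 and max_marks")

    # Repeated gradings that disagree are sent to a human first.
    if max(llm_marks) - min(llm_marks) > spread_tolerance:
        return "review_inconsistent"

    # Only consistently full-mark scripts are accepted automatically.
    if all(m == max_marks for m in llm_marks):
        return "accept"

    # Anything short of full marks involves partial credit.
    return "review_partial_credit"


# Hypothetical examples (not data from the paper).
print(triage_script([10, 10, 10], max_marks=10))
print(triage_script([6, 6, 6], max_marks=10))
print(triage_script([4, 7, 6], max_marks=10))
```

The rule is deliberately conservative: it trades a larger human workload for a lower chance of an unreviewed partial-credit error.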

Load-bearing premise

The university-level problems, mark schemes, and student solutions used in the three studies are representative of typical physics coursework and generalize to other problems and cohorts.

What would settle it

A follow-up evaluation in which the same models produce substantially lower scores or show renewed over-marking on a new collection of physics problems and student solutions not included in the original studies.
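
One way such a follow-up could test for renewed over-marking is to estimate the mean signed grading error (LLM mark minus human mark) on the new scripts, together with an uncertainty interval. The sketch below uses a percentile bootstrap, which is a choice made here and not a method attributed to the paper; the function name, parameters, and sample marks are hypothetical.

```python
import random
from statistics import mean


def bootstrap_mean_signed_error(llm_marks, human_marks,
                                n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean signed error, lower bound, upper bound).

    A positive mean with a lower bound above zero would suggest
    systematic over-marking on the held-out scripts.
    """
    if len(llm_marks) != len(human_marks) or not llm_marks:
        raise ValueError("need two non-empty lists of equal length")

    errors = [l - h for l, h in zip(llm_marks, human_marks)]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(errors)

    # Resample the per-script errors with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(errors, k=n)) for _ in range(n_resamples)
    )

    # Percentile interval: cut alpha/2 from each tail of the sorted means.
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return mean(errors), resampled_means[lo_idx], resampled_means[hi_idx]


# Hypothetical held-out marks (not data from the paper).
human = [3, 5, 6, 4, 8]
llm = [4, 7, 7, 7, 10]
point, lower, upper = bootstrap_mean_signed_error(llm, human)
print(point, lower, upper)
```

With only a handful of scripts the interval is crude; the point of the sketch is the shape of the check, not the specific numbers.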

Figures

Figures reproduced from arXiv: 2605.23660 by Aliya Navaz, Alysta Lim, Jonah R. Donaldson, Konstantinos Doran, Mario Campanelli.

**Figure 2.** Mean percentage scores achieved by each model in …

**Figure 3.** Mean percentage scores achieved by each model in the …

**Figure 5.** Heat map showing model performance in Quantum …

**Figure 6.** Heat map showing model performance in Classical …

**Figure 7.** Heat map showing model performance in Electro…

**Figure 8.** Averaged percentage scores for Quantum Mechanics …

**Figure 9.** Averaged percentage scores for Classical Mechanics …

**Figure 10.** Averaged percentage scores for Electromagnetism …

**Figure 11.** Heat map of Quantum Mechanics performance …

**Figure 12.** Heat map of Classical Mechanics performance show…

**Figure 13.** Heat map of Electromagnetism performance high…

**Figure 14.** Multimodal performance in Electromagnetism: …

**Figure 15.** Multimodal performance in Electromagnetism: …

**Figure 18.** Multimodal performance in Quantum Mechanics: …

**Figure 19.** Multimodal performance in Electromagnetism: …

**Figure 20.** Linear Regression of Model Alignment (With Mark …

**Figure 22.** Deviation heat-map Mapping Absolute Percentage …

**Figure 23.** Linear Regression of Model Alignment (Individual …

**Figure 24.** AI Deviation vs. Human Baseline per Question (P2) …

**Figure 25.** Mean absolute grading error across models for per…

**Figure 26.** Relationship between model solution accuracy on …

**Figure 27.** Relationship between model solution accuracy on …

**Figure 28.** Absolute grading error distributions for each model …

**Figure 29.** Absolute grading error distributions for each model …

**Figure 31.** Percentage grading error for imperfect handwritten …

**Figure 32.** Signed grading error (LLM marks minus human …

Original abstract

The rapid advancement of Large Language Models (LLMs) has introduced new possibilities and challenges in physics education, necessitating rigorous evaluation of their capabilities as both problem solvers and automated assessors. This paper presents the results of three complementary studies that evaluated frontier models released between mid-2024 and late-2025. Models were assessed on their ability to generate accurate, step-by-step solutions to university-level physics problems in Classical Mechanics, Electromagnetism, and Quantum Mechanics, and subsequently on their reliability in grading student solutions against a formal mark scheme. The results indicate a clear trajectory toward benchmark saturation in text-based reasoning, with recent architectures (such as ChatGPT-5.1 and Gemini 3.0 Pro) achieving near-perfect scores. Furthermore, recent advances in native multimodal integration have resolved previous limitations in spatial geometry and topological interpretation, enabling models to accurately process accompanying diagrams. As automated assessors, newer models demonstrated significant improvements in alignment with human grading, heavily mitigating the systemic over-marking observed in earlier iterations. However, while models reliably evaluate fully correct handwritten work, assigning partial credit to flawed or incomplete reasoning remains a persistent challenge. These findings suggest that as of late 2025, LLMs offer viable support for both independent student learning and instructional automation, provided their limitations in evaluating ambiguous reasoning are actively managed.
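
The grading-alignment language in the abstract can be made concrete with a few standard summary statistics. The sketch below is not the authors' analysis code: it assumes alignment is summarised by the signed error (LLM mark minus human mark, matching the convention in the Figure 32 caption), the mean absolute error, and an ordinary least-squares fit of LLM marks against human marks. The function name and the sample marks are hypothetical.

```python
from statistics import mean


def grading_alignment(llm_marks, human_marks):
    """Summarise how closely LLM marks track human marks for the same scripts."""
    if len(llm_marks) != len(human_marks) or len(llm_marks) < 2:
        raise ValueError("need two equal-length lists with at least two marks")

    # Signed error: positive values mean the LLM awarded more than the human.
    signed = [l - h for l, h in zip(llm_marks, human_marks)]
    abs_errors = [abs(e) for e in signed]

    # Ordinary least-squares fit of LLM marks against human marks.
    h_bar, l_bar = mean(human_marks), mean(llm_marks)
    sxx = sum((h - h_bar) ** 2 for h in human_marks)
    if sxx == 0:
        raise ValueError("human marks must not all be identical")
    sxy = sum((h - h_bar) * (l - l_bar) for h, l in zip(human_marks, llm_marks))
    slope = sxy / sxx

    return {
        "mean_signed_error": mean(signed),
        "mean_absolute_error": mean(abs_errors),
        "slope": slope,
        "intercept": l_bar - slope * h_bar,
    }


# Hypothetical marks out of 10 for four scripts (not data from the paper).
human = [2, 5, 7, 10]
llm = [4, 6, 8, 10]
summary = grading_alignment(llm, human)
print(summary)
```

In this toy data the mean signed error is positive and the fitted slope is below 1 with a positive intercept, the pattern one would expect if weaker scripts are over-marked while full-mark scripts are graded correctly. A perfectly aligned grader would give a slope of 1, an intercept of 0, and zero error.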

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds 2025 benchmark numbers on LLMs solving and grading university physics problems but supplies almost no detail on how the test items were chosen or analyzed.

The main thing to know is that this paper tracks performance gains in recent models on classical mechanics, electromagnetism, and quantum mechanics problems, plus their use as graders. Newer systems like ChatGPT-5.1 and Gemini 3.0 Pro reach near-perfect scores on text items and handle diagrams better, while also aligning more closely with human graders than earlier versions did. The authors correctly flag that partial credit for incomplete reasoning is still unreliable.

These are straightforward empirical updates worth recording for anyone following the technology in education settings. The work is useful as a snapshot of where the models stood in late 2025 and for noting the remaining practical limitation around ambiguous student work.

The soft spot is the missing methodological grounding. The abstract and description give outcomes but no sample problems, mark schemes, inter-rater numbers, or error breakdowns. There is also no account of how the problems were selected or whether they represent typical coursework difficulty and ambiguity. The stress-test concern about generalizability therefore holds: without that information it is impossible to tell whether the high scores would appear on a broader or less curated set of items.

The paper is aimed at physics education researchers and ed-tech developers who need current numbers on these tools. It does not introduce new methods or frameworks, so its value is incremental. I would still send it to peer review. The topic is timely for course design decisions, and referees could require the authors to add the concrete examples and selection criteria that are currently absent. That would turn the report into a more usable reference.

Referee Report

2 major / 1 minor

Summary. The manuscript reports results from three complementary empirical studies evaluating frontier LLMs (mid-2024 to late-2025 releases) as problem solvers and automated graders on university-level physics problems drawn from Classical Mechanics, Electromagnetism, and Quantum Mechanics. It claims a clear trajectory toward benchmark saturation, with models such as ChatGPT-5.1 and Gemini 3.0 Pro achieving near-perfect scores on text-based reasoning, resolution of prior multimodal limitations via native diagram processing, and substantially improved alignment with human grading that mitigates earlier over-marking tendencies. Partial credit for flawed reasoning remains challenging, but the work concludes that LLMs now offer viable support for independent student learning and instructional automation when limitations are managed.

Significance. If the performance and alignment claims hold with adequate documentation, the work would document measurable progress in LLM capabilities relevant to physics education research, providing concrete evidence that recent architectures can support both problem-solving practice and assessment tasks at the university level. This could inform the design of hybrid instructional tools while underscoring the need for human oversight on ambiguous cases.

major comments (2)

[Abstract / Methods] The abstract states that recent models achieve near-perfect scores and significant grading alignment improvements, yet supplies no sample problems, mark schemes, inter-rater statistics, error analysis, or quantitative results. This absence is load-bearing for the central empirical claims and prevents verification of the reported trajectory toward benchmark saturation.
[Abstract] The conclusion that LLMs offer viable support for typical physics coursework rests on the assumption that the chosen problems, mark schemes, and student solutions are representative; however, no information is given on selection criteria, difficulty distribution, presence of ambiguity or open-ended elements, or diagram versus text balance, undermining the generalizability asserted in the final paragraph.

minor comments (1)

[Abstract] The abstract refers to 'three complementary studies' without indicating their individual scopes or how they complement one another; a brief overview sentence would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and constructive feedback. We address each major comment below and have revised the manuscript to improve the documentation and transparency of our empirical claims.

Referee: [Abstract / Methods] The abstract states that recent models achieve near-perfect scores and significant grading alignment improvements, yet supplies no sample problems, mark schemes, inter-rater statistics, error analysis, or quantitative results. This absence is load-bearing for the central empirical claims and prevents verification of the reported trajectory toward benchmark saturation.

Authors: The abstract provides a high-level summary of the three studies, while the full manuscript details sample problems, mark schemes, inter-rater agreement statistics, error categorizations, and quantitative scores (including near-perfect performance metrics for models such as ChatGPT-5.1) in the Methods and Results sections. To facilitate verification without relying solely on the body text, we have revised the abstract to incorporate key quantitative results and added explicit cross-references to the supporting tables and analyses in the main text. The Methods section has also been expanded with additional documentation of these elements.

Revision: yes

Referee: [Abstract] The conclusion that LLMs offer viable support for typical physics coursework rests on the assumption that the chosen problems, mark schemes, and student solutions are representative; however, no information is given on selection criteria, difficulty distribution, presence of ambiguity or open-ended elements, or diagram versus text balance, undermining the generalizability asserted in the final paragraph.

Authors: The Methods section outlines the sourcing of problems from standard university curricula across Classical Mechanics, Electromagnetism, and Quantum Mechanics, including a combination of text-based and diagram-accompanied items. However, we acknowledge that explicit details on selection criteria, difficulty distribution, and the presence of open-ended or ambiguous elements were not sufficiently foregrounded. We have revised the Methods section to include this information and added a clarifying statement to the abstract on the representativeness of the problem set and the text-diagram balance.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking study with no derivations or self-referential claims

This is a purely empirical paper reporting results from three studies that benchmark LLMs on university physics problems and grading tasks. It contains no equations, no fitted parameters, no predictions derived from inputs, no uniqueness theorems, and no self-citations used to justify core premises. All claims rest on direct performance measurements against mark schemes rather than any chain that reduces to the paper's own definitions or prior outputs by construction. The generalizability concern raised in the skeptic note is a question of external validity, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study with no mathematical derivations, free parameters, background axioms, or postulated entities.

[1] [1]

top physics student

Stage I: Solution Generation To maintain structural and formatting consistency across all trials, each question was transcribed into LaTeX before entry into the model. This conversion process leveraged the native multimodal capabilities of ChatGPT-4o to extract text and mathematical notation directly from the source PDFs. To simulate real-world usage and ...

work page

[2] [2]

Conversely, for the MM study (Set C), a bespoke marking scheme was developed

Stage II: Human Evaluation For the human evaluation baseline (Stage II) of the PB studies, the established marking rubric from the UCL dataset [30] was utilised. Conversely, for the MM study (Set C), a bespoke marking scheme was developed. This rubric was constructed by delineating generalised solu- tion pathways and apportioning marks across fundamen- ta...

work page

[3] [3]

The models were constrained to assign integer scores and pro- vide generalised feedback

Stage III: AI Evaluation The evaluation prompt adopted a revised expert per- sona, positioning the models as ”physics professors” to enforce a rigorous, pedagogical evaluative standard. The models were constrained to assign integer scores and pro- vide generalised feedback. Furthermore, the input con- text strictly delimited the problem statements, candid...

work page

[4] [4]

language

PB1 Given the highly competitive nature of the LLM land- scape during the study period, frontier models were fre- quently released in clustered cycles to maintain market parity. Consequently, the evaluated models have been categorised into three chronological generations: •Generation 1(May 2024): ChatGPT-4o and Gemini 1.5 Pro. •Generation 2(December 2024)...

work page 2024

[5]

PB2

Before detailing the results for recent architectures, a methodological shift must be noted. All subsequent non-multimodal evaluations utilised the condensed Set B dataset. Again, to maintain cross-study continuity, all questions retain their original Set A numerical designations (as mapped in Table III). Furthermore, to account for how this dataset...

[6]

MM

The preceding phases established that while Generation 5 models possess robust text-based reasoning, earlier generations suffered from a disconnect between syntactic processing and visual grounding. To test whether modern architectures have bridged this gap, the models were evaluated on Set C: a dedicated, natively multimodal problem set comprised ...

[7]

PB1

After having probed the LLM’s ability to solve problems, this chapter evaluates the reliability of using LLMs as markers. Figures 20 and 21 present the relation between grades awarded by the six LLMs and humans across the three core topics, comparing the cases where the LLMs were either provided with, or deprived of, the mark schemes. To mirror th...

[8]

PB2

Having established that Gen 4 and Gen 5 models possess the foundational reasoning to solve the benchmark entirely, the final stage of analysis evaluates their utility as automated assessors. To conduct this, the methodology was refined: rather than batching the three generated ChatGPT-4o solutions together within a single prompt, as was done in P...

[9]

MM

Handwritten Grading

Across all models, grading accuracy was consistently higher for perfect handwritten solutions than for imperfect ones. Mean absolute error (MAE) analysis shows near-zero deviation from human grading for perfect scripts for all models, indicating reliable recognition of canonical solution structures and correct reasoning patterns...
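The MAE comparison mentioned above can be sketched as follows. This is a minimal illustration assuming each script has one human mark and one model mark; the function name and the example marks are hypothetical and are not taken from the study's data.

```python
def mean_absolute_error(model_marks, human_marks):
    """Average absolute difference between model-awarded and human-awarded marks."""
    if len(model_marks) != len(human_marks):
        raise ValueError("mark lists must have the same length")
    if not model_marks:
        raise ValueError("at least one pair of marks is required")
    total = sum(abs(m - h) for m, h in zip(model_marks, human_marks))
    return total / len(model_marks)


# Hypothetical marks for illustration only (not results from the paper).
perfect_human, perfect_model = [10, 10, 10], [10, 10, 10]
imperfect_human, imperfect_model = [6, 4, 7], [8, 5, 7]

print(mean_absolute_error(perfect_model, perfect_human))      # 0.0
print(mean_absolute_error(imperfect_model, imperfect_human))  # 1.0
```

An MAE of zero means the model reproduced every human mark exactly. Because the errors are taken in absolute value, MAE alone does not show whether a model over-marks or under-marks; a signed mean difference would be needed for that.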


[10] T. Brown et al., Language models are few-shot learners, in Advances in Neural Information Processing Systems, edited by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Curran Associates, Inc., 2020) pp. 1877–1901

[11] J. W. Rae et al., Scaling language models: Methods, analysis & insights from training Gopher (2021), arXiv:2112.11446 [cs.CL]

[12] L. Ouyang et al., Training language models to follow instructions with human feedback, in Advances in Neural Information Processing Systems, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Curran Associates, Inc., 2022) pp. 27730–27744

[13] O. Zawacki-Richter, V. I. Marín, M. Bond, and F. Gouverneur, Systematic review of research on artificial intelligence applications in higher education – where are the educators?, International Journal of Educational Technology in Higher Education 16, 10.1186/s41239-019-0171-0 (2019)

[14] S. Steenbergen-Hu and H. Cooper, A meta-analysis of the effectiveness of intelligent tutoring systems on college students’ academic learning, Journal of Educational Psychology 106, 331 (2014)

[15] K. Cobbe et al., Training verifiers to solve math word problems (2021), arXiv:2110.14168

[16] Z. Ji et al., Survey of hallucination in natural language generation, ACM Computing Surveys 10.1145/3571730 (2022)

[17] P. Lewis et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Neural Information Processing Systems abs/2005.11401, 9459 (2020)

[18] S. Wang et al., Large language models for education: A survey and outlook, IEEE Signal Processing Magazine 42, 51 (2025)

[19] J. Wei et al., Emergent abilities of large language models (2022), arXiv:2206.07682 [cs.CL]

[20] T. Webb, K. J. Holyoak, and H. Lu, Emergent analogical reasoning in large language models, Nature Human Behaviour 7, 1526 (2023)

[21] W. Yeadon, O. Inyang, A. Mizouri, A. Peach, and C. P. Testrow, The death of the short-form physics essay in the coming AI revolution, Physics Education 58, 10.1088/1361-6552/acc5cf (2022)

[22] P. Scarfe, K. Watcham, A. Clarke, and E. Roesch, A real-world test of artificial intelligence infiltration of a university examinations system: A ‘Turing test’ case study, PLoS One 19, e0305354 (2024)

[23] Artificial Analysis, GPQA diamond benchmark leaderboard (2026), accessed: Mar. 30, 2026

[24] S. Jonsson, Rapportsläpp: Back2school 2023 (2025), accessed: Mar. 25, 2025

[25] B. Gregorcic and A.-M. Pendrill, ChatGPT and the frustrated Socrates, Physics Education 58, 10.1088/1361-6552/acc299 (2023)

[26] G. Polverini and B. Gregorcic, How understanding large language models can inform the use of ChatGPT in physics education, European Journal of Physics 45, 10.1088/1361-6404/ad1420 (2023)

[27] The Russell Group, Russell group, ‘new principles on use of AI in education’

[28] H. Delić and S. Bećirović, Socratic method as an approach to teaching, European Researcher. Series A (2016)

[29] D. Halpern, Social Capital (Polity Press, Oxford, England, 2005)

[30] J. Hamari, J. Koivisto, and H. Sarsa, Does gamification work? – a literature review of empirical studies on gamification, in 2014 47th Hawaii International Conference on System Sciences (IEEE, 2014) pp. 3025–3034

[31] B. Paris, Instructors’ perspectives of challenges and barriers to providing effective feedback, Teaching & Learning Inquiry 10, 10.20343/teachlearninqu.10.3 (2022)

[32] Z. Chu et al., LLM agents for education: Advances and applications, in Findings of the Association for Computational Linguistics: EMNLP 2025, edited by C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Association for Computational Linguistics, Stroudsburg, PA, USA, 2025) pp. 13782–13810

[33] H. Shearing, Teachers can use AI to save time on marking, new guidance says (2025), accessed: Oct. 27, 2025

[34] Department for Science and Technology, Teachers to get more trustworthy AI tech, helping them mark homework and save time (2025), accessed: Oct. 27, 2025

[35] GOV.UK, Generative AI: product safety standards (2026), accessed: Apr. 05, 2026

[36] OpenAI, ChatGPT, https://chatgpt.com/ (2025), accessed: Mar. 27, 2025

[37] Google, Gemini, https://gemini.google.com/app (2025), accessed: Mar. 27, 2025

[38] DeepSeek, DeepSeek, https://chat.deepseek.com/ (2025), accessed: Mar. 27, 2025

[39] R. Mok et al., Using large language models for grading in education: an applied test for physics, Physics Education 60, 035006 (2025)

[40] J. L. Donaldson, A. Nawaz, D. Constantinos, and A. Lim, Using LLMs for physics education: Datasets and evaluation figures (2026)

[41] B. Xu, A. Yang, J. Lin, Q. Wang, C. Zhou, Y. Zhang, and Z. Mao, ExpertPrompting: Instructing large language models to be distinguished experts, arXiv preprint 10.48550/arXiv.2305.14688 (2023), arXiv:2305.14688 [cs.CL]

[42] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22 No. 1800 (Curran Associates Inc., Red Hook, NY, USA, 2022) pp. 24824–24837

[43] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, Self-consistency improves chain of thought reasoning in language models, arXiv preprint arXiv:2203.11171 (2022), arXiv:2203.11171 [cs.CL]

[44] L. Zheng et al., Judging LLM-as-a-judge with MT-bench and chatbot arena, in Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23 (Curran Associates Inc., Red Hook, NY, USA, 2023) pp. 46595–46623

[45] N. F. Liu et al., Lost in the middle: How language models use long contexts, Transactions of the Association for Computational Linguistics 12, 157 (2024)

[46] P. Song, P. Han, and N. Goodman, Large language model reasoning failures (2026), arXiv:2602.06176 [cs.AI]

[47] I. Khalid, A. M. Nourollah, and S. Schockaert, Large language and reasoning models are shallow disjunctive reasoners, in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Association for Computational Linguistics, Stroudsburg, ...

[48] Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y. Liu, D. Xiang, G. Wetzstein, and T.-Y. Lin, CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

Appendix A: Referenced Questions

FIG. A1. Classical M...