
arxiv: 2605.23660 · v1 · pith:XOLQAKSCnew · submitted 2026-05-22 · ⚛️ physics.ed-ph

Using Large Language Models in Physics Education

Pith reviewed 2026-05-25 02:22 UTC · model grok-4.3

classification ⚛️ physics.ed-ph
keywords large language models · physics education · automated assessment · problem solving · grading alignment · multimodal models · classical mechanics · quantum mechanics

The pith

Frontier large language models achieve near-perfect scores on university physics problems and show improved alignment with human grading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports results from three studies testing large language models released between mid-2024 and late-2025 on university-level physics tasks. The models generate accurate step-by-step solutions in classical mechanics, electromagnetism, and quantum mechanics, and they grade student work against formal mark schemes. Recent versions reach near-perfect text-based reasoning scores, handle diagrams through multimodal advances, and align more closely with human markers than earlier models. Partial credit for flawed or incomplete reasoning remains difficult. The findings indicate that these models can support student learning and instructional automation when their remaining limitations are addressed.

Core claim

Recent architectures such as ChatGPT-5.1 and Gemini 3.0 Pro achieve near-perfect scores on text-based reasoning and demonstrate significant improvements in alignment with human grading, heavily mitigating the systemic over-marking observed in earlier iterations, while native multimodal integration resolves previous limitations in spatial geometry and topological interpretation.

What carries the argument

Three complementary studies that test LLMs first on generating accurate solutions to physics problems and then on reliability as automated graders against formal mark schemes.

If this is right

  • LLMs can provide viable support for independent student learning in physics courses.
  • Instructional automation for grading becomes more feasible with newer model versions.
  • Limitations in assigning partial credit to ambiguous or incomplete reasoning must be actively managed.
  • Multimodal capabilities now allow reliable interpretation of diagrams accompanying physics problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Physics departments could pilot LLM-assisted homework systems that flag cases needing human review for partial credit.
  • Model training focused on partial-credit logic could further reduce the remaining grading gaps.
  • Parallel evaluations in other STEM subjects would test whether the observed trajectory holds beyond physics.
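
The first suggestion above, flagging partial-credit cases for human review, can be sketched in a few lines of Python. This is an illustrative sketch only: the `GradedScript` record, its field names, and the "auto-accept only full marks" policy are assumptions made for the example, not anything described in the paper.

```python
from dataclasses import dataclass


@dataclass
class GradedScript:
    """One LLM-graded answer; every field name here is hypothetical."""
    script_id: str
    llm_mark: int   # integer mark assigned by the model
    max_mark: int   # marks available for the question


def needs_human_review(script: GradedScript) -> bool:
    """Assumed policy: auto-accept only full marks, review everything else."""
    return script.llm_mark != script.max_mark


def triage(scripts: list[GradedScript]) -> tuple[list[str], list[str]]:
    """Split script ids into (auto_accepted, needs_review)."""
    auto_accepted, needs_review = [], []
    for script in scripts:
        target = needs_review if needs_human_review(script) else auto_accepted
        target.append(script.script_id)
    return auto_accepted, needs_review


if __name__ == "__main__":
    # Made-up marks, purely for illustration.
    batch = [
        GradedScript("s1", 10, 10),
        GradedScript("s2", 6, 10),
        GradedScript("s3", 0, 10),
    ]
    print(triage(batch))
```

A policy of this kind cannot catch a flawed script that the model wrongly awards full marks, so spot checks of the auto-accepted pile would still be needed.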

Load-bearing premise

The university-level problems, mark schemes, and student solutions used in the three studies are representative of typical physics coursework and generalize to other problems and cohorts.

What would settle it

A follow-up evaluation in which the same models produce substantially lower scores or show renewed over-marking on a new collection of physics problems and student solutions not included in the original studies.
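
One way to make "renewed over-marking" measurable is the signed grading error (LLM marks minus human marks), the convention named in the caption of Figure 32. The Python sketch below is a minimal illustration under that convention; the function names and the example marks are hypothetical and are not taken from the paper's data.

```python
def signed_errors(llm_marks, human_marks, max_marks):
    """Per-script error (LLM minus human) as a percentage of available marks."""
    if not (len(llm_marks) == len(human_marks) == len(max_marks)):
        raise ValueError("all three lists must have the same length")
    return [
        100.0 * (llm - human) / available
        for llm, human, available in zip(llm_marks, human_marks, max_marks)
    ]


def summarise(errors):
    """Mean signed error (direction of bias) and mean absolute error (size)."""
    if not errors:
        raise ValueError("need at least one graded script")
    n = len(errors)
    return {
        "mean_signed_error": sum(errors) / n,
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
    }


if __name__ == "__main__":
    # Made-up marks for three scripts, each out of 10.
    errors = signed_errors([8, 10, 7], [6, 10, 5], [10, 10, 10])
    print(summarise(errors))
```

A clearly positive mean signed error on a held-out problem set would indicate over-marking. A mean signed error near zero combined with a large mean absolute error would instead indicate inconsistent grading without a systematic bias in either direction.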

Figures

Figures reproduced from arXiv: 2605.23660 by Aliya Navaz, Alysta Lim, Jonah R. Donaldson, Konstantinos Doran, Mario Campanelli.

Captions are truncated as extracted.

Figure 1. Template prompt inputted into the Gemini web in ...
Figure 2. Mean percentage scores achieved by each model in ...
Figure 3. Mean percentage scores achieved by each model in the ...
Figure 5. Heat map showing model performance in Quantum ...
Figure 6. Heat map showing model performance in Classical ...
Figure 7. Heat map showing model performance in Electro...
Figure 9. Averaged percentage scores for Classical Mechanics ...
Figure 8. Averaged percentage scores for Quantum Mechanics ...
Figure 10. Averaged percentage scores for Electromagnetism ...
Figure 11. Heat map of Quantum Mechanics performance ...
Figure 12. Heat map of Classical Mechanics performance show...
Figure 13. Heat map of Electromagnetism performance high...
Figure 14. Multimodal performance in Electromagnetism: ...
Figure 15. Multimodal performance in Electromagnetism: ...
Figure 18. Multimodal performance in Quantum Mechanics: ...
Figure 19. Multimodal performance in Electromagnetism: ...
Figure 20. Linear Regression of Model Alignment (With Mark ...
Figure 22. Deviation heat-map Mapping Absolute Percentage ...
Figure 23. Linear Regression of Model Alignment (Individual ...
Figure 24. AI Deviation vs. Human Baseline per Question (P2) ...
Figure 25. Mean absolute grading error across models for per...
Figure 26. Relationship between model solution accuracy on ...
Figure 27. Relationship between model solution accuracy on ...
Figure 28. Absolute grading error distributions for each model ...
Figure 29. Absolute grading error distributions for each model ...
Figure 32. Signed grading error (LLM marks minus human ...
Figure 31. Percentage grading error for imperfect handwritten ...
Original abstract

The rapid advancement of Large Language Models (LLMs) has introduced new possibilities and challenges in physics education, necessitating rigorous evaluation of their capabilities as both problem solvers and automated assessors. This paper presents the results of three complementary studies that evaluated frontier models released between mid-2024 and late-2025. Models were assessed on their ability to generate accurate, step-by-step solutions to university-level physics problems in Classical Mechanics, Electromagnetism, and Quantum Mechanics, and subsequently on their reliability in grading student solutions against a formal mark scheme. The results indicate a clear trajectory toward benchmark saturation in text-based reasoning, with recent architectures (such as ChatGPT-5.1 and Gemini 3.0 Pro) achieving near-perfect scores. Furthermore, recent advances in native multimodal integration have resolved previous limitations in spatial geometry and topological interpretation, enabling models to accurately process accompanying diagrams. As automated assessors, newer models demonstrated significant improvements in alignment with human grading, heavily mitigating the systemic over-marking observed in earlier iterations. However, while models reliably evaluate fully correct handwritten work, assigning partial credit to flawed or incomplete reasoning remains a persistent challenge. These findings suggest that as of late 2025, LLMs offer viable support for both independent student learning and instructional automation, provided their limitations in evaluating ambiguous reasoning are actively managed.
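
The figure captions on this page refer to a linear regression of model alignment, but the fitting procedure itself is not described here. As a generic illustration of how alignment between LLM and human marks can be quantified, the following Python sketch fits an ordinary least-squares line of LLM marks against human marks. The function name and the example marks are hypothetical.

```python
def fit_alignment_line(human_marks, llm_marks):
    """Ordinary least-squares fit: llm_mark ~ slope * human_mark + intercept."""
    if len(human_marks) != len(llm_marks):
        raise ValueError("mark lists must have the same length")
    if len(human_marks) < 2:
        raise ValueError("need at least two scripts to fit a line")
    n = len(human_marks)
    mean_h = sum(human_marks) / n
    mean_l = sum(llm_marks) / n
    # Sum of squared deviations of the human marks, and the cross term.
    s_hh = sum((h - mean_h) ** 2 for h in human_marks)
    s_hl = sum((h - mean_h) * (l - mean_l)
               for h, l in zip(human_marks, llm_marks))
    if s_hh == 0:
        raise ValueError("human marks are all identical; slope is undefined")
    slope = s_hl / s_hh
    intercept = mean_l - slope * mean_h
    return slope, intercept


if __name__ == "__main__":
    # Made-up marks out of 10, purely for illustration.
    slope, intercept = fit_alignment_line([0, 5, 10], [2, 6, 10])
    print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Perfect agreement corresponds to a slope of 1 and an intercept of 0. A positive intercept with a slope below 1 means the fitted line sits above the line of equality for low-scoring scripts, which is one way over-marking of weak work would show up.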

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript reports results from three complementary empirical studies evaluating frontier LLMs (mid-2024 to late-2025 releases) as problem solvers and automated graders on university-level physics problems drawn from Classical Mechanics, Electromagnetism, and Quantum Mechanics. It claims a clear trajectory toward benchmark saturation, with models such as ChatGPT-5.1 and Gemini 3.0 Pro achieving near-perfect scores on text-based reasoning, resolution of prior multimodal limitations via native diagram processing, and substantially improved alignment with human grading that mitigates earlier over-marking tendencies. Partial credit for flawed reasoning remains challenging, but the work concludes that LLMs now offer viable support for independent student learning and instructional automation when limitations are managed.

Significance. If the performance and alignment claims hold with adequate documentation, the work would document measurable progress in LLM capabilities relevant to physics education research, providing concrete evidence that recent architectures can support both problem-solving practice and assessment tasks at the university level. This could inform the design of hybrid instructional tools while underscoring the need for human oversight on ambiguous cases.

major comments (2)
  1. [Abstract / Methods] Abstract and Methods: The abstract states that recent models achieve near-perfect scores and significant grading alignment improvements, yet supplies no sample problems, mark schemes, inter-rater statistics, error analysis, or quantitative results. This absence is load-bearing for the central empirical claims and prevents verification of the reported trajectory toward benchmark saturation.
  2. [Abstract] Abstract: The conclusion that LLMs offer viable support for typical physics coursework rests on the assumption that the chosen problems, mark schemes, and student solutions are representative; however, no information is given on selection criteria, difficulty distribution, presence of ambiguity or open-ended elements, or diagram versus text balance, undermining the generalizability asserted in the final paragraph.
minor comments (1)
  1. [Abstract] The abstract refers to 'three complementary studies' without indicating their individual scopes or how they complement one another; a brief overview sentence would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and constructive feedback. We address each major comment below and have revised the manuscript to improve the documentation and transparency of our empirical claims.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods: The abstract states that recent models achieve near-perfect scores and significant grading alignment improvements, yet supplies no sample problems, mark schemes, inter-rater statistics, error analysis, or quantitative results. This absence is load-bearing for the central empirical claims and prevents verification of the reported trajectory toward benchmark saturation.

    Authors: The abstract provides a high-level summary of the three studies, while the full manuscript details sample problems, mark schemes, inter-rater agreement statistics, error categorizations, and quantitative scores (including near-perfect performance metrics for models such as ChatGPT-5.1) in the Methods and Results sections. To facilitate verification without relying solely on the body text, we have revised the abstract to incorporate key quantitative results and added explicit cross-references to the supporting tables and analyses in the main text. The Methods section has also been expanded with additional documentation of these elements.

    revision: yes

  2. Referee: [Abstract] Abstract: The conclusion that LLMs offer viable support for typical physics coursework rests on the assumption that the chosen problems, mark schemes, and student solutions are representative; however, no information is given on selection criteria, difficulty distribution, presence of ambiguity or open-ended elements, or diagram versus text balance, undermining the generalizability asserted in the final paragraph.

    Authors: The Methods section outlines the sourcing of problems from standard university curricula across Classical Mechanics, Electromagnetism, and Quantum Mechanics, including a combination of text-based and diagram-accompanied items. However, we acknowledge that explicit details on selection criteria, difficulty distribution, and the presence of open-ended or ambiguous elements were not sufficiently foregrounded. We have revised the Methods section to include this information and added a clarifying statement to the abstract on the representativeness of the problem set and the text-diagram balance.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking study with no derivations or self-referential claims

Full rationale

This is a purely empirical paper reporting results from three studies that benchmark LLMs on university physics problems and grading tasks. It contains no equations, no fitted parameters, no predictions derived from inputs, no uniqueness theorems, and no self-citations used to justify core premises. All claims rest on direct performance measurements against mark schemes rather than any chain that reduces to the paper's own definitions or prior outputs by construction. The generalizability concern raised in the referee report is a question of external validity, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study with no mathematical derivations, free parameters, background axioms, or postulated entities.



Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 5 internal anchors

  1. [1] Stage I: Solution Generation. To maintain structural and formatting consistency across all trials, each question was transcribed into LaTeX before entry into the model. This conversion process leveraged the native multimodal capabilities of ChatGPT-4o to extract text and mathematical notation directly from the source PDFs. To simulate real-world usage and ...

  2. [2] Stage II: Human Evaluation. For the human evaluation baseline (Stage II) of the PB studies, the established marking rubric from the UCL dataset [30] was utilised. Conversely, for the MM study (Set C), a bespoke marking scheme was developed. This rubric was constructed by delineating generalised solution pathways and apportioning marks across fundamenta...

  3. [3] Stage III: AI Evaluation. The evaluation prompt adopted a revised expert persona, positioning the models as "physics professors" to enforce a rigorous, pedagogical evaluative standard. The models were constrained to assign integer scores and provide generalised feedback. Furthermore, the input context strictly delimited the problem statements, candid...

  4. [4] PB1. Given the highly competitive nature of the LLM landscape during the study period, frontier models were frequently released in clustered cycles to maintain market parity. Consequently, the evaluated models have been categorised into three chronological generations: • Generation 1 (May 2024): ChatGPT-4o and Gemini 1.5 Pro. • Generation 2 (December 2024)...

  5. [5] PB2. Before detailing the results for recent architectures, a methodological shift must be noted. All subsequent non-multimodal evaluations utilised the condensed Set B dataset. Again, to maintain cross-study continuity, all questions retain their original Set A numerical designations (as mapped in Table III). Furthermore, to account for how this dataset...

  6. [6] MM. The preceding phases established that while Generation 5 models possess robust text-based reasoning, earlier generations suffered from a disconnect between syntactic processing and visual grounding. To test whether modern architectures have bridged this gap, the models were evaluated on Set C: a dedicated, natively multimodal problem set comprised ...

  7. [7] PB1. After having probed the LLM's ability to solve problems, this chapter evaluates the reliability of using LLMs as markers. Figures 20 and 21 present the relation between grades awarded by the six LLMs and humans across the three core topics, comparing the cases where the LLMs were either provided with, or deprived of, the mark schemes. To mirror th...

  8. [8] PB2. Having established that Gen 4 and Gen 5 models possess the foundational reasoning to solve the benchmark entirely, the final stage of analysis evaluates their utility as automated assessors. To conduct this, the methodology was refined: rather than batching the three generated ChatGPT-4o solutions together within a single prompt, as was done in P...

  9. [9] MM Handwritten Grading. Across all models, grading accuracy was consistently higher for perfect handwritten solutions than for imperfect ones. Mean absolute error (MAE) analysis shows near-zero deviation from human grading for perfect scripts for all models, indicating reliable recognition of canonical solution structures and correct reasoning patterns...

  10. [10] T. Brown et al., Language models are few-shot learners, in Advances in Neural Information Processing Systems, edited by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Curran Associates, Inc., 2020) pp. 1877–1901

  11. [11] J. W. Rae et al., Scaling language models: Methods, analysis & insights from training Gopher (2021), arXiv:2112.11446 [cs.CL]

  12. [12] L. Ouyang et al., Training language models to follow instructions with human feedback, in Advances in Neural Information Processing Systems, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Curran Associates, Inc., 2022) pp. 27730–27744

  13. [13] O. Zawacki-Richter, V. I. Marín, M. Bond, and F. Gouverneur, Systematic review of research on artificial intelligence applications in higher education – where are the educators?, International Journal of Educational Technology in Higher Education 16, 10.1186/s41239-019-0171-0 (2019)

  14. [14] S. Steenbergen-Hu and H. Cooper, A meta-analysis of the effectiveness of intelligent tutoring systems on college students' academic learning, Journal of Educational Psychology 106, 331 (2014)

  15. [15] K. Cobbe et al., Training verifiers to solve math word problems (2021), arXiv:2110.14168

  16. [16] Z. Ji et al., Survey of hallucination in natural language generation, ACM Computing Surveys 10.1145/3571730 (2022)

  17. [17] P. Lewis et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Neural Information Processing Systems abs/2005.11401, 9459 (2020)

  18. [18] S. Wang et al., Large language models for education: A survey and outlook, IEEE Signal Processing Magazine 42, 51 (2025)

  19. [19] J. Wei et al., Emergent abilities of large language models (2022), arXiv:2206.07682 [cs.CL]

  20. [20] T. Webb, K. J. Holyoak, and H. Lu, Emergent analogical reasoning in large language models, Nature Human Behaviour 7, 1526 (2023)

  21. [21] W. Yeadon, O. Inyang, A. Mizouri, A. Peach, and C. P. Testrow, The death of the short-form physics essay in the coming AI revolution, Physics Education 58, 10.1088/1361-6552/acc5cf (2022)

  22. [22] P. Scarfe, K. Watcham, A. Clarke, and E. Roesch, A real-world test of artificial intelligence infiltration of a university examinations system: A 'Turing test' case study, PLoS One 19, e0305354 (2024)

  23. [23] Artificial Analysis, GPQA diamond benchmark leaderboard (2026), accessed: Mar. 30, 2026

  24. [24] S. Jonsson, Rapportsläpp: Back2school 2023 (2025), accessed: Mar. 25, 2025

  25. [25] B. Gregorcic and A.-M. Pendrill, ChatGPT and the frustrated Socrates, Physics Education 58, 10.1088/1361-6552/acc299 (2023)

  26. [26] G. Polverini and B. Gregorcic, How understanding large language models can inform the use of ChatGPT in physics education, European Journal of Physics 45, 10.1088/1361-6404/ad1420 (2023)

  27. [27] The Russell Group, Russell group, 'new principles on use of AI in education'

  28. [28] H. Delić and S. Bećirović, Socratic method as an approach to teaching, European Researcher. Series A (2016)

  29. [29] D. Halpern, Social Capital (Polity Press, Oxford, England, 2005)

  30. [30] J. Hamari, J. Koivisto, and H. Sarsa, Does gamification work? – a literature review of empirical studies on gamification, in 2014 47th Hawaii International Conference on System Sciences (IEEE, 2014) pp. 3025–3034

  31. [31] B. Paris, Instructors' perspectives of challenges and barriers to providing effective feedback, Teaching & Learning Inquiry 10, 10.20343/teachlearninqu.10.3 (2022)

  32. [32] Z. Chu et al., LLM agents for education: Advances and applications, in Findings of the Association for Computational Linguistics: EMNLP 2025, edited by C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Association for Computational Linguistics, Stroudsburg, PA, USA, 2025) pp. 13782–13810

  33. [33] H. Shearing, Teachers can use AI to save time on marking, new guidance says (2025), accessed: Oct. 27, 2025

  34. [34] Department for Science and Technology, Teachers to get more trustworthy AI tech, helping them mark homework and save time (2025), accessed: Oct. 27, 2025

  35. [35] GOV.UK, Generative AI: product safety standards (2026), accessed: Apr. 05, 2026

  36. [36] OpenAI, ChatGPT, https://chatgpt.com/ (2025), accessed: Mar. 27, 2025

  37. [37] Google, Gemini, https://gemini.google.com/app (2025), accessed: Mar. 27, 2025

  38. [38] DeepSeek, DeepSeek, https://chat.deepseek.com/ (2025), accessed: Mar. 27, 2025

  39. [39] R. Mok et al., Using large language models for grading in education: an applied test for physics, Physics Education 60, 035006 (2025)

  40. [40] J. L. Donaldson, A. Nawaz, D. Constantinos, and A. Lim, Using LLMs for physics education: Datasets and evaluation figures (2026)

  41. [41] B. Xu, A. Yang, J. Lin, Q. Wang, C. Zhou, Y. Zhang, and Z. Mao, ExpertPrompting: Instructing large language models to be distinguished experts, arXiv preprint arXiv:2305.14688, 10.48550/arXiv.2305.14688 (2023), arXiv:2305.14688 [cs.CL]

  42. [42] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22 No. 1800 (Curran Associates Inc., Red Hook, NY, USA, 2022) pp. 24824–24837

  43. [43] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, Self-consistency improves chain of thought reasoning in language models, arXiv preprint arXiv:2203.11171 (2022), arXiv:2203.11171 [cs.CL]

  44. [44] L. Zheng et al., Judging LLM-as-a-judge with MT-bench and chatbot arena, in Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23 (Curran Associates Inc., Red Hook, NY, USA, 2023) pp. 46595–46623

  45. [45] N. F. Liu et al., Lost in the middle: How language models use long contexts, Transactions of the Association for Computational Linguistics 12, 157 (2024)

  46. [46] P. Song, P. Han, and N. Goodman, Large language model reasoning failures (2026), arXiv:2602.06176 [cs.AI]

  47. [47] I. Khalid, A. M. Nourollah, and S. Schockaert, Large language and reasoning models are shallow disjunctive reasoners, in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), edited by W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Association for Computational Linguistics, Stroudsburg, ...

  48. [48] Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y. Liu, D. Xiang, G. Wetzstein, and T.-Y. Lin, CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)