pith. sign in

arxiv: 2607.00276 · v1 · pith:FCRSSLX2new · submitted 2026-06-30 · 💻 cs.LG · cs.AI· cs.CL

Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

Pith reviewed 2026-07-02 19:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords LLM evaluationphysics reasoninginductive reasoningcounterfactual worldsdiagnostic protocolmodel self-reviewbenchmarking
0
0 comments X

The pith

Frontier LLMs pass at most 6 out of 15 trials when forced to reason inside three unfamiliar physics frameworks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a four-stage diagnostic that requires an LLM to induce rules from an unfamiliar physics world, formulate equations, make predictions, and review its own work. The protocol uses locked pre-registrations, fresh sessions, and dual-LLM judging to reduce contamination from training data. It is applied to three parallel worlds: a counterfactual F equals mv mechanics, historical Aristotelian mechanics, and a four-domain Decay World. Across three frontier models the composite pass rates are 6/15, 6/15, and 0/15, with a consistent pattern that models rarely err on the direction of change yet frequently compute the wrong quantitative ratio by reverting to standard physics. The work also documents that LLM judges fail to transfer across frameworks and that models' self-review misses errors in at least two-thirds of trials that contain them.

Core claim

The four-stage diagnostic applied to the F=mv world, Aristotelian mechanics, and Decay World shows that Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro achieve composite pass rates of only 6/15, 6/15, and 0/15 respectively when content and structural correctness are both required (content axis only for Decay World). Models exhibit a qualitative-versus-quantitative asymmetry in Decay World, correctly predicting direction of change yet slipping to standard-physics ratios. LLM-judge reliability does not transfer across frameworks, and Stage 4 self-review wrongly reports no error in at least two-thirds of trials that contain one.

What carries the argument

Four-stage diagnostic protocol that evaluates induction, formulation, prediction, and review inside locked parallel physics worlds using pre-registrations, fresh sessions, and dual-LLM judging.

If this is right

  • Accuracy-only physics benchmarks cannot distinguish genuine reasoning from pattern recall.
  • Models display a systematic qualitative-versus-quantitative asymmetry when operating inside new frameworks.
  • LLM-based judges for evaluation do not generalize across different physics worlds.
  • Model self-review fails to detect earlier errors in the majority of trials that contain them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The diagnostic could be reused to test inductive reasoning in domains other than physics without changing the protocol structure.
  • Persistent reversion to standard-physics ratios suggests that current training data distributions dominate even when explicit instructions require a new framework.
  • If the asymmetry holds, training objectives that reward only directional correctness would leave quantitative prediction errors unaddressed.

Load-bearing premise

The constructed parallel worlds and four-stage protocol with locked pre-registrations and dual-LLM judging successfully isolate genuine inductive reasoning from pattern recall or contamination.

What would settle it

A fourth novel physics world, constructed identically to the three tested worlds, on which the same models and protocol produce pass rates above 12/15.

Figures

Figures reproduced from arXiv: 2607.00276 by Dong Zhang.

Figure 1
Figure 1. Figure 1: Four-stage protocol and evaluation pipeline. A trial runs one tested model through Stage 1 (induc [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Where the 15 trials end up in each framework: composite PASS, or the stage at which the trial [PITH_FULL_IMAGE:figures/full_fig_p017_2.png] view at source ↗
read the original abstract

Current large-language-model (LLM) physics benchmarks are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem patterns and reveals little about where a model's reasoning breaks down. We introduce an auditable four-stage diagnostic that evaluates whether an LLM can reason inside an unfamiliar physics framework through induction, formulation, prediction, and review. The diagnostic combines locked pre-registrations, fresh sessions between stages, dual-LLM judging, and a human-audit pathway, and we apply it to three parallel physics worlds: a single-equation counterfactual world ($F=mv$), a historical framework (Aristotelian mechanics), and a four-domain counterfactual world (Decay World). Across Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro, the three worlds yield composite PASS rates are 6/15, 6/15, and 0/15 respectively (content $\land$ structural for $F=mv$ and Aristotelian, content axis only for Decay World where the structural axis is out of scope). The most pointed empirical pattern is a qualitative-versus-quantitative asymmetry: in Decay World, models almost never predict the wrong direction of change, but frequently compute the wrong ratio by slipping back to standard-physics relations. The protocol also surfaces two methodology findings: LLM-judge reliability does not transfer across frameworks, and Stage 4 self-review is weak in every framework, with the model's own review wrongly reporting no earlier error in at least two-thirds of the trials that actually contained one. We release the full prompts, responses, verdicts, and audit records.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces an auditable four-stage diagnostic protocol (locked pre-registrations, fresh sessions, dual-LLM judging, human-audit pathway) to test whether frontier LLMs can perform inductive reasoning inside three unfamiliar parallel physics worlds: a single-equation counterfactual (F=mv), Aristotelian mechanics, and a four-domain Decay World. It reports composite PASS rates (content ∧ structural for the first two worlds; content only for Decay World) of 6/15, 6/15, and 0/15 across Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro, documents a qualitative-versus-quantitative asymmetry in Decay World, and releases all prompts, responses, verdicts, and audit records. Two methodology findings are also reported: LLM-judge reliability does not transfer across frameworks, and Stage 4 self-review is weak.

Significance. If the protocol successfully isolates genuine inductive reasoning from pattern recall or contamination, the work supplies a reproducible, auditable evaluation framework that goes beyond accuracy-based physics benchmarks and surfaces concrete failure modes (e.g., reversion to standard-physics ratios). The open release of full materials is a clear strength that supports verifiability and future replication.

major comments (3)
  1. [Abstract and §2] Abstract and §2 (The Diagnostic Protocol): the headline composite PASS rates (6/15, 6/15, 0/15) are presented as evidence that the models cannot reliably induce inside the counterfactual frameworks. This interpretation is load-bearing on the claim that locked pre-registrations, fresh sessions, and dual-LLM judging successfully prevent recall of standard physics or cross-stage leakage; yet the text supplies no explicit mechanism (e.g., prompt constraints or exclusion rules) that would block re-interpretation of F=mv as a limiting case of F=ma or keep Decay World quantities orthogonal to training data.
  2. [§4] §4 (Methodology Findings) and the judging procedure: the paper reports that LLM-judge reliability does not transfer across frameworks, yet the same dual-LLM judges are used to produce the composite scores for all three worlds. This directly affects the validity of the 0/15 rate in Decay World (where only the content axis is scored) and the cross-world comparison of the 6/15 rates.
  3. [§3] §3 (The Three Parallel Worlds) and the Decay World definition: the qualitative-versus-quantitative asymmetry is a central empirical pattern, but the four-domain structure and the precise definition of “content axis only” scoring are not accompanied by an explicit validation that the structural axis is truly out of scope rather than simply unmeasurable; this choice influences the 0/15 result and requires a concrete example of a trial that would have failed on structure had it been scored.
minor comments (3)
  1. [Abstract] Abstract: the sentence fragment “the three worlds yield composite PASS rates are 6/15” is grammatically incorrect.
  2. [Results] Table or results section: the 15 trials per model/world are not broken down by stage or by error type, making it difficult to trace exactly where the induction failures occur.
  3. [§3] Notation: the symbols used for the four domains of Decay World are introduced without a compact reference table, forcing the reader to hunt through the text on each mention.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §2] Abstract and §2 (The Diagnostic Protocol): the headline composite PASS rates (6/15, 6/15, 0/15) are presented as evidence that the models cannot reliably induce inside the counterfactual frameworks. This interpretation is load-bearing on the claim that locked pre-registrations, fresh sessions, and dual-LLM judging successfully prevent recall of standard physics or cross-stage leakage; yet the text supplies no explicit mechanism (e.g., prompt constraints or exclusion rules) that would block re-interpretation of F=mv as a limiting case of F=ma or keep Decay World quantities orthogonal to training data.

    Authors: We agree that the manuscript would benefit from greater explicitness on the isolation mechanisms. The locked pre-registrations and fresh sessions are the primary safeguards against cross-stage leakage, while the stage-1 prompts contain explicit instructions requiring the model to treat the supplied framework as self-contained and to make no reference to standard physics. We will expand §2 with the precise prompt wording and exclusion rules used to enforce this separation, thereby providing the requested concrete documentation. revision: yes

  2. Referee: [§4] §4 (Methodology Findings) and the judging procedure: the paper reports that LLM-judge reliability does not transfer across frameworks, yet the same dual-LLM judges are used to produce the composite scores for all three worlds. This directly affects the validity of the 0/15 rate in Decay World (where only the content axis is scored) and the cross-world comparison of the 6/15 rates.

    Authors: This observation is correct and highlights a genuine limitation. Because reliability does not transfer, the cross-framework comparisons must be interpreted cautiously. For Decay World the content-only scoring reduces the impact, yet we will revise §4 to state explicitly that the non-transferability qualifies both the 0/15 result and any direct comparison with the 6/15 rates from the other worlds, and we will add a short discussion of the implications for future protocol design. revision: partial

  3. Referee: [§3] §3 (The Three Parallel Worlds) and the Decay World definition: the qualitative-versus-quantitative asymmetry is a central empirical pattern, but the four-domain structure and the precise definition of “content axis only” scoring are not accompanied by an explicit validation that the structural axis is truly out of scope rather than simply unmeasurable; this choice influences the 0/15 result and requires a concrete example of a trial that would have failed on structure had it been scored.

    Authors: We accept the need for additional clarification. The structural axis is out of scope because Decay World comprises four independent domains whose inter-domain consistency cannot be captured by a single structural criterion applicable across all trials. We will add to §3 both an explicit justification of this design choice and a concrete example of a trial response that would have failed a structural check had one been applied, thereby making the rationale for content-only scoring transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation protocol with no derivation chain

full rationale

The paper presents an empirical four-stage diagnostic protocol applied to LLMs across constructed physics worlds, reporting composite PASS rates (6/15, 6/15, 0/15) from direct testing. No equations, fitted parameters, or first-principles derivations are claimed; the work does not reduce any reported outcome to a self-referential definition, fitted input renamed as prediction, or self-citation chain. The protocol description (locked pre-registrations, dual-LLM judging) is presented as a methodological choice without mathematical reduction to its own inputs. This is a standard empirical evaluation paper whose central claims rest on observed test outcomes rather than any load-bearing derivation that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the premise that the diagnostic stages and the constructed worlds isolate genuine reasoning; the three worlds are invented test cases without independent external validation.

axioms (1)
  • domain assumption The four-stage process (induction, formulation, prediction, review) with locked pre-registrations and dual-LLM judging isolates genuine inductive reasoning from pattern recall.
    This premise underpins the entire evaluation and the interpretation of low pass rates.
invented entities (1)
  • Three parallel physics worlds (single-equation F=mv counterfactual, Aristotelian mechanics, four-domain Decay World) no independent evidence
    purpose: Provide unfamiliar frameworks for testing LLM reasoning
    These are constructed counterfactuals introduced by the authors for the diagnostic.

pith-pipeline@v0.9.1-grok · 5818 in / 1363 out tokens · 26200 ms · 2026-07-02T19:16:22.528334+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 23 canonical work pages · 10 internal anchors

  1. [1]

    Remarks on the disproof of the unit distance conjecture

    Noga Alon, Thomas F. Bloom, W. T. Gowers, Daniel Litt, Will Sawin, Arul Shankar, Jacob Tsimerman, Victor Wang, and Melanie Matchett Wood. Remarks on the disproof of the unit distance conjecture. arXiv preprint arXiv:2605.20695, 2026. Human-verified companion to the OpenAI -generated counterexample

  2. [2]

    Battaglia, Jessica B

    Peter W. Battaglia, Jessica B. Hamrick, and Joshua B. Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences (PNAS), 110 0 (45): 0 18327--18332, 2013. doi:10.1073/pnas.1306572110

  3. [3]

    Bear, Elias Wang, Damian Mrowca, Felix J

    Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiau-Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, Li Fei-Fei, Nancy Kanwisher, Joshua B. Tenenbaum, Daniel L. K. Yamins, and Judith E. Fan. Physion: Evaluating physical prediction from vision in humans and machines. In NeurIPS Datasets and Benchmarks, 2021

  4. [4]

    Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes

    Daniil A. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624: 0 570--578, 2023. doi:10.1038/s41586-023-06792-0

  5. [5]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI technical report, https://openai.com/research/video-generation-models-as-world-simulators, 2024

  6. [6]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  7. [7]

    Quantifying Memorization Across Neural Language Models

    Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tram \`e r, and Chiyuan Zhang. Quantifying memorization across neural language models. In ICLR, 2023. arXiv:2202.07646

  8. [8]

    The Science of Mechanics in the Middle Ages

    Marshall Clagett. The Science of Mechanics in the Middle Ages. University of Wisconsin Press, Madison, 1959

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    Die Grundlage der allgemeinen Relativit\"atstheorie

    Albert Einstein. Die Grundlage der allgemeinen Relativit\"atstheorie . Annalen der Physik, 49: 0 769--822, 1916. General Relativity field equations; predicts light bending in a gravitational field, confirmed by the 1919 eclipse

  11. [11]

    Szostkiewicz, Jon M

    Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White, Michaela M. Hinks, and Samuel G. Rodriques. Robin: A multi-agent system for automating scientific discovery. arXiv preprint arXiv:2505.13400, 2025

  12. [12]

    Towards an AI co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864, 2025

  13. [13]

    The Llama 3 Herd of Models

    Aaron Grattafiori et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  14. [14]

    World models

    David Ha and J\" u rgen Schmidhuber. World models. In Advances in Neural Information Processing Systems (NeurIPS), 2018

  15. [15]

    Remarks on an `` Einstein Test '' for AGI

    Demis Hassabis. Remarks on an `` Einstein Test '' for AGI . India AI Impact Summit, February 17, 2026, 2026. Public address

  16. [16]

    Carl G. Hempel. Philosophy of Natural Science. Prentice-Hall, Englewood Cliffs, NJ, 1966

  17. [17]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS Datasets and Benchmarks, 2021

  18. [18]

    Towards Reasoning in Large Language Models: A Survey

    Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2023

  19. [19]

    Astronomia Nova

    Johannes Kepler. Astronomia Nova. 1609. Three laws of planetary motion; geometric formulation. T^2 a^3 appears in Harmonices Mundi (1619)

  20. [20]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  21. [21]

    Thomas S. Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, Chicago, 1962. 2nd edition with Postscript, 1970

  22. [22]

    A path towards autonomous machine intelligence (position paper, version 0.9)

    Yann LeCun. A path towards autonomous machine intelligence (position paper, version 0.9). Technical report, OpenReview, June 2022. URL https://openreview.net/forum?id=BZ5a1r-kVsf

  23. [23]

    G. E. R. Lloyd. Aristotle: The Growth and Structure of his Thought. Cambridge University Press, Cambridge, 1968

  24. [24]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024

  25. [25]

    A dynamical theory of the electromagnetic field

    James Clerk Maxwell. A dynamical theory of the electromagnetic field. Philosophical Transactions of the Royal Society of London, 155: 0 459--512, 1865

  26. [26]

    GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-Symbolic : Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024

  27. [27]

    Philosophi Naturalis Principia Mathematica

    Isaac Newton. Philosophi Naturalis Principia Mathematica . 1687. Universal gravitation as inverse-square attraction; modern algebraic form assembled by later writers

  28. [28]

    Nosek, George Alter, George C

    Brian A. Nosek, George Alter, George C. Banks, Denny Borsboom, Sara D. Bowman, Steven J. Breckler, Stuart Buck, Christopher D. Chambers, Gilbert Chin, Garret Christensen, et al. Promoting an open research culture. Science, 348 0 (6242): 0 1422--1425, 2015

  29. [29]

    Cosmos World Foundation Model Platform for Physical AI

    NVIDIA , Niket Agarwal, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025. Released around CES 2025. Open-license world foundation models for robotics and autonomous-driving applications

  30. [30]

    Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program)

    Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivi \`e re, Alina Beygelzimer, Florence d'Alch \'e Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22 0 (164): 0 1--20, 2021

  31. [31]

    Karl R. Popper. The Logic of Scientific Discovery. Hutchinson, London, 1959. Original German edition: Logik der Forschung, Springer, Vienna, 1934 (imprint 1935)

  32. [32]

    Phybench: Holistic evaluation of physical perception and reasoning in large language models, 2025

    Shi Qiu et al. PHYBench : Holistic evaluation of physical perception and reasoning in large language models. arXiv preprint arXiv:2504.16074, 2025

  33. [33]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA : A graduate-level G oogle-proof Q&A benchmark. In Conference on Language Modeling (COLM), 2024

  34. [34]

    Pawan Kumar, Emilien Dupont, Francisco J

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625: 0 468--475, 2024. doi:10.1038/s41586-023-06924-6

  35. [35]

    NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark

    Oscar Sainz, Jon Ander Campos, Iker Garc \'i a-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Findings of EMNLP, 2023. arXiv:2310.18018

  36. [36]

    Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems (NeurIPS), 2023

    Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems (NeurIPS), 2023

  37. [37]

    Sculley, Jasper Snoek, Alex Wiltschko, and Ali Rahimi

    D. Sculley, Jasper Snoek, Alex Wiltschko, and Ali Rahimi. Winner's curse? On pace, progress, and empirical rigor. In ICLR Workshop Track, 2018

  38. [38]

    Dolma : An open corpus of three trillion tokens for language model pretraining research

    Luca Soldaini, Rodney Kinney, Akshita Bhagia, et al. Dolma : An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024

  39. [39]

    Trinh, Yuhuai Wu, Quoc V

    Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625: 0 476--482, 2024. doi:10.1038/s41586-023-06747-5

  40. [40]

    Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research (TMLR), 2022 a

  41. [41]

    Chi, Quoc V

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022 b

  42. [42]

    The Philosophy of the Inductive Sciences, Founded Upon Their History

    William Whewell. The Philosophy of the Inductive Sciences, Founded Upon Their History. John W. Parker, London, 1840. 2 volumes

  43. [43]

    DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

    Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, Siddharth Mishra-Sharma, Andrew Gordon Wilson, Pavel Izmailov, and Carolina Cuesta-L \'a zaro. DiscoverPhysics : Benchmarking LLMs for out-of-the-box scientific thinking. arXiv preprint arXiv:2605.26087, 2026

  44. [44]

    arXiv preprint arXiv:2307.02477 , year=

    Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Aky \"u rek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. In NAACL, 2024. arXiv:2307.02477

  45. [45]

    A careful examination of large language model performance on grade school arithmetic

    Hugh Zhang, Jeff Da, Dean Lee, et al. A careful examination of large language model performance on grade school arithmetic. arXiv preprint arXiv:2405.00332, 2024

  46. [46]

    Sora and V-JEPA have not learned the complete real world model -- a philosophical analysis of video AIs through the theory of productive imagination

    Jianqiu Zhang. Sora and V-JEPA have not learned the complete real world model -- a philosophical analysis of video AIs through the theory of productive imagination. arXiv preprint arXiv:2407.10311, 2024

  47. [47]

    ABench-Physics : Benchmarking physical reasoning in LLMs via high-difficulty and dynamic physics problems

    Yiming Zhang et al. ABench-Physics : Benchmarking physical reasoning in LLMs via high-difficulty and dynamic physics problems. arXiv preprint arXiv:2507.04766, 2025

  48. [48]

    Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y

    Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, and Simon See. NewtonBench : Benchmarking generalizable scientific law discovery in LLM agents. In ICLR, 2026. arXiv:2510.07172