Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

Dong Zhang

arxiv: 2607.00276 · v1 · pith:FCRSSLX2new · submitted 2026-06-30 · 💻 cs.LG · cs.AI· cs.CL

Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

Dong Zhang This is my paper

Pith reviewed 2026-07-02 19:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords LLM evaluationphysics reasoninginductive reasoningcounterfactual worldsdiagnostic protocolmodel self-reviewbenchmarking

0 comments

The pith

Frontier LLMs pass at most 6 out of 15 trials when forced to reason inside three unfamiliar physics frameworks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a four-stage diagnostic that requires an LLM to induce rules from an unfamiliar physics world, formulate equations, make predictions, and review its own work. The protocol uses locked pre-registrations, fresh sessions, and dual-LLM judging to reduce contamination from training data. It is applied to three parallel worlds: a counterfactual F equals mv mechanics, historical Aristotelian mechanics, and a four-domain Decay World. Across three frontier models the composite pass rates are 6/15, 6/15, and 0/15, with a consistent pattern that models rarely err on the direction of change yet frequently compute the wrong quantitative ratio by reverting to standard physics. The work also documents that LLM judges fail to transfer across frameworks and that models' self-review misses errors in at least two-thirds of trials that contain them.

Core claim

The four-stage diagnostic applied to the F=mv world, Aristotelian mechanics, and Decay World shows that Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro achieve composite pass rates of only 6/15, 6/15, and 0/15 respectively when content and structural correctness are both required (content axis only for Decay World). Models exhibit a qualitative-versus-quantitative asymmetry in Decay World, correctly predicting direction of change yet slipping to standard-physics ratios. LLM-judge reliability does not transfer across frameworks, and Stage 4 self-review wrongly reports no error in at least two-thirds of trials that contain one.

What carries the argument

Four-stage diagnostic protocol that evaluates induction, formulation, prediction, and review inside locked parallel physics worlds using pre-registrations, fresh sessions, and dual-LLM judging.

If this is right

Accuracy-only physics benchmarks cannot distinguish genuine reasoning from pattern recall.
Models display a systematic qualitative-versus-quantitative asymmetry when operating inside new frameworks.
LLM-based judges for evaluation do not generalize across different physics worlds.
Model self-review fails to detect earlier errors in the majority of trials that contain them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The diagnostic could be reused to test inductive reasoning in domains other than physics without changing the protocol structure.
Persistent reversion to standard-physics ratios suggests that current training data distributions dominate even when explicit instructions require a new framework.
If the asymmetry holds, training objectives that reward only directional correctness would leave quantitative prediction errors unaddressed.

Load-bearing premise

The constructed parallel worlds and four-stage protocol with locked pre-registrations and dual-LLM judging successfully isolate genuine inductive reasoning from pattern recall or contamination.

What would settle it

A fourth novel physics world, constructed identically to the three tested worlds, on which the same models and protocol produce pass rates above 12/15.

Figures

Figures reproduced from arXiv: 2607.00276 by Dong Zhang.

**Figure 2.** Figure 2: Where the 15 trials end up in each framework: composite PASS, or the stage at which the trial [PITH_FULL_IMAGE:figures/full_fig_p017_2.png] view at source ↗

read the original abstract

Current large-language-model (LLM) physics benchmarks are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem patterns and reveals little about where a model's reasoning breaks down. We introduce an auditable four-stage diagnostic that evaluates whether an LLM can reason inside an unfamiliar physics framework through induction, formulation, prediction, and review. The diagnostic combines locked pre-registrations, fresh sessions between stages, dual-LLM judging, and a human-audit pathway, and we apply it to three parallel physics worlds: a single-equation counterfactual world ($F=mv$), a historical framework (Aristotelian mechanics), and a four-domain counterfactual world (Decay World). Across Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro, the three worlds yield composite PASS rates are 6/15, 6/15, and 0/15 respectively (content $\land$ structural for $F=mv$ and Aristotelian, content axis only for Decay World where the structural axis is out of scope). The most pointed empirical pattern is a qualitative-versus-quantitative asymmetry: in Decay World, models almost never predict the wrong direction of change, but frequently compute the wrong ratio by slipping back to standard-physics relations. The protocol also surfaces two methodology findings: LLM-judge reliability does not transfer across frameworks, and Stage 4 self-review is weak in every framework, with the model's own review wrongly reporting no earlier error in at least two-thirds of the trials that actually contained one. We release the full prompts, responses, verdicts, and audit records.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a four-stage protocol with pre-regs and dual judging to test LLM induction in three counterfactual physics worlds, but the abstract leaves the isolation from recall unproven.

read the letter

The paper's core offering is a controlled four-stage diagnostic meant to push LLMs into inductive reasoning inside unfamiliar physics setups instead of pattern recall. It applies the method to an F=mv counterfactual, Aristotelian mechanics, and a Decay World, reporting composite pass rates of 6/15, 6/15, and 0/15.

What is actually new is the full package: locked pre-registrations, fresh sessions between stages, dual-LLM judging plus human audit, all run across multiple worlds. Releasing the prompts, responses, verdicts, and audit records is a clear positive; anyone can inspect the raw material.

The work also surfaces two practical findings worth keeping: LLM judges lose reliability when the framework changes, and model self-review misses errors in at least two-thirds of cases. The reported asymmetry in Decay World, where direction is usually correct but ratios slip back to standard physics, is a usable observation.

The soft spot is whether the protocol really forces induction. The abstract states the worlds are designed to be unfamiliar, yet gives no concrete mechanism for blocking reinterpretation as variants of known physics or leakage from training data. The judge-transfer problem the authors themselves flag makes the composite scores harder to trust. Low pass rates are reported, but without seeing the exact prompts and exclusion rules it is difficult to separate task difficulty from a genuine test of reasoning.

This is for people building or using LLM benchmarks in science domains. The protocol is documented and the data will be public, so the paper deserves a serious referee even if the central claim needs tighter validation on the isolation question.

Referee Report

3 major / 3 minor

Summary. The paper introduces an auditable four-stage diagnostic protocol (locked pre-registrations, fresh sessions, dual-LLM judging, human-audit pathway) to test whether frontier LLMs can perform inductive reasoning inside three unfamiliar parallel physics worlds: a single-equation counterfactual (F=mv), Aristotelian mechanics, and a four-domain Decay World. It reports composite PASS rates (content ∧ structural for the first two worlds; content only for Decay World) of 6/15, 6/15, and 0/15 across Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro, documents a qualitative-versus-quantitative asymmetry in Decay World, and releases all prompts, responses, verdicts, and audit records. Two methodology findings are also reported: LLM-judge reliability does not transfer across frameworks, and Stage 4 self-review is weak.

Significance. If the protocol successfully isolates genuine inductive reasoning from pattern recall or contamination, the work supplies a reproducible, auditable evaluation framework that goes beyond accuracy-based physics benchmarks and surfaces concrete failure modes (e.g., reversion to standard-physics ratios). The open release of full materials is a clear strength that supports verifiability and future replication.

major comments (3)

[Abstract and §2] Abstract and §2 (The Diagnostic Protocol): the headline composite PASS rates (6/15, 6/15, 0/15) are presented as evidence that the models cannot reliably induce inside the counterfactual frameworks. This interpretation is load-bearing on the claim that locked pre-registrations, fresh sessions, and dual-LLM judging successfully prevent recall of standard physics or cross-stage leakage; yet the text supplies no explicit mechanism (e.g., prompt constraints or exclusion rules) that would block re-interpretation of F=mv as a limiting case of F=ma or keep Decay World quantities orthogonal to training data.
[§4] §4 (Methodology Findings) and the judging procedure: the paper reports that LLM-judge reliability does not transfer across frameworks, yet the same dual-LLM judges are used to produce the composite scores for all three worlds. This directly affects the validity of the 0/15 rate in Decay World (where only the content axis is scored) and the cross-world comparison of the 6/15 rates.
[§3] §3 (The Three Parallel Worlds) and the Decay World definition: the qualitative-versus-quantitative asymmetry is a central empirical pattern, but the four-domain structure and the precise definition of “content axis only” scoring are not accompanied by an explicit validation that the structural axis is truly out of scope rather than simply unmeasurable; this choice influences the 0/15 result and requires a concrete example of a trial that would have failed on structure had it been scored.

minor comments (3)

[Abstract] Abstract: the sentence fragment “the three worlds yield composite PASS rates are 6/15” is grammatically incorrect.
[Results] Table or results section: the 15 trials per model/world are not broken down by stage or by error type, making it difficult to trace exactly where the induction failures occur.
[§3] Notation: the symbols used for the four domains of Decay World are introduced without a compact reference table, forcing the reader to hunt through the text on each mention.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and §2] Abstract and §2 (The Diagnostic Protocol): the headline composite PASS rates (6/15, 6/15, 0/15) are presented as evidence that the models cannot reliably induce inside the counterfactual frameworks. This interpretation is load-bearing on the claim that locked pre-registrations, fresh sessions, and dual-LLM judging successfully prevent recall of standard physics or cross-stage leakage; yet the text supplies no explicit mechanism (e.g., prompt constraints or exclusion rules) that would block re-interpretation of F=mv as a limiting case of F=ma or keep Decay World quantities orthogonal to training data.

Authors: We agree that the manuscript would benefit from greater explicitness on the isolation mechanisms. The locked pre-registrations and fresh sessions are the primary safeguards against cross-stage leakage, while the stage-1 prompts contain explicit instructions requiring the model to treat the supplied framework as self-contained and to make no reference to standard physics. We will expand §2 with the precise prompt wording and exclusion rules used to enforce this separation, thereby providing the requested concrete documentation. revision: yes
Referee: [§4] §4 (Methodology Findings) and the judging procedure: the paper reports that LLM-judge reliability does not transfer across frameworks, yet the same dual-LLM judges are used to produce the composite scores for all three worlds. This directly affects the validity of the 0/15 rate in Decay World (where only the content axis is scored) and the cross-world comparison of the 6/15 rates.

Authors: This observation is correct and highlights a genuine limitation. Because reliability does not transfer, the cross-framework comparisons must be interpreted cautiously. For Decay World the content-only scoring reduces the impact, yet we will revise §4 to state explicitly that the non-transferability qualifies both the 0/15 result and any direct comparison with the 6/15 rates from the other worlds, and we will add a short discussion of the implications for future protocol design. revision: partial
Referee: [§3] §3 (The Three Parallel Worlds) and the Decay World definition: the qualitative-versus-quantitative asymmetry is a central empirical pattern, but the four-domain structure and the precise definition of “content axis only” scoring are not accompanied by an explicit validation that the structural axis is truly out of scope rather than simply unmeasurable; this choice influences the 0/15 result and requires a concrete example of a trial that would have failed on structure had it been scored.

Authors: We accept the need for additional clarification. The structural axis is out of scope because Decay World comprises four independent domains whose inter-domain consistency cannot be captured by a single structural criterion applicable across all trials. We will add to §3 both an explicit justification of this design choice and a concrete example of a trial response that would have failed a structural check had one been applied, thereby making the rationale for content-only scoring transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation protocol with no derivation chain

full rationale

The paper presents an empirical four-stage diagnostic protocol applied to LLMs across constructed physics worlds, reporting composite PASS rates (6/15, 6/15, 0/15) from direct testing. No equations, fitted parameters, or first-principles derivations are claimed; the work does not reduce any reported outcome to a self-referential definition, fitted input renamed as prediction, or self-citation chain. The protocol description (locked pre-registrations, dual-LLM judging) is presented as a methodological choice without mathematical reduction to its own inputs. This is a standard empirical evaluation paper whose central claims rest on observed test outcomes rather than any load-bearing derivation that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the premise that the diagnostic stages and the constructed worlds isolate genuine reasoning; the three worlds are invented test cases without independent external validation.

axioms (1)

domain assumption The four-stage process (induction, formulation, prediction, review) with locked pre-registrations and dual-LLM judging isolates genuine inductive reasoning from pattern recall.
This premise underpins the entire evaluation and the interpretation of low pass rates.

invented entities (1)

Three parallel physics worlds (single-equation F=mv counterfactual, Aristotelian mechanics, four-domain Decay World) no independent evidence
purpose: Provide unfamiliar frameworks for testing LLM reasoning
These are constructed counterfactuals introduced by the authors for the diagnostic.

pith-pipeline@v0.9.1-grok · 5818 in / 1363 out tokens · 26200 ms · 2026-07-02T19:16:22.528334+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 23 canonical work pages · 10 internal anchors

[1]

Remarks on the disproof of the unit distance conjecture

Noga Alon, Thomas F. Bloom, W. T. Gowers, Daniel Litt, Will Sawin, Arul Shankar, Jacob Tsimerman, Victor Wang, and Melanie Matchett Wood. Remarks on the disproof of the unit distance conjecture. arXiv preprint arXiv:2605.20695, 2026. Human-verified companion to the OpenAI -generated counterexample

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Battaglia, Jessica B

Peter W. Battaglia, Jessica B. Hamrick, and Joshua B. Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences (PNAS), 110 0 (45): 0 18327--18332, 2013. doi:10.1073/pnas.1306572110

work page doi:10.1073/pnas.1306572110 2013
[3]

Bear, Elias Wang, Damian Mrowca, Felix J

Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiau-Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, Li Fei-Fei, Nancy Kanwisher, Joshua B. Tenenbaum, Daniel L. K. Yamins, and Judith E. Fan. Physion: Evaluating physical prediction from vision in humans and machines. In NeurIPS Datasets and Benchmarks, 2021

2021
[4]

Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes

Daniil A. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624: 0 570--578, 2023. doi:10.1038/s41586-023-06792-0

work page doi:10.1038/s41586-023-06792-0 2023
[5]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI technical report, https://openai.com/research/video-generation-models-as-world-simulators, 2024

2024
[6]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

2020
[7]

Quantifying Memorization Across Neural Language Models

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tram \`e r, and Chiyuan Zhang. Quantifying memorization across neural language models. In ICLR, 2023. arXiv:2202.07646

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

The Science of Mechanics in the Middle Ages

Marshall Clagett. The Science of Mechanics in the Middle Ages. University of Wisconsin Press, Madison, 1959

1959
[9]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Die Grundlage der allgemeinen Relativit\"atstheorie

Albert Einstein. Die Grundlage der allgemeinen Relativit\"atstheorie . Annalen der Physik, 49: 0 769--822, 1916. General Relativity field equations; predicts light bending in a gravitational field, confirmed by the 1919 eclipse

1916
[11]

Szostkiewicz, Jon M

Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White, Michaela M. Hinks, and Samuel G. Rodriques. Robin: A multi-agent system for automating scientific discovery. arXiv preprint arXiv:2505.13400, 2025

work page arXiv 2025
[12]

Towards an AI co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

World models

David Ha and J\" u rgen Schmidhuber. World models. In Advances in Neural Information Processing Systems (NeurIPS), 2018

2018
[15]

Remarks on an `` Einstein Test '' for AGI

Demis Hassabis. Remarks on an `` Einstein Test '' for AGI . India AI Impact Summit, February 17, 2026, 2026. Public address

2026
[16]

Carl G. Hempel. Philosophy of Natural Science. Prentice-Hall, Englewood Cliffs, NJ, 1966

1966
[17]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS Datasets and Benchmarks, 2021

2021
[18]

Towards Reasoning in Large Language Models: A Survey

Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Astronomia Nova

Johannes Kepler. Astronomia Nova. 1609. Three laws of planetary motion; geometric formulation. T^2 a^3 appears in Harmonices Mundi (1619)
[20]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems (NeurIPS), 2022

2022
[21]

Thomas S. Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, Chicago, 1962. 2nd edition with Postscript, 1970

1962
[22]

A path towards autonomous machine intelligence (position paper, version 0.9)

Yann LeCun. A path towards autonomous machine intelligence (position paper, version 0.9). Technical report, OpenReview, June 2022. URL https://openreview.net/forum?id=BZ5a1r-kVsf

2022
[23]

G. E. R. Lloyd. Aristotle: The Growth and Structure of his Thought. Cambridge University Press, Cambridge, 1968

1968
[24]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

A dynamical theory of the electromagnetic field

James Clerk Maxwell. A dynamical theory of the electromagnetic field. Philosophical Transactions of the Royal Society of London, 155: 0 459--512, 1865
[26]

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-Symbolic : Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Philosophi Naturalis Principia Mathematica

Isaac Newton. Philosophi Naturalis Principia Mathematica . 1687. Universal gravitation as inverse-square attraction; modern algebraic form assembled by later writers
[28]

Nosek, George Alter, George C

Brian A. Nosek, George Alter, George C. Banks, Denny Borsboom, Sara D. Bowman, Steven J. Breckler, Stuart Buck, Christopher D. Chambers, Gilbert Chin, Garret Christensen, et al. Promoting an open research culture. Science, 348 0 (6242): 0 1422--1425, 2015

2015
[29]

Cosmos World Foundation Model Platform for Physical AI

NVIDIA , Niket Agarwal, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025. Released around CES 2025. Open-license world foundation models for robotics and autonomous-driving applications

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program)

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivi \`e re, Alina Beygelzimer, Florence d'Alch \'e Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22 0 (164): 0 1--20, 2021

2019
[31]

Karl R. Popper. The Logic of Scientific Discovery. Hutchinson, London, 1959. Original German edition: Logik der Forschung, Springer, Vienna, 1934 (imprint 1935)

1959
[32]

Phybench: Holistic evaluation of physical perception and reasoning in large language models, 2025

Shi Qiu et al. PHYBench : Holistic evaluation of physical perception and reasoning in large language models. arXiv preprint arXiv:2504.16074, 2025

work page arXiv 2025
[33]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA : A graduate-level G oogle-proof Q&A benchmark. In Conference on Language Modeling (COLM), 2024

2024
[34]

Pawan Kumar, Emilien Dupont, Francisco J

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625: 0 468--475, 2024. doi:10.1038/s41586-023-06924-6

work page doi:10.1038/s41586-023-06924-6 2024
[35]

NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark

Oscar Sainz, Jon Ander Campos, Iker Garc \'i a-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Findings of EMNLP, 2023. arXiv:2310.18018

work page arXiv 2023
[36]

Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems (NeurIPS), 2023

Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023
[37]

Sculley, Jasper Snoek, Alex Wiltschko, and Ali Rahimi

D. Sculley, Jasper Snoek, Alex Wiltschko, and Ali Rahimi. Winner's curse? On pace, progress, and empirical rigor. In ICLR Workshop Track, 2018

2018
[38]

Dolma : An open corpus of three trillion tokens for language model pretraining research

Luca Soldaini, Rodney Kinney, Akshita Bhagia, et al. Dolma : An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024

work page arXiv 2024
[39]

Trinh, Yuhuai Wu, Quoc V

Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625: 0 476--482, 2024. doi:10.1038/s41586-023-06747-5

work page doi:10.1038/s41586-023-06747-5 2024
[40]

Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research (TMLR), 2022 a

2022
[41]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022 b

2022
[42]

The Philosophy of the Inductive Sciences, Founded Upon Their History

William Whewell. The Philosophy of the Inductive Sciences, Founded Upon Their History. John W. Parker, London, 1840. 2 volumes
[43]

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, Siddharth Mishra-Sharma, Andrew Gordon Wilson, Pavel Izmailov, and Carolina Cuesta-L \'a zaro. DiscoverPhysics : Benchmarking LLMs for out-of-the-box scientific thinking. arXiv preprint arXiv:2605.26087, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[44]

arXiv preprint arXiv:2307.02477 , year=

Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Aky \"u rek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. In NAACL, 2024. arXiv:2307.02477

work page arXiv 2024
[45]

A careful examination of large language model performance on grade school arithmetic

Hugh Zhang, Jeff Da, Dean Lee, et al. A careful examination of large language model performance on grade school arithmetic. arXiv preprint arXiv:2405.00332, 2024

work page arXiv 2024
[46]

Sora and V-JEPA have not learned the complete real world model -- a philosophical analysis of video AIs through the theory of productive imagination

Jianqiu Zhang. Sora and V-JEPA have not learned the complete real world model -- a philosophical analysis of video AIs through the theory of productive imagination. arXiv preprint arXiv:2407.10311, 2024

work page arXiv 2024
[47]

ABench-Physics : Benchmarking physical reasoning in LLMs via high-difficulty and dynamic physics problems

Yiming Zhang et al. ABench-Physics : Benchmarking physical reasoning in LLMs via high-difficulty and dynamic physics problems. arXiv preprint arXiv:2507.04766, 2025

work page arXiv 2025
[48]

Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y

Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, and Simon See. NewtonBench : Benchmarking generalizable scientific law discovery in LLM agents. In ICLR, 2026. arXiv:2510.07172

work page arXiv 2026

[1] [1]

Remarks on the disproof of the unit distance conjecture

Noga Alon, Thomas F. Bloom, W. T. Gowers, Daniel Litt, Will Sawin, Arul Shankar, Jacob Tsimerman, Victor Wang, and Melanie Matchett Wood. Remarks on the disproof of the unit distance conjecture. arXiv preprint arXiv:2605.20695, 2026. Human-verified companion to the OpenAI -generated counterexample

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Battaglia, Jessica B

Peter W. Battaglia, Jessica B. Hamrick, and Joshua B. Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences (PNAS), 110 0 (45): 0 18327--18332, 2013. doi:10.1073/pnas.1306572110

work page doi:10.1073/pnas.1306572110 2013

[3] [3]

Bear, Elias Wang, Damian Mrowca, Felix J

Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiau-Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, Li Fei-Fei, Nancy Kanwisher, Joshua B. Tenenbaum, Daniel L. K. Yamins, and Judith E. Fan. Physion: Evaluating physical prediction from vision in humans and machines. In NeurIPS Datasets and Benchmarks, 2021

2021

[4] [4]

Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes

Daniil A. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624: 0 570--578, 2023. doi:10.1038/s41586-023-06792-0

work page doi:10.1038/s41586-023-06792-0 2023

[5] [5]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI technical report, https://openai.com/research/video-generation-models-as-world-simulators, 2024

2024

[6] [6]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

2020

[7] [7]

Quantifying Memorization Across Neural Language Models

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tram \`e r, and Chiyuan Zhang. Quantifying memorization across neural language models. In ICLR, 2023. arXiv:2202.07646

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

The Science of Mechanics in the Middle Ages

Marshall Clagett. The Science of Mechanics in the Middle Ages. University of Wisconsin Press, Madison, 1959

1959

[9] [9]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[10] [10]

Die Grundlage der allgemeinen Relativit\"atstheorie

Albert Einstein. Die Grundlage der allgemeinen Relativit\"atstheorie . Annalen der Physik, 49: 0 769--822, 1916. General Relativity field equations; predicts light bending in a gravitational field, confirmed by the 1919 eclipse

1916

[11] [11]

Szostkiewicz, Jon M

Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Jon M. Laurent, Muhammed T. Razzak, Andrew D. White, Michaela M. Hinks, and Samuel G. Rodriques. Robin: A multi-agent system for automating scientific discovery. arXiv preprint arXiv:2505.13400, 2025

work page arXiv 2025

[12] [12]

Towards an AI co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, et al. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

World models

David Ha and J\" u rgen Schmidhuber. World models. In Advances in Neural Information Processing Systems (NeurIPS), 2018

2018

[15] [15]

Remarks on an `` Einstein Test '' for AGI

Demis Hassabis. Remarks on an `` Einstein Test '' for AGI . India AI Impact Summit, February 17, 2026, 2026. Public address

2026

[16] [16]

Carl G. Hempel. Philosophy of Natural Science. Prentice-Hall, Englewood Cliffs, NJ, 1966

1966

[17] [17]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS Datasets and Benchmarks, 2021

2021

[18] [18]

Towards Reasoning in Large Language Models: A Survey

Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

Astronomia Nova

Johannes Kepler. Astronomia Nova. 1609. Three laws of planetary motion; geometric formulation. T^2 a^3 appears in Harmonices Mundi (1619)

[20] [20]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems (NeurIPS), 2022

2022

[21] [21]

Thomas S. Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, Chicago, 1962. 2nd edition with Postscript, 1970

1962

[22] [22]

A path towards autonomous machine intelligence (position paper, version 0.9)

Yann LeCun. A path towards autonomous machine intelligence (position paper, version 0.9). Technical report, OpenReview, June 2022. URL https://openreview.net/forum?id=BZ5a1r-kVsf

2022

[23] [23]

G. E. R. Lloyd. Aristotle: The Growth and Structure of his Thought. Cambridge University Press, Cambridge, 1968

1968

[24] [24]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

A dynamical theory of the electromagnetic field

James Clerk Maxwell. A dynamical theory of the electromagnetic field. Philosophical Transactions of the Royal Society of London, 155: 0 459--512, 1865

[26] [26]

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-Symbolic : Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Philosophi Naturalis Principia Mathematica

Isaac Newton. Philosophi Naturalis Principia Mathematica . 1687. Universal gravitation as inverse-square attraction; modern algebraic form assembled by later writers

[28] [28]

Nosek, George Alter, George C

Brian A. Nosek, George Alter, George C. Banks, Denny Borsboom, Sara D. Bowman, Steven J. Breckler, Stuart Buck, Christopher D. Chambers, Gilbert Chin, Garret Christensen, et al. Promoting an open research culture. Science, 348 0 (6242): 0 1422--1425, 2015

2015

[29] [29]

Cosmos World Foundation Model Platform for Physical AI

NVIDIA , Niket Agarwal, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025. Released around CES 2025. Open-license world foundation models for robotics and autonomous-driving applications

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program)

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivi \`e re, Alina Beygelzimer, Florence d'Alch \'e Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22 0 (164): 0 1--20, 2021

2019

[31] [31]

Karl R. Popper. The Logic of Scientific Discovery. Hutchinson, London, 1959. Original German edition: Logik der Forschung, Springer, Vienna, 1934 (imprint 1935)

1959

[32] [32]

Phybench: Holistic evaluation of physical perception and reasoning in large language models, 2025

Shi Qiu et al. PHYBench : Holistic evaluation of physical perception and reasoning in large language models. arXiv preprint arXiv:2504.16074, 2025

work page arXiv 2025

[33] [33]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA : A graduate-level G oogle-proof Q&A benchmark. In Conference on Language Modeling (COLM), 2024

2024

[34] [34]

Pawan Kumar, Emilien Dupont, Francisco J

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625: 0 468--475, 2024. doi:10.1038/s41586-023-06924-6

work page doi:10.1038/s41586-023-06924-6 2024

[35] [35]

NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark

Oscar Sainz, Jon Ander Campos, Iker Garc \'i a-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Findings of EMNLP, 2023. arXiv:2310.18018

work page arXiv 2023

[36] [36]

Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems (NeurIPS), 2023

Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023

[37] [37]

Sculley, Jasper Snoek, Alex Wiltschko, and Ali Rahimi

D. Sculley, Jasper Snoek, Alex Wiltschko, and Ali Rahimi. Winner's curse? On pace, progress, and empirical rigor. In ICLR Workshop Track, 2018

2018

[38] [38]

Dolma : An open corpus of three trillion tokens for language model pretraining research

Luca Soldaini, Rodney Kinney, Akshita Bhagia, et al. Dolma : An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024

work page arXiv 2024

[39] [39]

Trinh, Yuhuai Wu, Quoc V

Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625: 0 476--482, 2024. doi:10.1038/s41586-023-06747-5

work page doi:10.1038/s41586-023-06747-5 2024

[40] [40]

Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research (TMLR), 2022 a

2022

[41] [41]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022 b

2022

[42] [42]

The Philosophy of the Inductive Sciences, Founded Upon Their History

William Whewell. The Philosophy of the Inductive Sciences, Founded Upon Their History. John W. Parker, London, 1840. 2 volumes

[43] [43]

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, Siddharth Mishra-Sharma, Andrew Gordon Wilson, Pavel Izmailov, and Carolina Cuesta-L \'a zaro. DiscoverPhysics : Benchmarking LLMs for out-of-the-box scientific thinking. arXiv preprint arXiv:2605.26087, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[44] [44]

arXiv preprint arXiv:2307.02477 , year=

Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Aky \"u rek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. In NAACL, 2024. arXiv:2307.02477

work page arXiv 2024

[45] [45]

A careful examination of large language model performance on grade school arithmetic

Hugh Zhang, Jeff Da, Dean Lee, et al. A careful examination of large language model performance on grade school arithmetic. arXiv preprint arXiv:2405.00332, 2024

work page arXiv 2024

[46] [46]

Sora and V-JEPA have not learned the complete real world model -- a philosophical analysis of video AIs through the theory of productive imagination

Jianqiu Zhang. Sora and V-JEPA have not learned the complete real world model -- a philosophical analysis of video AIs through the theory of productive imagination. arXiv preprint arXiv:2407.10311, 2024

work page arXiv 2024

[47] [47]

ABench-Physics : Benchmarking physical reasoning in LLMs via high-difficulty and dynamic physics problems

Yiming Zhang et al. ABench-Physics : Benchmarking physical reasoning in LLMs via high-difficulty and dynamic physics problems. arXiv preprint arXiv:2507.04766, 2025

work page arXiv 2025

[48] [48]

Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y

Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, and Simon See. NewtonBench : Benchmarking generalizable scientific law discovery in LLM agents. In ICLR, 2026. arXiv:2510.07172

work page arXiv 2026