PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving

Arman Cohan; Chen Zhao; John Sous; Kaiyue Feng; Tianyu Yang; Yilun Zhao; Yixin Liu

arxiv: 2503.21821 · v1 · pith:AYTLSQJ6new · submitted 2025-03-26 · 💻 cs.AI

Kaiyue Feng, Yilun Zhao, Yixin Liu, Tianyu Yang, Chen Zhao, John Sous, Arman Cohan

Pith reviewed 2026-05-22 23:08 UTC · model grok-4.3

classification 💻 cs.AI

keywords physics benchmark · foundation models · university physics · problem solving · automated evaluation · o3-mini · retrieval augmented generation

The pith

Even the strongest foundation model solves only 59.9 percent of university-level physics problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PHYSICS, a benchmark of 1297 expert-annotated problems spanning classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. It evaluates leading models and reports that o3-mini, the best performer, reaches just 59.9 percent accuracy on tasks that demand both advanced physics knowledge and mathematical reasoning. The authors also examine error patterns, prompting techniques, and retrieval-augmented generation to locate specific weaknesses. These results establish a concrete measure of current AI limits in high-level scientific problem solving.

Core claim

The PHYSICS benchmark shows that current foundation models, even the most advanced, achieve at most 59.9 percent accuracy when solving university-level physics problems that require integration of domain knowledge and multi-step mathematical reasoning.

What carries the argument

The PHYSICS benchmark of 1297 problems, paired with an automated evaluation system that validates model answers against expert solutions.

If this is right

Current models require stronger mechanisms for combining physics principles with algebraic and calculus-based reasoning.
Retrieval-augmented generation and prompting provide only modest gains and do not close the performance gap.
Error analysis identifies recurring failure modes that future training regimes can target directly.
The benchmark supplies a stable test set for measuring incremental progress in scientific reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Progress on this benchmark may require new architectures that maintain long chains of physical constraints rather than relying on pattern matching alone.
The same evaluation pipeline could be adapted to create parallel benchmarks in chemistry or engineering.
Low scores suggest that scaling model size without physics-specific data curation will leave these gaps largely unchanged.

Load-bearing premise

The selected problems accurately represent typical university physics coursework and the automated grader correctly scores model outputs without systematic mistakes.

What would settle it

Human experts re-grading a random sample of model answers and finding substantially higher accuracy than the automated system reports.
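
A minimal sketch of how such a re-grading check could be summarized, assuming paired pass/fail labels from the automated grader and from human experts on the same sampled answers. The function name `regrade_summary` and the toy labels are hypothetical illustrations, not data or code from the paper.

```python
def regrade_summary(auto_labels, human_labels):
    """Compare automated and human pass/fail labels on the same sampled answers.

    Both arguments are equal-length lists of booleans (True = marked correct).
    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if not auto_labels or len(auto_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")

    n = len(auto_labels)
    pairs = list(zip(auto_labels, human_labels))
    return {
        # Share of sampled answers each grader marked correct.
        "auto_accuracy": sum(auto_labels) / n,
        "human_accuracy": sum(human_labels) / n,
        # Answers humans accepted but the automated grader rejected.
        "false_negatives": sum(1 for a, h in pairs if h and not a),
        # Answers the automated grader accepted but humans rejected.
        "false_positives": sum(1 for a, h in pairs if a and not h),
    }


# Made-up labels for four sampled answers, for illustration only.
auto = [True, False, False, True]
human = [True, True, False, True]
summary = regrade_summary(auto, human)
print(summary)
```

A human accuracy well above the automated accuracy, driven mostly by false negatives, would be the pattern that undermines the headline figure. A small gap in either direction would support it.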

Original abstract

We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Each problem requires advanced physics knowledge and mathematical reasoning. We develop a robust automated evaluation system for precise and reliable validation. Our evaluation of leading foundation models reveals substantial limitations. Even the most advanced model, o3-mini, achieves only 59.9% accuracy, highlighting significant challenges in solving high-level scientific problems. Through comprehensive error analysis, exploration of diverse prompting strategies, and Retrieval-Augmented Generation (RAG)-based knowledge augmentation, we identify key areas for improvement, laying the foundation for future advancements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The PHYSICS benchmark adds a useful new dataset for university physics, but the reliability of the automated evaluator is not shown, so the 59.9% figure for o3-mini needs verification before the 'substantial limitations' conclusion can be taken at face value.

The paper's main contribution is the PHYSICS benchmark: 1297 expert-annotated problems across classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics, plus an automated scoring system and follow-up experiments on prompting and RAG. That is a concrete addition to the set of science benchmarks, and its scale and domain coverage are better than most existing ones. The error analysis and prompting tests are straightforward and useful for people who want to improve model performance on these problems.

The 59.9% figure for o3-mini is the headline result, but it depends entirely on whether the automated grader correctly accepts mathematically equivalent answers, handles units, and applies reasonable numerical tolerances. The abstract calls the system 'robust' without describing the mechanism or reporting agreement with human graders on edge cases. If the grader relies on simple string matching or overly strict numerical matching, the reported accuracies are likely depressed by false negatives, which would make the 'significant challenges' conclusion look stronger than the data support. Problem selection criteria and inter-annotator numbers are also missing from the abstract, though the full paper may contain them.

This is the kind of work that belongs in a reading group for people building or evaluating scientific reasoning systems. It is not yet strong enough to support a definitive statement about current model limits, but the dataset itself is worth having. I would send it to peer review with the expectation that referees will focus on the evaluator validation and the representativeness of the 1297 problems. Revisions on those points would make it a solid reference benchmark.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PHYSICS, a benchmark of 1297 expert-annotated university-level physics problems across six areas (classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, optics). It evaluates leading foundation models using an automated evaluation system that the authors describe as robust, reports that o3-mini reaches only 59.9% accuracy, and includes error analysis plus experiments on prompting strategies and RAG-based augmentation.

Significance. If the benchmark construction and automated evaluator prove reliable, the work would provide a valuable, high-difficulty testbed for scientific reasoning in AI, documenting clear performance gaps even in frontier models. The additional analyses of prompting and RAG supply concrete directions for improvement and strengthen the paper's utility beyond raw accuracy numbers.

major comments (2)

[Abstract] The central performance claim (o3-mini at 59.9% accuracy) rests on the automated evaluation system correctly recognizing equivalent physics answers, including symbolic equivalence, numerical tolerances, and unit variants. No mechanism, validation against human graders, or inter-rater statistics are described, so it is impossible to determine whether the reported accuracy understates or accurately reflects model capability.
[Benchmark construction] (Implied in abstract and results.) The claim that the 1297 problems are representative of university-level physics requires explicit selection criteria and inter-annotator agreement metrics. Their absence is load-bearing because the headline conclusion of 'significant challenges' cannot be assessed without evidence that the problems are appropriately difficult and consistently annotated.
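
To make the first major comment concrete, the sketch below contrasts exact string matching with a tolerance-based numeric comparison after unit conversion. It is an illustration under stated assumptions, not the paper's evaluator: the unit table `UNIT_TO_SI`, the helper names, the "value unit" answer format, and the 1% relative tolerance are all hypothetical. Symbolic equivalence is out of scope here.

```python
import math

# Hypothetical conversion factors to metres; a real grader would need far more.
UNIT_TO_SI = {"m": 1.0, "cm": 1e-2, "mm": 1e-3, "km": 1e3}


def parse_quantity(text):
    """Parse an answer like '2.5 cm' into a value in metres.

    Raises ValueError if the text is not 'number unit', KeyError if the
    unit is not in UNIT_TO_SI.
    """
    value_str, unit = text.strip().split()
    return float(value_str) * UNIT_TO_SI[unit]


def string_match(pred, gold):
    """Strict grader: accept only textually identical answers."""
    return pred.strip() == gold.strip()


def tolerant_match(pred, gold, rel_tol=1e-2):
    """Lenient grader: accept answers equal within rel_tol after unit conversion.

    The 1% default tolerance is an arbitrary assumption for this sketch.
    """
    try:
        return math.isclose(parse_quantity(pred), parse_quantity(gold), rel_tol=rel_tol)
    except (ValueError, KeyError):
        # Unparseable answers are rejected rather than guessed at.
        return False


print(string_match("2.5 cm", "0.025 m"))    # strict grader rejects a correct answer
print(tolerant_match("2.5 cm", "0.025 m"))  # lenient grader accepts it
```

The strict grader here produces a false negative on a correct answer written in different units. That is the failure mode that would depress reported accuracy if the real evaluator behaved similarly.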

minor comments (1)

[Abstract] The abstract states the benchmark covers 'six core areas' but does not list the exact distribution of problems per area; adding a table or breakdown would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to incorporate additional details on the evaluation system and benchmark construction.

Referee: [Abstract] The central performance claim (o3-mini at 59.9% accuracy) rests on the automated evaluation system correctly recognizing equivalent physics answers, including symbolic equivalence, numerical tolerances, and unit variants. No mechanism, validation against human graders, or inter-rater statistics are described, so it is impossible to determine whether the reported accuracy understates or accurately reflects model capability.

Authors: We agree that the reliability of the automated evaluator is central to the headline result and that the current manuscript does not provide sufficient detail on its mechanisms or human validation. We will expand the Methods section with explicit descriptions of the equivalence rules (symbolic, numerical tolerances, unit variants), add a validation experiment comparing the system to human expert grading on a held-out sample of problems, and report agreement statistics. These additions will appear in the revised version.
Revision: yes

Referee: [Benchmark construction] (Implied in abstract and results.) The claim that the 1297 problems are representative of university-level physics requires explicit selection criteria and inter-annotator agreement metrics. Their absence is load-bearing because the headline conclusion of 'significant challenges' cannot be assessed without evidence that the problems are appropriately difficult and consistently annotated.

Authors: We acknowledge that the manuscript does not currently state explicit selection criteria or report inter-annotator agreement. We will add a dedicated subsection describing the problem sourcing process (standard university curricula and exams across the six domains), the annotation protocol, and quantitative inter-annotator agreement metrics. This will allow readers to assess representativeness and annotation consistency.
Revision: yes
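
The agreement statistics promised in both responses could take several forms. One common choice for two raters with binary labels is Cohen's kappa, sketched below. This is an assumption about what such a report might contain, not a description of what the authors computed. The function name and toy labels are hypothetical.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning binary labels (1 = correct, 0 = incorrect).

    Illustrative helper; the rebuttal does not say which statistic will be used.
    """
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("need two non-empty label lists of equal length")

    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement from each rater's marginal rate of labelling 1.
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both raters used a single identical label throughout; kappa is
        # formally undefined, and 1.0 is returned here by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels: automated grader vs. human expert on four answers.
print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0]))  # perfect agreement
print(cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0]))  # agreement at chance level
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most answers fall into one class. The same statistic could be applied to grader-versus-human validation or to agreement between annotators.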

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark evaluation

The paper introduces a new dataset of 1297 physics problems and reports model accuracies from direct testing. There is no derivation chain, no fitted parameter renamed as a prediction, and no load-bearing self-citation. Results are obtained by running models on the benchmark and applying an automated evaluator; nothing reduces to its own inputs by construction. The evaluation system is presented as a tool rather than a derived claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations, free parameters, axioms, or invented entities; the contribution is the dataset and evaluation results.

pith-pipeline@v0.9.0 · 5668 in / 1086 out tokens · 81342 ms · 2026-05-22T23:08:01.211338+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fine-Tuning Small Reasoning Models for Quantum Field Theory
cs.LG · 2026-04 · unverdicted · novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.