pith. machine review for the scientific record

arxiv: 2605.14040 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:31 UTC · model grok-4.3

classification 💻 cs.CL
keywords physics olympiad · vision language model · evaluation audit · multimodal reasoning · reinforcement learning · benchmark contamination · held-out evaluation · visual physics

The pith

Audited olympiad corpus and RL recipe lift 8B vision model 18 points on physics reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard evaluations for vision-language models on physics problems are distorted by hidden train-eval overlaps, language translation effects, and the ease of multiple-choice formats. It applies a three-stage audit to clean existing pools and releases a new 500-problem held-out olympiad set along with an RL recipe trained on closed-form problems. Starting from an 8B base model, the resulting Physics-R1 achieves clear gains across the audited benchmarks; these gains place the small model ahead of some larger systems on certain tasks while still trailing top closed models. The work establishes that careful data filtering plus targeted reinforcement learning can produce measurable progress on hard visual physics problems.

Core claim

End-to-end auditing reveals undetected contamination and format biases in public physics evaluations; the paper releases the cleaned PhysCorp-A, PhysR1Corp, and PhysOlym-A artifacts, and a GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking and trained on those pools produces +18.3 pp on PhysOlym-A liberal, +15.7 pp on PhysReason, +6.9 pp on OlympiadBench-Physics, and +4.1 pp on PhyX MCQ across three seeds.
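For orientation, the sketch below shows the group-level objective a GSPO+DAPO recipe of this kind plausibly optimizes: GSPO's length-normalized, sequence-level importance ratio over group-relative advantages, plus two DAPO-style tweaks (asymmetric "clip-higher" and dynamic-sampling filtering of zero-variance groups). The epsilon values are the DAPO paper's published defaults, not necessarily this paper's settings, and the code illustrates the cited algorithms rather than the authors' training stack.

```python
import torch

def gspo_group_loss(logp_new: torch.Tensor,  # (G,) summed token log-probs, current policy
                    logp_old: torch.Tensor,  # (G,) same under the behavior policy
                    lengths: torch.Tensor,   # (G,) response lengths in tokens
                    rewards: torch.Tensor,   # (G,) 0/1 verifier scores on closed-form answers
                    eps_low: float = 0.2,    # DAPO default, assumed
                    eps_high: float = 0.28): # DAPO "clip-higher" default, assumed
    """Loss for one group of G sampled responses to the same prompt."""
    std = rewards.std(unbiased=False)
    # DAPO dynamic sampling: a group where every response earns the same
    # reward carries no learning signal; the caller resamples prompts instead.
    if std == 0:
        return None
    adv = (rewards - rewards.mean()) / std          # group-relative advantage
    # GSPO: importance ratio taken at the sequence level, normalized by length.
    ratio = torch.exp((logp_new - logp_old.detach()) / lengths)
    clipped = ratio.clamp(1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic PPO-style objective, negated into a loss to minimize.
    return -torch.minimum(ratio * adv, clipped * adv).mean()
```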

What carries the argument

The three-stage audit (5-gram Jaccard then embedding cosine then LLM judge) that removes duplicates and paraphrases, combined with the GSPO+DAPO reinforcement learning recipe applied to closed-form physics problems.
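A minimal sketch of that three-stage cascade, assuming plug-in callables for the embedding model (mxbai-embed-large in the paper) and the LLM judge (Haiku-4.5). The jaccard_min and cosine_min thresholds are placeholders, since, as the referee notes below, the paper's exact cutoffs are not stated.

```python
from typing import Callable

def ngrams(text: str, n: int = 5) -> set:
    """Lowercased word n-grams, the unit of the stage-1 lexical check."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def audit_pair(train_item: str, eval_item: str,
               embed: Callable[[str], list],           # e.g. an mxbai-embed-large wrapper
               llm_judge: Callable[[str, str], bool],  # e.g. a Haiku-4.5 paraphrase prompt
               jaccard_min: float = 0.5,               # placeholder threshold
               cosine_min: float = 0.85) -> str:       # placeholder threshold
    """Classify a pair as 'near-duplicate', 'paraphrase-candidate', or 'clean'."""
    # Stage 1: verbatim and near-verbatim overlap via 5-gram Jaccard.
    if jaccard(ngrams(train_item), ngrams(eval_item)) >= jaccard_min:
        return "near-duplicate"
    # Stage 2: embedding cosine catches rewordings that stage 1 misses.
    va, vb = embed(train_item), embed(eval_item)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = sum(x * x for x in va) ** 0.5 * sum(y * y for y in vb) ** 0.5
    if norm and dot / norm >= cosine_min:
        # Stage 3: only embedding-flagged pairs reach the expensive LLM judge.
        return "paraphrase-candidate" if llm_judge(train_item, eval_item) else "clean"
    return "clean"
```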

If this is right

  • The PhysOlym-A set supplies a more reliable measure of generalization because it is 99.8 percent novel-source and carries native difficulty labels.
  • An 8B model trained with the recipe can exceed the performance of certain 32B and proprietary models on specific audited physics tasks.
  • Open-ended olympiad evaluation exposes substantially larger performance gaps than MCQ formats on the same model weights.
  • Translation between languages on identical physics problems produces statistically significant accuracy swings of roughly 17 points (a toy sketch of the paired tests behind this figure follows this list).
  • Releasing the audited corpora allows future training runs to start from contamination-free pools.
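On the translation point, here is a toy sketch of the paired tests the abstract reports (sign test, exact McNemar, paired bootstrap CI). For binary paired scores the sign test and exact McNemar reduce to the same binomial test on discordant pairs; the paper's two distinct p-values (0.011 vs. 0.021) suggest it used different variants. The data below is illustrative, not the 59-problem EN/ET subset.

```python
import math
import random

def exact_binom_p(k: int, n: int) -> float:
    """Two-sided exact binomial p-value against p = 0.5."""
    pmf = [math.comb(n, i) * 0.5 ** n for i in range(n + 1)]
    return min(1.0, sum(p for p in pmf if p <= pmf[k] + 1e-12))

def paired_tests(a: list, b: list, reps: int = 10_000):
    """Sign-test p-value and 95% paired-bootstrap CI for accuracy(a) - accuracy(b)."""
    wins = sum(x > y for x, y in zip(a, b))    # a right, b wrong
    losses = sum(x < y for x, y in zip(a, b))  # b right, a wrong
    p_sign = exact_binom_p(wins, wins + losses)
    # Paired bootstrap: resample problems, keeping each problem's two scores together.
    n = len(a)
    deltas = sorted(
        sum(a[i] - b[i] for i in random.choices(range(n), k=n)) / n
        for _ in range(reps)
    )
    ci = (deltas[int(0.025 * reps)], deltas[int(0.975 * reps)])
    return p_sign, ci

# Toy usage: 0/1 scores for the same ten problems in two languages.
en = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
et = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0]
p, ci = paired_tests(en, et)
```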

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multi-stage audits could be applied to mathematics or chemistry benchmarks to reduce hidden overlap across those domains as well.
  • The GSPO+DAPO recipe might transfer to other multimodal reasoning areas if closed-form problem pools are constructed for them.
  • Incorporating the audit step as standard preprocessing could become a practical way to keep large-scale training data cleaner over time (a hypothetical usage pass is sketched after this list).
  • Evaluating the trained model on newly created olympiad problems written after the audit date would provide an even stricter test of generalization.
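Continuing the audit sketch above, such a preprocessing pass might look like the following; train_pool, eval_set, embed, and llm_judge are stand-in names, not the paper's artifacts.

```python
# Hypothetical decontamination pass built on audit_pair from the earlier
# sketch. A real pipeline would cache embeddings and run the cheap Jaccard
# stage in bulk before escalating to the embedding and judge stages.
clean_pool = [
    rec for rec in train_pool
    if all(audit_pair(rec, ev, embed, llm_judge) == "clean"
           for ev in eval_set)
]
```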

Load-bearing premise

The three-stage audit has removed essentially all contamination so that the new PhysOlym-A set has no overlap with any training data used for the base model or the recipe.

What would settle it

Finding even one problem among the 500 in PhysOlym-A that matches or closely paraphrases an item from the original training pools such as SciInstruct or UGPhysics-Train.

read the original abstract

We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) A 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript audits the multimodal physics evaluation pipeline end-to-end, documenting three construction practices (train-eval contamination, translation drift, and MCQ saturation) that distort vision-language reasoning measurements. It releases four artifacts—PhysCorp-A (6,432-record audited corpus), PhysR1Corp (2,268-record RL pool), PhysOlym-A (500-problem 99.8% novel held-out olympiad eval), and the Physics-R1 GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking—and reports that the recipe lifts the base model by +18.3 pp on PhysOlym-A liberal, +15.7 pp on PhysReason, +6.9 pp on OlympiadBench-Physics, and +4.1 pp on PhyX MCQ across three seeds, with statistical tests (sign test, McNemar, bootstrap CI) and comparisons to closed models.

Significance. If the three-stage audit ensures the new evaluations are free of contamination from both public pools and the base model's full pretraining mixture, and if the reported gains are reproducible, the work supplies valuable audited resources and a practical open recipe that narrows the gap between 8B open models and frontier closed models on olympiad-level visual physics tasks.

major comments (1)
  1. [Audit Methodology (abstract and § on three-stage pipeline)] The three-stage audit (5-gram Jaccard then mxbai-embed-large cosine then Haiku-4.5 LLM judge) is described only for the public pools UGPhysics-Train, SciInstruct, and MMK12; no procedure, thresholds, or results are given for checking overlap against the full (non-public) pretraining mixture of Qwen3-VL-8B-Thinking. This is load-bearing for the central performance claims, because any undetected contamination would mean the measured lifts (+18.3 pp on PhysOlym-A liberal, etc.) conflate memorization with the GSPO+DAPO recipe.
minor comments (2)
  1. [Audit Methodology] Exact numerical thresholds for the embedding cosine similarity and LLM-judge stages of the audit are not stated, nor are the precise criteria used to classify the 134 near-duplicates and 4,846 paraphrase candidates.
  2. [Results] The manuscript reports aggregate multi-seed means and CIs but does not tabulate the per-seed scores or variance for all four benchmarks, which would strengthen reproducibility claims.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the detailed and constructive review. The concern about auditing the closed pretraining mixture is well-taken and highlights an important limitation of working with proprietary base models. We address it point-by-point below.

read point-by-point responses
  1. Referee: [Audit Methodology (abstract and § on three-stage pipeline)] The three-stage audit (5-gram Jaccard then mxbai-embed-large cosine then Haiku-4.5 LLM judge) is described only for the public pools UGPhysics-Train, SciInstruct, and MMK12; no procedure, thresholds, or results are given for checking overlap against the full (non-public) pretraining mixture of Qwen3-VL-8B-Thinking. This is load-bearing for the central performance claims, because any undetected contamination would mean the measured lifts (+18.3 pp on PhysOlym-A liberal, etc.) conflate memorization with the GSPO+DAPO recipe.

    Authors: We agree that an audit against the full proprietary pretraining mixture would be ideal and would further strengthen the contamination-free claim. Unfortunately, the pretraining data for Qwen3-VL-8B-Thinking is not publicly released by the model provider, so no such procedure, thresholds, or results can be provided. Our three-stage audit was applied exhaustively to every public training pool referenced in the literature (UGPhysics-Train, SciInstruct, MMK12), surfacing the 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct. For the new evaluation set, PhysOlym-A was deliberately sourced from olympiad problems whose original contest sources post-date or lie outside public web corpora and the listed training pools; we report 99.8% novel-source status after manual verification of contest provenance. We will add an explicit limitations paragraph in the revised manuscript stating that closed-model pretraining mixtures cannot be audited and that our claims rest on (i) exhaustive public-pool decontamination and (ii) novel-source construction of PhysOlym-A. The reported gains are therefore measured on a held-out set that is verifiably free of the public contamination we document.

    revision: partial

standing simulated objections not resolved
  • Full audit of overlap against the non-public pretraining mixture of Qwen3-VL-8B-Thinking, which remains inaccessible to the research community.

Circularity Check

0 steps flagged

No significant circularity; gains measured on newly audited held-out sets

full rationale

The paper constructs PhysCorp-A, PhysR1Corp, and PhysOlym-A via documented three-stage audit on public pools (UGPhysics-Train, SciInstruct, MMK12), releases them as artifacts, cold-starts Physics-R1 from external Qwen3-VL-8B-Thinking base using GSPO+DAPO, and reports lifts on the new 99.8% novel PhysOlym-A and other held-out benchmarks. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear; central claims rest on externally verifiable new data rather than quantities defined in terms of the same fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the three-stage audit correctly identifying and removing contamination and on the new PhysOlym-A set being free of overlap with any data used to train the base model or the recipe.

axioms (2)
  • domain assumption The three-stage audit (Jaccard -> embedding -> LLM judge) fully removes train-eval contamination
    Invoked in the construction of PhysCorp-A and PhysOlym-A and in the claim that prior evals were contaminated
  • standard math The GSPO+DAPO recipe and Qwen3-VL-8B-Thinking base can be reproduced from the released artifacts
    Standard assumption in ML training papers when artifacts are released

pith-pipeline@v0.9.0 · 5751 in / 1563 out tokens · 66042 ms · 2026-05-15T05:31:35.296432+00:00 · methodology

discussion (0)

