Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
Pith reviewed 2026-05-15 05:31 UTC · model grok-4.3
The pith
Audited olympiad corpus and RL recipe lift 8B vision model 18 points on physics reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
End-to-end auditing reveals undetected contamination and format biases in public physics evaluations; releasing the cleaned PhysCorp-A, PhysR1Corp, and PhysOlym-A artifacts together with a GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking produces +18.3 pp on PhysOlym-A liberal, +15.7 pp on PhysReason, +6.9 pp on OlympiadBench-Physics, and +4.1 pp on PhyX MCQ across three seeds.
What carries the argument
The three-stage audit (5-gram Jaccard then embedding cosine then LLM judge) that removes duplicates and paraphrases, combined with the GSPO+DAPO reinforcement learning recipe applied to closed-form physics problems.
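As a minimal illustration of how such a cascade can work, the sketch below implements the first two stages with toy stand-ins; the thresholds are hypothetical, and the bag-of-words cosine is only a placeholder for the mxbai-embed-large embeddings the paper names. The third stage (LLM judging) is indicated but stubbed out.

```python
import math
from collections import Counter

def ngrams(text, n=5):
    """Set of word 5-grams, the unit the stage-1 Jaccard check compares."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap of two n-gram sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def cosine(text_a, text_b):
    """Stand-in for embedding cosine: bag-of-words vectors.
    A real pipeline would embed both texts with a sentence encoder."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def audit_pair(train_text, eval_text, jac_thresh=0.5, cos_thresh=0.85):
    """Three-stage audit sketch (thresholds illustrative, not the paper's)."""
    # Stage 1: surface-level near-duplicates via 5-gram Jaccard.
    if jaccard(ngrams(train_text), ngrams(eval_text)) >= jac_thresh:
        return "near-duplicate"
    # Stage 2: semantic similarity surfaces paraphrase candidates.
    if cosine(train_text, eval_text) >= cos_thresh:
        return "paraphrase-candidate (escalate to LLM judge)"
    # Stage 3 (not shown): an LLM judge adjudicates the candidates.
    return "clean"
```

The point of the cascade is that stage 1 alone returns zero hits on paraphrased items (the failure mode the paper documents), while the cheaper stages keep the expensive LLM-judge stage to a small candidate set.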
If this is right
- The PhysOlym-A set supplies a more reliable measure of generalization because it is 99.8 percent novel-source and carries native difficulty labels.
- An 8B model trained with the recipe can exceed the performance of certain 32B and proprietary models on specific audited physics tasks.
- Open-ended olympiad evaluation exposes substantially larger performance gaps than MCQ formats on the same model weights.
- Translation between languages on identical physics problems produces statistically significant accuracy swings of roughly 17 points.
- Releasing the audited corpora allows future training runs to start from contamination-free pools.
Where Pith is reading between the lines
- Similar multi-stage audits could be applied to mathematics or chemistry benchmarks to reduce hidden overlap across those domains as well.
- The GSPO+DAPO recipe might transfer to other multimodal reasoning areas if closed-form problem pools are constructed for them.
- Incorporating the audit step as standard preprocessing could become a practical way to keep large-scale training data cleaner over time.
- Evaluating the trained model on newly created olympiad problems written after the audit date would provide an even stricter test of generalization.
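The GSPO+DAPO recipe is only named in this review, not specified, so the following is a rough, assumption-laden sketch of the family of methods involved: group-relative advantages over rollouts of one problem, a sequence-level, length-normalized importance ratio (the GSPO idea), and an asymmetric "clip-higher" range in the spirit of DAPO. The clip values are illustrative, and DAPO's dynamic sampling (dropping groups whose rewards are all identical) is omitted.

```python
import math

def group_advantages(rewards):
    # Group-relative advantage: z-score the scalar rewards within
    # one group of rollouts sampled for the same problem.
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

def gspo_dapo_surrogate(logp_new, logp_old, lengths, advantages,
                        clip_low=0.2, clip_high=0.28):
    # Sequence-level importance ratio, length-normalized as in GSPO;
    # clip_low/clip_high give an asymmetric range as in DAPO's
    # "clip-higher". logp_* are summed token log-probs per sequence.
    total = 0.0
    for ln, lo, L, a in zip(logp_new, logp_old, lengths, advantages):
        ratio = math.exp((ln - lo) / L)
        clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
        total += min(ratio * a, clipped * a)  # pessimistic PPO-style bound
    return total / len(advantages)
```

Because the advantage is computed within each group, problems the policy always solves (or always fails) contribute zero gradient, which is exactly why closed-form reward pools like PhysR1Corp matter for this style of RL.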
Load-bearing premise
The three-stage audit has removed essentially all contamination so that the new PhysOlym-A set has no overlap with any training data used for the base model or the recipe.
What would settle it
Finding even one problem among the 500 in PhysOlym-A that matches or closely paraphrases an item from the original training pools such as SciInstruct or UGPhysics-Train.
Original abstract
We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) A 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
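The paired statistical machinery quoted above (sign test, McNemar, paired bootstrap CI) is standard and easy to reproduce on per-problem 0/1 outcomes. A minimal sketch, with illustrative data rather than the paper's: note that the exact sign test and exact McNemar coincide on paired binary outcomes, so the paper's slightly different McNemar p likely reflects the chi-square variant.

```python
import random
from math import comb

def sign_test_p(wins, losses):
    """Two-sided exact sign test on discordant pairs (ties dropped)."""
    n = wins + losses
    k = min(wins, losses)
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p)

def paired_bootstrap_ci(x, y, iters=10_000, seed=0):
    """95% CI for the paired accuracy difference mean(x) - mean(y),
    resampling problem indices with replacement."""
    rng = random.Random(seed)
    n = len(x)
    diffs = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(x[i] - y[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * iters)], diffs[int(0.975 * iters)]
```

Running these on the 59 paired Estonian-English outcomes would reproduce the reported p-values and CI, given the per-problem correctness vectors.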
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript audits the multimodal physics evaluation pipeline end-to-end, documenting three construction practices (train-eval contamination, translation drift, and MCQ saturation) that distort vision-language reasoning measurements. It releases four artifacts—PhysCorp-A (6,432-record audited corpus), PhysR1Corp (2,268-record RL pool), PhysOlym-A (500-problem 99.8% novel held-out olympiad eval), and the Physics-R1 GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking—and reports that the recipe lifts the base model by +18.3 pp on PhysOlym-A liberal, +15.7 pp on PhysReason, +6.9 pp on OlympiadBench-Physics, and +4.1 pp on PhyX MCQ across three seeds, with statistical tests (sign test, McNemar, bootstrap CI) and comparisons to closed models.
Significance. If the three-stage audit ensures the new evaluations are free of contamination from both public pools and the base model's full pretraining mixture, and if the reported gains are reproducible, the work supplies valuable audited resources and a practical open recipe that narrows the gap between 8B open models and frontier closed models on olympiad-level visual physics tasks.
Major comments (1)
- [Audit Methodology (abstract and § on three-stage pipeline)] The three-stage audit (5-gram Jaccard then mxbai-embed-large cosine then Haiku-4.5 LLM judge) is described only for the public pools UGPhysics-Train, SciInstruct, and MMK12; no procedure, thresholds, or results are given for checking overlap against the full (non-public) pretraining mixture of Qwen3-VL-8B-Thinking. This is load-bearing for the central performance claims, because any undetected contamination would mean the measured lifts (+18.3 pp on PhysOlym-A liberal, etc.) conflate memorization with the GSPO+DAPO recipe.
Minor comments (2)
- [Audit Methodology] Exact numerical thresholds for the embedding cosine similarity and LLM-judge stages of the audit are not stated, nor are the precise criteria used to classify the 134 near-duplicates and 4,846 paraphrase candidates.
- [Results] The manuscript reports aggregate multi-seed means and CIs but does not tabulate the per-seed scores or variance for all four benchmarks, which would strengthen reproducibility claims.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The concern about auditing the closed pretraining mixture is well-taken and highlights an important limitation of working with proprietary base models. We address it point-by-point below.
Point-by-point responses
Referee: [Audit Methodology (abstract and § on three-stage pipeline)] The three-stage audit (5-gram Jaccard then mxbai-embed-large cosine then Haiku-4.5 LLM judge) is described only for the public pools UGPhysics-Train, SciInstruct, and MMK12; no procedure, thresholds, or results are given for checking overlap against the full (non-public) pretraining mixture of Qwen3-VL-8B-Thinking. This is load-bearing for the central performance claims, because any undetected contamination would mean the measured lifts (+18.3 pp on PhysOlym-A liberal, etc.) conflate memorization with the GSPO+DAPO recipe.
Authors: We agree that an audit against the full proprietary pretraining mixture would be ideal and would further strengthen the contamination-free claim. Unfortunately, the pretraining data for Qwen3-VL-8B-Thinking is not publicly released by the model provider, so no such procedure, thresholds, or results can be provided. Our three-stage audit was applied exhaustively to every public training pool referenced in the literature (UGPhysics-Train, SciInstruct, MMK12), surfacing the 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct. For the new evaluation set, PhysOlym-A was deliberately sourced from olympiad problems whose original contest sources post-date or lie outside public web corpora and the listed training pools; we report 99.8% novel-source status after manual verification of contest provenance. We will add an explicit limitations paragraph in the revised manuscript stating that closed-model pretraining mixtures cannot be audited and that our claims rest on (i) exhaustive public-pool decontamination and (ii) novel-source construction of PhysOlym-A. The reported gains are therefore measured on a held-out set that is verifiably free of the public contamination we document.

Revision status: partial
Remaining open item:
- Full audit of overlap against the non-public pretraining mixture of Qwen3-VL-8B-Thinking, which remains inaccessible to the research community.
Circularity Check
No significant circularity; gains measured on newly audited held-out sets
Full rationale
The paper constructs PhysCorp-A, PhysR1Corp, and PhysOlym-A via documented three-stage audit on public pools (UGPhysics-Train, SciInstruct, MMK12), releases them as artifacts, cold-starts Physics-R1 from external Qwen3-VL-8B-Thinking base using GSPO+DAPO, and reports lifts on the new 99.8% novel PhysOlym-A and other held-out benchmarks. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear; central claims rest on externally verifiable new data rather than quantities defined in terms of the same fitted parameters.
Axiom & Free-Parameter Ledger
Axioms (2)
- [domain assumption] The three-stage audit (Jaccard -> embedding -> LLM judge) fully removes train-eval contamination.
- [standard math] The GSPO+DAPO recipe and the Qwen3-VL-8B-Thinking base can be reproduced from the released artifacts.