pith. machine review for the scientific record

arxiv: 2605.14040 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:31 UTC · model grok-4.3

classification 💻 cs.CL
keywords physics olympiad · vision language model · evaluation audit · multimodal reasoning · reinforcement learning · benchmark contamination · held-out evaluation · visual physics

The pith

Audited olympiad corpus and RL recipe lift 8B vision model 18 points on physics reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard evaluations for vision-language models on physics problems are distorted by hidden train-eval overlaps, language translation effects, and the ease of multiple-choice formats. It applies a three-stage audit to clean existing pools and releases a new 500-problem held-out olympiad set along with an RL recipe trained on closed-form problems. Starting from an 8B base model, the resulting Physics-R1 achieves clear gains across the audited benchmarks; these gains place the small model ahead of some larger systems on certain tasks while still trailing top closed models. The work establishes that careful data filtering plus targeted reinforcement learning can produce measurable progress on hard visual physics problems.

Core claim

End-to-end auditing reveals undetected contamination and format biases in public physics evaluations; the paper releases the cleaned PhysCorp-A, PhysR1Corp, and PhysOlym-A artifacts, and a GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking and trained on those pools produces +18.3 pp on PhysOlym-A liberal, +15.7 pp on PhysReason, +6.9 pp on OlympiadBench-Physics, and +4.1 pp on PhyX MCQ across three seeds.
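For orientation, the sketch below shows the group-level objective a GSPO+DAPO recipe of this kind plausibly optimizes: GSPO's length-normalized, sequence-level importance ratio over group-relative advantages, plus two DAPO-style tweaks (asymmetric "clip-higher" and dynamic-sampling filtering of zero-variance groups). The epsilon values are the DAPO paper's published defaults, not necessarily this paper's settings, and the code illustrates the cited algorithms rather than the authors' training stack.

```python
import torch

def gspo_group_loss(logp_new: torch.Tensor,  # (G,) summed token log-probs, current policy
                    logp_old: torch.Tensor,  # (G,) same under the behavior policy
                    lengths: torch.Tensor,   # (G,) response lengths in tokens
                    rewards: torch.Tensor,   # (G,) 0/1 verifier scores on closed-form answers
                    eps_low: float = 0.2,    # DAPO default, assumed
                    eps_high: float = 0.28): # DAPO "clip-higher" default, assumed
    """Loss for one group of G sampled responses to the same prompt."""
    std = rewards.std(unbiased=False)
    # DAPO dynamic sampling: a group where every response earns the same
    # reward carries no learning signal; the caller resamples prompts instead.
    if std == 0:
        return None
    adv = (rewards - rewards.mean()) / std          # group-relative advantage
    # GSPO: importance ratio taken at the sequence level, normalized by length.
    ratio = torch.exp((logp_new - logp_old.detach()) / lengths)
    clipped = ratio.clamp(1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic PPO-style objective, negated into a loss to minimize.
    return -torch.minimum(ratio * adv, clipped * adv).mean()
```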

What carries the argument

The three-stage audit (5-gram Jaccard then embedding cosine then LLM judge) that removes duplicates and paraphrases, combined with the GSPO+DAPO reinforcement learning recipe applied to closed-form physics problems.
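A minimal sketch of that three-stage cascade, assuming plug-in callables for the embedding model (mxbai-embed-large in the paper) and the LLM judge (Haiku-4.5). The jaccard_min and cosine_min thresholds are placeholders, since, as the referee notes below, the paper's exact cutoffs are not stated.

```python
from typing import Callable

def ngrams(text: str, n: int = 5) -> set:
    """Lowercased word n-grams, the unit of the stage-1 lexical check."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def audit_pair(train_item: str, eval_item: str,
               embed: Callable[[str], list],           # e.g. an mxbai-embed-large wrapper
               llm_judge: Callable[[str, str], bool],  # e.g. a Haiku-4.5 paraphrase prompt
               jaccard_min: float = 0.5,               # placeholder threshold
               cosine_min: float = 0.85) -> str:       # placeholder threshold
    """Classify a pair as 'near-duplicate', 'paraphrase-candidate', or 'clean'."""
    # Stage 1: verbatim and near-verbatim overlap via 5-gram Jaccard.
    if jaccard(ngrams(train_item), ngrams(eval_item)) >= jaccard_min:
        return "near-duplicate"
    # Stage 2: embedding cosine catches rewordings that stage 1 misses.
    va, vb = embed(train_item), embed(eval_item)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = sum(x * x for x in va) ** 0.5 * sum(y * y for y in vb) ** 0.5
    if norm and dot / norm >= cosine_min:
        # Stage 3: only embedding-flagged pairs reach the expensive LLM judge.
        return "paraphrase-candidate" if llm_judge(train_item, eval_item) else "clean"
    return "clean"
```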

If this is right

  • The PhysOlym-A set supplies a more reliable measure of generalization because it is 99.8 percent novel-source and carries native difficulty labels.
  • An 8B model trained with the recipe can exceed the performance of certain 32B and proprietary models on specific audited physics tasks.
  • Open-ended olympiad evaluation exposes substantially larger performance gaps than MCQ formats on the same model weights.
  • Translation between languages on identical physics problems produces statistically significant accuracy swings of roughly 17 points (a toy sketch of the paired tests behind this figure follows this list).
  • Releasing the audited corpora allows future training runs to start from contamination-free pools.
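On the translation point, here is a toy sketch of the paired tests the abstract reports (sign test, exact McNemar, paired bootstrap CI). For binary paired scores the sign test and exact McNemar reduce to the same binomial test on discordant pairs; the paper's two distinct p-values (0.011 vs. 0.021) suggest it used different variants. The data below is illustrative, not the 59-problem EN/ET subset.

```python
import math
import random

def exact_binom_p(k: int, n: int) -> float:
    """Two-sided exact binomial p-value against p = 0.5."""
    pmf = [math.comb(n, i) * 0.5 ** n for i in range(n + 1)]
    return min(1.0, sum(p for p in pmf if p <= pmf[k] + 1e-12))

def paired_tests(a: list, b: list, reps: int = 10_000):
    """Sign-test p-value and 95% paired-bootstrap CI for accuracy(a) - accuracy(b)."""
    wins = sum(x > y for x, y in zip(a, b))    # a right, b wrong
    losses = sum(x < y for x, y in zip(a, b))  # b right, a wrong
    p_sign = exact_binom_p(wins, wins + losses)
    # Paired bootstrap: resample problems, keeping each problem's two scores together.
    n = len(a)
    deltas = sorted(
        sum(a[i] - b[i] for i in random.choices(range(n), k=n)) / n
        for _ in range(reps)
    )
    ci = (deltas[int(0.025 * reps)], deltas[int(0.975 * reps)])
    return p_sign, ci

# Toy usage: 0/1 scores for the same ten problems in two languages.
en = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
et = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0]
p, ci = paired_tests(en, et)
```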

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multi-stage audits could be applied to mathematics or chemistry benchmarks to reduce hidden overlap across those domains as well.
  • The GSPO+DAPO recipe might transfer to other multimodal reasoning areas if closed-form problem pools are constructed for them.
  • Incorporating the audit step as standard preprocessing could become a practical way to keep large-scale training data cleaner over time (a hypothetical usage pass is sketched after this list).
  • Evaluating the trained model on newly created olympiad problems written after the audit date would provide an even stricter test of generalization.
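Continuing the audit sketch above, such a preprocessing pass might look like the following; train_pool, eval_set, embed, and llm_judge are stand-in names, not the paper's artifacts.

```python
# Hypothetical decontamination pass built on audit_pair from the earlier
# sketch. A real pipeline would cache embeddings and run the cheap Jaccard
# stage in bulk before escalating to the embedding and judge stages.
clean_pool = [
    rec for rec in train_pool
    if all(audit_pair(rec, ev, embed, llm_judge) == "clean"
           for ev in eval_set)
]
```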

Load-bearing premise

The three-stage audit has removed essentially all contamination so that the new PhysOlym-A set has no overlap with any training data used for the base model or the recipe.

What would settle it

Finding even one problem among the 500 in PhysOlym-A that matches or closely paraphrases an item from the original training pools such as SciInstruct or UGPhysics-Train.

read the original abstract

We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) A 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript audits the multimodal physics evaluation pipeline end-to-end, documenting three construction practices (train-eval contamination, translation drift, and MCQ saturation) that distort vision-language reasoning measurements. It releases four artifacts—PhysCorp-A (6,432-record audited corpus), PhysR1Corp (2,268-record RL pool), PhysOlym-A (500-problem 99.8% novel held-out olympiad eval), and the Physics-R1 GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking—and reports that the recipe lifts the base model by +18.3 pp on PhysOlym-A liberal, +15.7 pp on PhysReason, +6.9 pp on OlympiadBench-Physics, and +4.1 pp on PhyX MCQ across three seeds, with statistical tests (sign test, McNemar, bootstrap CI) and comparisons to closed models.

Significance. If the three-stage audit ensures the new evaluations are free of contamination from both public pools and the base model's full pretraining mixture, and if the reported gains are reproducible, the work supplies valuable audited resources and a practical open recipe that narrows the gap between 8B open models and frontier closed models on olympiad-level visual physics tasks.

major comments (1)
  1. [Audit Methodology (abstract and § on three-stage pipeline)] The three-stage audit (5-gram Jaccard then mxbai-embed-large cosine then Haiku-4.5 LLM judge) is described only for the public pools UGPhysics-Train, SciInstruct, and MMK12; no procedure, thresholds, or results are given for checking overlap against the full (non-public) pretraining mixture of Qwen3-VL-8B-Thinking. This is load-bearing for the central performance claims, because any undetected contamination would mean the measured lifts (+18.3 pp on PhysOlym-A liberal, etc.) conflate memorization with the GSPO+DAPO recipe.
minor comments (2)
  1. [Audit Methodology] Exact numerical thresholds for the embedding cosine similarity and LLM-judge stages of the audit are not stated, nor are the precise criteria used to classify the 134 near-duplicates and 4,846 paraphrase candidates.
  2. [Results] The manuscript reports aggregate multi-seed means and CIs but does not tabulate the per-seed scores or variance for all four benchmarks, which would strengthen reproducibility claims.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the detailed and constructive review. The concern about auditing the closed pretraining mixture is well-taken and highlights an important limitation of working with proprietary base models. We address it point-by-point below.

read point-by-point responses
  1. Referee: [Audit Methodology (abstract and § on three-stage pipeline)] The three-stage audit (5-gram Jaccard then mxbai-embed-large cosine then Haiku-4.5 LLM judge) is described only for the public pools UGPhysics-Train, SciInstruct, and MMK12; no procedure, thresholds, or results are given for checking overlap against the full (non-public) pretraining mixture of Qwen3-VL-8B-Thinking. This is load-bearing for the central performance claims, because any undetected contamination would mean the measured lifts (+18.3 pp on PhysOlym-A liberal, etc.) conflate memorization with the GSPO+DAPO recipe.

    Authors: We agree that an audit against the full proprietary pretraining mixture would be ideal and would further strengthen the contamination-free claim. Unfortunately, the pretraining data for Qwen3-VL-8B-Thinking is not publicly released by the model provider, so no such procedure, thresholds, or results can be provided. Our three-stage audit was applied exhaustively to every public training pool referenced in the literature (UGPhysics-Train, SciInstruct, MMK12), surfacing the 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct. For the new evaluation set, PhysOlym-A was deliberately sourced from olympiad problems whose original contest sources post-date or lie outside public web corpora and the listed training pools; we report 99.8% novel-source status after manual verification of contest provenance. We will add an explicit limitations paragraph in the revised manuscript stating that closed-model pretraining mixtures cannot be audited and that our claims rest on (i) exhaustive public-pool decontamination and (ii) novel-source construction of PhysOlym-A. The reported gains are therefore measured on a held-out set that is verifiably free of the public contamination we document.

    revision: partial

standing simulated objections not resolved
  • Full audit of overlap against the non-public pretraining mixture of Qwen3-VL-8B-Thinking, which remains inaccessible to the research community.

Circularity Check

0 steps flagged

No significant circularity; gains measured on newly audited held-out sets

full rationale

The paper constructs PhysCorp-A, PhysR1Corp, and PhysOlym-A via documented three-stage audit on public pools (UGPhysics-Train, SciInstruct, MMK12), releases them as artifacts, cold-starts Physics-R1 from external Qwen3-VL-8B-Thinking base using GSPO+DAPO, and reports lifts on the new 99.8% novel PhysOlym-A and other held-out benchmarks. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear; central claims rest on externally verifiable new data rather than quantities defined in terms of the same fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the three-stage audit correctly identifying and removing contamination and on the new PhysOlym-A set being free of overlap with any data used to train the base model or the recipe.

axioms (2)
  • domain assumption The three-stage audit (Jaccard -> embedding -> LLM judge) fully removes train-eval contamination
    Invoked in the construction of PhysCorp-A and PhysOlym-A and in the claim that prior evals were contaminated
  • standard math The GSPO+DAPO recipe and Qwen3-VL-8B-Thinking base can be reproduced from the released artifacts
    Standard assumption in ML training papers when artifacts are released

pith-pipeline@v0.9.0 · 5751 in / 1563 out tokens · 66042 ms · 2026-05-15T05:31:35.296432+00:00 · methodology

discussion (0)

