{"paper":{"title":"Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Audited olympiad corpus and RL recipe lift 8B vision model 18 points on physics reasoning.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Shan Yang","submitted_at":"2026-05-13T19:00:57Z","abstract_excerpt":"We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 p"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics, and +4.1 pp on PhyX MCQ.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The three-stage audit (5-gram Jaccard then embedding cosine then LLM judge) has removed essentially all contamination and the new PhysOlym-A set is truly held-out with no overlap to any training data used for the base model or the recipe.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Audited olympiad corpus and Physics-R1 recipe improve 8B VLM by up to 18 points on held-out physics problems while exposing contamination in prior evals.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Audited olympiad corpus and RL recipe lift 8B vision model 18 points on physics reasoning.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c5fd5b8c73bddcbe9b3eb47395f15bcd82633177951a1ca522bc5b9dd052d6e5"},"source":{"id":"2605.14040","kind":"arxiv","version":1},"verdict":{"id":"0fae45ee-4962-4eb8-87d6-8681e27ec0a0","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T05:31:35.296432Z","strongest_claim":"Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics, and +4.1 pp on PhyX MCQ.","one_line_summary":"Audited olympiad corpus and Physics-R1 recipe improve 8B VLM by up to 18 points on held-out physics problems while exposing contamination in prior evals.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The three-stage audit (5-gram Jaccard then embedding cosine then LLM judge) has removed essentially all contamination and the new PhysOlym-A set is truly held-out with no overlap to any training data used for the base model or the recipe.","pith_extraction_headline":"Audited olympiad corpus and RL recipe lift 8B vision model 18 points on physics reasoning."},"references":{"count":52,"sample":[{"doi":"","year":null,"title":"Shen, Hui and Wu, Taiqiang and Han, Qi and others , journal=","work_id":"63d35c63-43bc-45e0-91e4-2f7ad0538808","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"He, Chaoqun and Luo, Renjie and Bai, Yuzhuo and Hu, Shengding and others , booktitle=","work_id":"162e041c-7707-46b5-8ca1-ed10e4fc72ba","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Xu, Xin and Xu, Qiyun and Xiao, Tong and others , journal=","work_id":"54060fa6-ed23-41ac-ba14-f825a38a45af","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Zhang, Xinyu and Dong, Yuxuan and Wu, Yanrui and others , journal=","work_id":"3c778b29-142e-47e5-8b61-f7a98dc895e4","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Yue, Xiang and Ni, Yuansheng and Zhang, Kai and Zheng, Tianyu and others , booktitle=","work_id":"fab91dec-8cbe-497b-b77d-7aed4a1876fe","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":52,"snapshot_sha256":"816bb11743460de72db65b69c88cb31f068311902f186d88feb64c99c911120e","internal_anchors":3},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}