{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:MMV3S43NX2NZS5FKPHO33KGNPX","short_pith_number":"pith:MMV3S43N","schema_version":"1.0","canonical_sha256":"632bb9736dbe9b9974aa79ddbda8cd7dc09d9c242dc0470ab304df9b8ecd32c5","source":{"kind":"arxiv","id":"2506.10947","version":2},"attestation_state":"computed","paper":{"title":"Spurious Rewards: Rethinking Training Signals in RLVR","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"Reinforcement learning with verifiable rewards improves math performance in some models even when rewards are random or spurious.","cross_cats":["cs.LG"],"primary_cat":"cs.AI","authors_text":"Hannaneh Hajishirzi, Luke Zettlemoyer, Nathan Lambert, Pang Wei Koh, Ranjay Krishna, Rui Xin, Rulin Shao, Scott Geng, Sewon Min, Sewoong Oh, Shuyue Stella Li, Simon Shaolei Du, Yiping Wang, Yulia Tsvetkov","submitted_at":"2025-06-12T17:49:55Z","abstract_excerpt":"We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learn"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2506.10947","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"cs.AI","submitted_at":"2025-06-12T17:49:55Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"54ce252e6e4a74379d16b07187cb77f39b51b53b770d20a1668851db184f1cc0","abstract_canon_sha256":"02836581b64f996450743dd106f6e4e88786e5671e9d6a5c5158087a906901cd"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.721631Z","signature_b64":"QGnyQ9Ly3wzkf3lnsqOQKtRfGm7rUsWkmdvgyZDl+XkDjUGSjjC43g3+cbNkb+cEozXAGFcR13jhU/XVSpsBBw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"632bb9736dbe9b9974aa79ddbda8cd7dc09d9c242dc0470ab304df9b8ecd32c5","last_reissued_at":"2026-05-17T23:38:47.721148Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.721148Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Spurious Rewards: Rethinking Training Signals in RLVR","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"Reinforcement learning with verifiable rewards improves math performance in some models even when rewards are random or spurious.","cross_cats":["cs.LG"],"primary_cat":"cs.AI","authors_text":"Hannaneh Hajishirzi, Luke Zettlemoyer, Nathan Lambert, Pang Wei Koh, Ranjay Krishna, Rui Xin, Rulin Shao, Scott Geng, Sewon Min, Sewoong Oh, Shuyue Stella Li, Simon Shaolei Du, Yiping Wang, Yulia Tsvetkov","submitted_at":"2025-06-12T17:49:55Z","abstract_excerpt":"We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learn"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that the performance gains with spurious rewards are primarily driven by the clipping bias in GRPO amplifying specific pretraining behaviors, rather than other unaccounted factors in the training process or model-specific quirks.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Spurious rewards in RLVR can produce large gains in mathematical reasoning for certain language models via GRPO's clipping bias amplifying pretraining behaviors like code reasoning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Reinforcement learning with verifiable rewards improves math performance in some models even when rewards are random or spurious.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a1313d28b42a591a874dbe65e2af1e7ae3d9a574f6d16be6d26ee8f7585da365"},"source":{"id":"2506.10947","kind":"arxiv","version":2},"verdict":{"id":"26070616-b572-4562-9a0b-ed7124ac98f8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T13:33:45.992784Z","strongest_claim":"RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards.","one_line_summary":"Spurious rewards in RLVR can produce large gains in mathematical reasoning for certain language models via GRPO's clipping bias amplifying pretraining behaviors like code reasoning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that the performance gains with spurious rewards are primarily driven by the clipping bias in GRPO amplifying specific pretraining behaviors, rather than other unaccounted factors in the training process or model-specific quirks.","pith_extraction_headline":"Reinforcement learning with verifiable rewards improves math performance in some models even when rewards are random or spurious."},"references":{"count":21,"sample":[{"doi":"10.1038/s41586-025-09422-z","year":2025,"title":"Nature645(8081), 633–638 (2025) https://doi.org/10.1038/s41586-025-09422-z","work_id":"9835b482-5032-4135-93dd-82a066677569","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"2 OLMo 2 Furious","work_id":"9ef0dc2b-fdfe-4f14-b235-ef7556dc709a","ref_index":2,"cited_arxiv_id":"2501.00656","is_internal_anchor":true},{"doi":"","year":2022,"title":"CoRR , volume =","work_id":"471b6786-7627-4021-a6b3-7f5cd8c83643","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.18653/v1/2025.acl-industry","year":2025,"title":"ISBN 9 Coverage, Not Averages Semantic Stratification for Trustworthy Retrieval Evaluation 979-8-89176-288-6","work_id":"74d89fb3-6798-4148-af6b-9764a4f36db8","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning","work_id":"b1f42628-30b7-4593-80ff-813523035c23","ref_index":5,"cited_arxiv_id":"2502.14768","is_internal_anchor":true}],"resolved_work":21,"snapshot_sha256":"35f2c6e607b49f7b17ab9b655c305739f3402da7da0cdcf67275a3771ce4460e","internal_anchors":3},"formal_canon":{"evidence_count":1,"snapshot_sha256":"f473b6a76c0669e9d4301fc756ee7bd8f3aa0d2e5b7e7562c63260df054b38eb"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2506.10947","created_at":"2026-05-17T23:38:47.721227+00:00"},{"alias_kind":"arxiv_version","alias_value":"2506.10947v2","created_at":"2026-05-17T23:38:47.721227+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2506.10947","created_at":"2026-05-17T23:38:47.721227+00:00"},{"alias_kind":"pith_short_12","alias_value":"MMV3S43NX2NZ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"MMV3S43NX2NZS5FK","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"MMV3S43N","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":30,"internal_anchor_count":30,"sample":[{"citing_arxiv_id":"2505.14412","citing_title":"PRL: Prompts from Reinforcement Learning","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12058","citing_title":"Holder Policy Optimisation","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2510.18814","citing_title":"A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2512.00778","citing_title":"What Is Preference Optimization Doing, and Why?","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20066","citing_title":"Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20164","citing_title":"Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2508.19652","citing_title":"Self-Rewarding Vision-Language Model via Reasoning Decomposition","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2509.18052","citing_title":"The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2510.18814","citing_title":"A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2511.14045","citing_title":"Auditing Data Membership in Reinforcement Learning With Verifiable Rewards","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2511.23473","citing_title":"ThetaEvolve: Test-time Learning on Open Problems","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14445","citing_title":"FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03472","citing_title":"Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12058","citing_title":"Holder Policy Optimisation","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11813","citing_title":"Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12474","citing_title":"Reward Hacking in Rubric-Based Reinforcement Learning","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02427","citing_title":"The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23747","citing_title":"SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04542","citing_title":"Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation","ref_index":168,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00380","citing_title":"ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2507.01006","citing_title":"GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07244","citing_title":"Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models","ref_index":82,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00380","citing_title":"ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02913","citing_title":"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning","ref_index":97,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16259","citing_title":"Beyond Distribution Sharpening: The Importance of Task Rewards","ref_index":35,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/MMV3S43NX2NZS5FKPHO33KGNPX","json":"https://pith.science/pith/MMV3S43NX2NZS5FKPHO33KGNPX.json","graph_json":"https://pith.science/api/pith-number/MMV3S43NX2NZS5FKPHO33KGNPX/graph.json","events_json":"https://pith.science/api/pith-number/MMV3S43NX2NZS5FKPHO33KGNPX/events.json","paper":"https://pith.science/paper/MMV3S43N"},"agent_actions":{"view_html":"https://pith.science/pith/MMV3S43NX2NZS5FKPHO33KGNPX","download_json":"https://pith.science/pith/MMV3S43NX2NZS5FKPHO33KGNPX.json","view_paper":"https://pith.science/paper/MMV3S43N","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2506.10947&json=true","fetch_graph":"https://pith.science/api/pith-number/MMV3S43NX2NZS5FKPHO33KGNPX/graph.json","fetch_events":"https://pith.science/api/pith-number/MMV3S43NX2NZS5FKPHO33KGNPX/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/MMV3S43NX2NZS5FKPHO33KGNPX/action/timestamp_anchor","attest_storage":"https://pith.science/pith/MMV3S43NX2NZS5FKPHO33KGNPX/action/storage_attestation","attest_author":"https://pith.science/pith/MMV3S43NX2NZS5FKPHO33KGNPX/action/author_attestation","sign_citation":"https://pith.science/pith/MMV3S43NX2NZS5FKPHO33KGNPX/action/citation_signature","submit_replication":"https://pith.science/pith/MMV3S43NX2NZS5FKPHO33KGNPX/action/replication_record"}},"created_at":"2026-05-17T23:38:47.721227+00:00","updated_at":"2026-05-17T23:38:47.721227+00:00"}