{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:LWH4ADZ4MWN3KIIC5FXC6ZKBCM","short_pith_number":"pith:LWH4ADZ4","schema_version":"1.0","canonical_sha256":"5d8fc00f3c659bb52102e96e2f6541132aad902fe978735a8188def52dde9f83","source":{"kind":"arxiv","id":"2509.08827","version":3},"attestation_state":"computed","paper":{"title":"A Survey of Reinforcement Learning for Large Reasoning Models","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Bingxiang He, Biqing Qi, Bowen Zhou, Che Jiang, Dong Li, Ermo Hua, Fangfu Liu, Ganqu Cui, Guoli Jia, Haozhan Li, Huayu Chen, Jiaze Ma, Junqi Gao, Kai Tian, Kaiyan Zhang, Ning Ding, Pengfei Li, Runze Liu, Shang Qu, Shijie Wang, Sihang Zeng, Weize Chen, Xiang Xu, Xiaoye Qu, Xingtai Lv, Xinwei Long, Xuekai Zhu, Yafu Li, Yihao Liu, Youbang Sun, Yuchen Fan, Yuchen Zhang, Yu Fu, Yuru Wang, Yuxin Zuo, Zhenzhao Yuan, Zhiyuan Liu, Zhiyuan Ma, Zonglin Li","submitted_at":"2025-09-10T17:59:43Z","abstract_excerpt":"In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is ti"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":false},"canonical_record":{"source":{"id":"2509.08827","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"cs.CL","submitted_at":"2025-09-10T17:59:43Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"c7c07720025a1b339b5e8562879eacd8dfe59a7c8fdbd461faa42c4b2f10849e","abstract_canon_sha256":"747d44ea4d84f76f8029daeb4f0509180301360c427349ac35b30c4ce34b38b4"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:57:53.182270Z","signature_b64":"UOk/R+SApFgpNh4Ue7gIN3RBLznj/f5Um//2Qz/BWILDuAU8zy6Wyd0nTsiMFg9OzatAxpAe0l/3DymY4RlMCg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"5d8fc00f3c659bb52102e96e2f6541132aad902fe978735a8188def52dde9f83","last_reissued_at":"2026-05-17T23:57:53.181737Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:57:53.181737Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"A Survey of Reinforcement Learning for Large Reasoning Models","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Bingxiang He, Biqing Qi, Bowen Zhou, Che Jiang, Dong Li, Ermo Hua, Fangfu Liu, Ganqu Cui, Guoli Jia, Haozhan Li, Huayu Chen, Jiaze Ma, Junqi Gao, Kai Tian, Kaiyan Zhang, Ning Ding, Pengfei Li, Runze Liu, Shang Qu, Shijie Wang, Sihang Zeng, Weize Chen, Xiang Xu, Xiaoye Qu, Xingtai Lv, Xinwei Long, Xuekai Zhu, Yafu Li, Yihao Liu, Youbang Sun, Yuchen Fan, Yuchen Zhang, Yu Fu, Yuru Wang, Yuxin Zuo, Zhenzhao Yuan, Zhiyuan Liu, Zhiyuan Ma, Zonglin Li","submitted_at":"2025-09-10T17:59:43Z","abstract_excerpt":"In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is ti"},"claims":{"count":0,"items":[],"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"source":{"id":"2509.08827","kind":"arxiv","version":3},"verdict":{"id":null,"model_set":{},"created_at":null,"strongest_claim":"","one_line_summary":"","pipeline_version":null,"weakest_assumption":"","pith_extraction_headline":""},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2509.08827","created_at":"2026-05-17T23:57:53.181821+00:00"},{"alias_kind":"arxiv_version","alias_value":"2509.08827v3","created_at":"2026-05-17T23:57:53.181821+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2509.08827","created_at":"2026-05-17T23:57:53.181821+00:00"},{"alias_kind":"pith_short_12","alias_value":"LWH4ADZ4MWN3","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"LWH4ADZ4MWN3KIIC","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"LWH4ADZ4","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":21,"internal_anchor_count":21,"sample":[{"citing_arxiv_id":"2509.25758","citing_title":"Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2510.10150","citing_title":"Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2510.17881","citing_title":"POPI: Personalizing LLMs via Optimized Natural Language Preference Inference","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2601.12538","citing_title":"Agentic Reasoning for Large Language Models","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2512.07461","citing_title":"Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2601.18832","citing_title":"The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2602.15620","citing_title":"STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2603.12554","citing_title":"Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11775","citing_title":"Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control","ref_index":91,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11775","citing_title":"Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10937","citing_title":"Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11734","citing_title":"SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09640","citing_title":"Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24396","citing_title":"Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06326","citing_title":"Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22192","citing_title":"CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11734","citing_title":"SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08527","citing_title":"Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08905","citing_title":"StaRPO: Stability-Augmented Reinforcement Policy Optimization","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07941","citing_title":"Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02913","citing_title":"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning","ref_index":174,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/LWH4ADZ4MWN3KIIC5FXC6ZKBCM","json":"https://pith.science/pith/LWH4ADZ4MWN3KIIC5FXC6ZKBCM.json","graph_json":"https://pith.science/api/pith-number/LWH4ADZ4MWN3KIIC5FXC6ZKBCM/graph.json","events_json":"https://pith.science/api/pith-number/LWH4ADZ4MWN3KIIC5FXC6ZKBCM/events.json","paper":"https://pith.science/paper/LWH4ADZ4"},"agent_actions":{"view_html":"https://pith.science/pith/LWH4ADZ4MWN3KIIC5FXC6ZKBCM","download_json":"https://pith.science/pith/LWH4ADZ4MWN3KIIC5FXC6ZKBCM.json","view_paper":"https://pith.science/paper/LWH4ADZ4","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2509.08827&json=true","fetch_graph":"https://pith.science/api/pith-number/LWH4ADZ4MWN3KIIC5FXC6ZKBCM/graph.json","fetch_events":"https://pith.science/api/pith-number/LWH4ADZ4MWN3KIIC5FXC6ZKBCM/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/LWH4ADZ4MWN3KIIC5FXC6ZKBCM/action/timestamp_anchor","attest_storage":"https://pith.science/pith/LWH4ADZ4MWN3KIIC5FXC6ZKBCM/action/storage_attestation","attest_author":"https://pith.science/pith/LWH4ADZ4MWN3KIIC5FXC6ZKBCM/action/author_attestation","sign_citation":"https://pith.science/pith/LWH4ADZ4MWN3KIIC5FXC6ZKBCM/action/citation_signature","submit_replication":"https://pith.science/pith/LWH4ADZ4MWN3KIIC5FXC6ZKBCM/action/replication_record"}},"created_at":"2026-05-17T23:57:53.181821+00:00","updated_at":"2026-05-17T23:57:53.181821+00:00"}