{"work":{"id":"4ebcdbe2-5000-4f58-a7e0-aa9ae381b684","openalex_id":null,"doi":null,"arxiv_id":"2504.14945","raw_key":null,"title":"Learning to Reason under Off-Policy Guidance","authors":null,"authors_text":"Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu","year":2025,"venue":"cs.LG","abstract":"Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards~(\\textit{RLVR}). However, existing \\textit{RLVR} approaches are inherently ``on-policy'', limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce \\textbf{LUFFY} (\\textbf{L}earning to reason \\textbf{U}nder o\\textbf{FF}-polic\\textbf{Y} guidance), a framework that augments \\textit{RLVR} with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, alongside policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an over \\textbf{+6.4} average gain across six math benchmarks and an advantage of over \\textbf{+6.2} points in out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. \nThese results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.","external_url":"https://arxiv.org/abs/2504.14945","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-21T23:40:46.435333+00:00","pith_arxiv_id":"2504.14945","created_at":"2026-05-10T00:54:48.628108+00:00","updated_at":"2026-05-21T23:40:46.435333+00:00","title_quality_ok":true,"display_title":"Learning to Reason under Off-Policy Guidance","render_title":"Learning to Reason under Off-Policy Guidance"},"hub":{"state":{"work_id":"4ebcdbe2-5000-4f58-a7e0-aa9ae381b684","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":34,"external_cited_by_count":null,"distinct_field_count":5,"first_pith_cited_at":"2025-07-02T13:04:09+00:00","last_pith_cited_at":"2026-05-20T17:53:09+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-25T03:45:22.993013+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":7},{"context_role":"baseline","n":1}],"polarity_counts":[{"context_polarity":"background","n":7},{"context_polarity":"baseline","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}