{"work":{"id":"d854765a-e664-41c0-8655-21c4bf2e0cc4","openalex_id":null,"doi":null,"arxiv_id":"2504.13837","raw_key":null,"title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","authors":null,"authors_text":"Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue","year":2025,"venue":"cs.AI","abstract":"Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. 
This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.","external_url":"https://arxiv.org/abs/2504.13837","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T08:15:33.575795+00:00","pith_arxiv_id":"2504.13837","created_at":"2026-05-09T05:45:23.140091+00:00","updated_at":"2026-05-25T08:15:33.575795+00:00","title_quality_ok":true,"display_title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","render_title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?"},"hub":{"state":{"work_id":"d854765a-e664-41c0-8655-21c4bf2e0cc4","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":117,"external_cited_by_count":null,"distinct_field_count":11,"first_pith_cited_at":"2025-04-29T09:24:30+00:00","last_pith_cited_at":"2026-05-21T17:59:26+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-05T17:49:21.340067+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":22},{"context_role":"dataset","n":1},{"context_role":"method","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":19},{"context_polarity":"unclear","n":3},{"context_polarity":"support","n":1},{"context_polarity":"use_dataset","n":1},{"context_polarity":"use_method","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","claims":[{"claim_text":"Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning 
performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across va","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"display behaviors, including multi-step reasoning, self-verification, and self-correction, that resemble human-like problem-solving [21, 11, 26]. However, recent works argue that RL with simple, verifiable rewards (RLVR) primarily performs distributional sharpening [65, 14]. It selectively amplifies behaviors already present in the base model rather than inducing genuinely new capabilities [60]. In some cases, RLVR can even degrade performance under pass@k evaluation absent intervention [7, 52","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Agentic reasoning itself is also susceptible to overthinking problems. Purely fast models may overlook critical reasoning steps, while slow models often suffer from excessive latency or overthinking behaviors, such as unnecessarily long chains of thought. Emerging approaches seek hybrid strategies [195] that combine the efficiency of fast reasoning with the rigor of slow reasoning [196, 197, 198, 199]. 
For instance, adaptive test-time scaling allows a model to decide whether to respond quickly or","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"1 Introduction Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe for reasoning models [8, 20, 31], but pure on-policy exploration faces two structural limits: early training suffers from sparse correct trajectories, whereas later training often converges to a plateau after the rollout distribution narrows [33, 36]. A natural response is to enrich the learning signal by mixing in auxiliary trajectories from other sources, moving from pure on-policy updat","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"On-policy training is defined by the student's current rollout distribution, which evolves at every gradient step, making the teacher appear indispensable. However, recent empirical studies suggest that RL-trained models remain surprisingly close to their SFT initialization: reasoning trajectories in RL models are largely a reweighted subset of those present in the SFT model [19], and on-policy updates are inherently biased toward solutions that minimize KL divergence from the reference policy [","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"[56] Yongcheng Zeng, Zexu Sun, Bokai Ji, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Haifeng Zhang, Xu Chen, and Jun Wang. CurES: From gradient analysis to efficient curriculum learning for reasoning LLMs. InProceedings of the Fourteenth International Conference on Learning Representations (ICLR), 2026. URLhttps://arxiv.org/abs/2510.01037. [57] Ruiqi Zhang, Daman Arora, Song Mei, and Andrea Zanette. SPEED-RL: Faster training of reasoning models via online curriculum learning. 
InICML 2025 ","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"3.14 Evaluation on Language Capability To evaluate the language capabilities of InternVL3.5, we use benchmarks covering comprehensive assessments in general knowledge (MMLU [44], CMMLU [61], C-Eval [49], GAOKAO-Bench [177]), linguistic understanding (TriviaQA [52], NaturalQuestions [56], C3 [115], RACE [57]), reasoning (WinoGrande [107], HellaSwag [172], BigBench Hard [ 117]), mathematics (GSM8K-Test [ 18], MATH [ 45], AIME24 [ 84], AIME25 [ 85]), and 20 Model Text2SVG Img2SVG FID ↓ FID-C ↓ CLIP","claim_type":"dataset","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (22 contexts).","role_counts":[{"n":22,"context_role":"background"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-20T19:22:13.920386+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"109aa4fa-cb17-4c43-9c53-c7fc8ad43920","orcid":null,"display_name":"Yang Yue"},{"id":"edd2c48b-46ba-4d3c-8f76-34bc9760e809","orcid":null,"display_name":"Zhiqi Chen"},{"id":"e22bf7d1-9f52-40f9-ba8d-b365a2681336","orcid":null,"display_name":"Rui Lu"},{"id":"0deade05-6850-47b9-b563-7e2dd5466083","orcid":null,"display_name":"Andrew Zhao"},{"id":"118858de-9f1e-4c6d-b0a4-302371ca7dd7","orcid":null,"display_name":"Zhaokai Wang"},{"id":"109aa4fa-cb17-4c43-9c53-c7fc8ad43920","orcid":null,"display_name":"Yang 
Yue"}]},"error":null,"updated_at":"2026-05-20T19:22:14.299304+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T11:39:48.829867+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":42},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":33},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":31},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":28},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":24},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":18},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":17},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":17},{"title":"Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning","work_id":"e5e936f3-0cff-4732-b394-f607d7a63f5f","shared_citers":12},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":12},{"title":"The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models","work_id":"d4b4aee4-d20f-4572-886a-4ba9ea6c9b81","shared_citers":12},{"title":"Understanding R1-Zero-Like Training: A Critical 
Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":12},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":11},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":11},{"title":"Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement","work_id":"a097c5d4-6d32-46ee-9826-57d532bbfc9c","shared_citers":10},{"title":"SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild","work_id":"94a68437-02e7-425a-91b2-5846ddcbd38c","shared_citers":10},{"title":"Tulu 3: Pushing Frontiers in Open Language Model Post-Training","work_id":"28c9dbea-056a-48c2-8000-85f809827e45","shared_citers":10},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":9},{"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","shared_citers":9},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":9},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":8},{"title":"Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model","work_id":"763e0e44-40dd-4bdd-8414-21f8f9ce6d10","shared_citers":8},{"title":"VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks","work_id":"c2351652-65f7-47cd-ae80-dbcd72a6eb20","shared_citers":8},{"title":"HybridFlow: A Flexible and Efficient RLHF 
Framework","work_id":"7eb9c9f4-b322-4bba-8011-09ff8d6ad801","shared_citers":7}],"time_series":[{"n":2,"year":2025},{"n":53,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T11:49:54.793196+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T11:39:55.422212+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","claims":[{"claim_text":"Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across va","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"display behaviors, including multi-step reasoning, self-verification, and self-correction, that resemble human-like problem-solving [21, 11, 26]. However, recent works argue that RL with simple, verifiable rewards (RLVR) primarily performs distributional sharpening [ 65, 14]. It selectively amplifies behaviors already present in the base model rather than inducing genuinely new capabilities [ 60]. 
In some cases, RLVR can even degrade performance under pass@k evaluation absent intervention [7, 52","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Agentic reasoning itself is also susceptible to overthinking problems. Purely fast models may overlook critical reasoning steps, while slow models often suffer from excessive latency or overthinking behaviors, such as unnecessarily long chains of thought. Emerging approaches seek hybrid strategies [195] that combine the efficiency of fast reasoning with the rigor of slow reasoning [196, 197, 198, 199]. For instance, adaptive test-time scaling allows a model to decide whether to respond quickly or","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"1 Introduction Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe for reasoning models [8, 20, 31], but pure on-policy exploration faces two structural limits: early training suffers from sparse correct trajectories, whereas later training often converges to a plateau after the rollout distribution narrows [33, 36]. A natural response is to enrich the learning signal by mixing in auxiliary trajectories from other sources, moving from pure on-policy updat","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"On-policy training is defined by the student's current rollout distribution, which evolves at every gradient step, making the teacher appear indispensable. 
However, recent empirical studies suggest that RL-trained models remain surprisingly close to their SFT initialization: reasoning trajectories in RL models are largely a reweighted subset of those present in the SFT model [19], and on-policy updates are inherently biased toward solutions that minimize KL divergence from the reference policy [","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"[56] Yongcheng Zeng, Zexu Sun, Bokai Ji, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Haifeng Zhang, Xu Chen, and Jun Wang. CurES: From gradient analysis to efficient curriculum learning for reasoning LLMs. InProceedings of the Fourteenth International Conference on Learning Representations (ICLR), 2026. URLhttps://arxiv.org/abs/2510.01037. [57] Ruiqi Zhang, Daman Arora, Song Mei, and Andrea Zanette. SPEED-RL: Faster training of reasoning models via online curriculum learning. InICML 2025 ","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"3.14 Evaluation on Language Capability To evaluate the language capabilities of InternVL3.5, we use benchmarks covering comprehensive assessments in general knowledge (MMLU [44], CMMLU [61], C-Eval [49], GAOKAO-Bench [177]), linguistic understanding (TriviaQA [52], NaturalQuestions [56], C3 [115], RACE [57]), reasoning (WinoGrande [107], HellaSwag [172], BigBench Hard [ 117]), mathematics (GSM8K-Test [ 18], MATH [ 45], AIME24 [ 84], AIME25 [ 85]), and 20 Model Text2SVG Img2SVG FID ↓ FID-C ↓ CLIP","claim_type":"dataset","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (22 contexts).","role_counts":[{"n":22,"context_role":"background"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-20T19:22:13.923002+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","claims":[{"claim_text":"Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across va","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T11:49:56.764965+00:00"}},"summary":{"title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","claims":[{"claim_text":"Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. 
In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across va","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":42},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":33},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":31},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":28},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":24},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":18},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":17},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":17},{"title":"Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning","work_id":"e5e936f3-0cff-4732-b394-f607d7a63f5f","shared_citers":12},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":12},{"title":"The Entropy Mechanism of Reinforcement Learning for Reasoning Language 
Models","work_id":"d4b4aee4-d20f-4572-886a-4ba9ea6c9b81","shared_citers":12},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":12},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":11},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":11},{"title":"Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement","work_id":"a097c5d4-6d32-46ee-9826-57d532bbfc9c","shared_citers":10},{"title":"SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild","work_id":"94a68437-02e7-425a-91b2-5846ddcbd38c","shared_citers":10},{"title":"Tulu 3: Pushing Frontiers in Open Language Model Post-Training","work_id":"28c9dbea-056a-48c2-8000-85f809827e45","shared_citers":10},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":9},{"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","shared_citers":9},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":9},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":8},{"title":"Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model","work_id":"763e0e44-40dd-4bdd-8414-21f8f9ce6d10","shared_citers":8},{"title":"VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks","work_id":"c2351652-65f7-47cd-ae80-dbcd72a6eb20","shared_citers":8},{"title":"HybridFlow: A Flexible and Efficient RLHF 
Framework","work_id":"7eb9c9f4-b322-4bba-8011-09ff8d6ad801","shared_citers":7}],"time_series":[{"n":2,"year":2025},{"n":53,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"0deade05-6850-47b9-b563-7e2dd5466083","orcid":null,"display_name":"Andrew Zhao","source":"manual","import_confidence":0.72},{"id":"e22bf7d1-9f52-40f9-ba8d-b365a2681336","orcid":null,"display_name":"Rui Lu","source":"manual","import_confidence":0.72},{"id":"109aa4fa-cb17-4c43-9c53-c7fc8ad43920","orcid":null,"display_name":"Yang Yue","source":"manual","import_confidence":0.72},{"id":"118858de-9f1e-4c6d-b0a4-302371ca7dd7","orcid":null,"display_name":"Zhaokai Wang","source":"manual","import_confidence":0.72},{"id":"edd2c48b-46ba-4d3c-8f76-34bc9760e809","orcid":null,"display_name":"Zhiqi Chen","source":"manual","import_confidence":0.72}]}}