AIPO: Learning to Reason from Active Interaction

Gholamreza Haffari; Junnan Liu; Linhao Luo; Thuy-Trang Vu

arxiv: 2605.08401 · v2 · pith:GXYV33CPnew · submitted 2026-05-08 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-19 18:02 UTC · model grok-4.3


keywords LLM reasoning · reinforcement learning · multi-agent systems · active interaction · capability expansion · RLVR · importance sampling


The pith

AIPO enables language models to expand their reasoning boundaries by actively consulting specialized agents at training bottlenecks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes AIPO, a reinforcement learning method that lets the main model reach out to three helper agents for specific help when it hits a reasoning bottleneck. The helpers provide targeted feedback on verification, knowledge, or reasoning steps rather than full solution paths. The claim is that this active interaction during training lets the model handle harder problems on its own afterward. A custom importance sampling and clipping technique is used to learn from this off-policy feedback. The reported result is better performance on math, science, and code reasoning benchmarks without needing the agents at test time.

Core claim

AIPO is an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. The policy model proactively consults Verify Agent, Knowledge Agent, and Reasoning Agent when encountering reasoning bottlenecks to receive fine-grained and targeted guidance, thereby actively expanding its capability boundary during training. A tailored importance sampling coefficient together with a clipping strategy mitigates off-policy bias and gradient vanishing issues.

What carries the argument

The proactive consultation of three collaborative agents (Verify, Knowledge, and Reasoning) triggered at reasoning bottlenecks, combined with importance sampling and clipping for stable learning from their feedback.
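The consultation loop described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `policy_step`, the agent dictionary keys, and the `Rollout` container are hypothetical names, and the condition under which the policy decides to consult an agent is not specified in this summary. The sketch only shows how agent feedback could be spliced into a trajectory while tagging each token as internal (on-policy) or external (off-policy).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# an agent maps the current context to a feedback string;
# a policy step returns (generated segment, name of agent to consult or None).
Agent = Callable[[str], str]
PolicyStep = Callable[[str], Tuple[str, Optional[str]]]


@dataclass
class Rollout:
    tokens: List[str] = field(default_factory=list)
    is_external: List[bool] = field(default_factory=list)  # True = agent-provided

    def extend(self, text: str, external: bool) -> None:
        # Whitespace splitting stands in for a real tokenizer.
        for tok in text.split():
            self.tokens.append(tok)
            self.is_external.append(external)


def rollout_with_agents(
    policy_step: PolicyStep,
    agents: Dict[str, Agent],
    prompt: str,
    max_turns: int = 8,
) -> Rollout:
    """Interleave policy segments with agent feedback until no agent is requested."""
    rollout = Rollout()
    context = prompt
    for _ in range(max_turns):
        segment, request = policy_step(context)
        rollout.extend(segment, external=False)
        context = f"{context} {segment}"
        if request is None:  # the policy finished without asking for help
            break
        feedback = agents[request](context)
        rollout.extend(feedback, external=True)
        context = f"{context} {feedback}"
    return rollout
```

The `is_external` mask is the piece a training loop would need in order to apply a different importance weight to agent-provided tokens than to the policy's own tokens.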

If this is right

Reasoning performance improves consistently on benchmarks such as AIME, MATH500, GPQA-Diamond, and LiveCodeBench.
The approach generalizes across different policy models and existing RLVR algorithms.
The trained policy model can perform reasoning independently without the collaborative agents after training.
Exploration during training expands beyond the initial capability boundary of the policy model.
Guidance becomes more sample-efficient and information-dense compared to complete trajectory-level expert demonstrations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such active interaction frameworks might reduce reliance on large static datasets of expert solutions for training advanced reasoners.
The idea of dynamic agent consultation could extend to training models for multi-step planning or scientific discovery tasks.
Models might develop internal signals for when to seek help, leading to more self-directed learning systems.

Load-bearing premise

The importance sampling coefficient and clipping strategy successfully address off-policy bias and gradient vanishing so that the model genuinely expands its capabilities instead of just imitating the agents.
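To make the gradient vanishing concern concrete, the sketch below uses the standard PPO clipped surrogate, not AIPO's tailored coefficient, whose exact form is not given in this summary. It is one plausible reading of the problem: once the importance ratio leaves the clip range in the direction the advantage favors, the surrogate becomes flat in the ratio and the token contributes no gradient. The function names and the finite-difference slope are illustrative assumptions.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO per-token objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def surrogate_slope(ratio: float, advantage: float, eps: float = 0.2, h: float = 1e-6) -> float:
    """Central finite difference of the surrogate with respect to the ratio."""
    upper = clipped_surrogate(ratio + h, advantage, eps)
    lower = clipped_surrogate(ratio - h, advantage, eps)
    return (upper - lower) / (2.0 * h)


if __name__ == "__main__":
    # Inside the clip range the slope equals the advantage.
    print(surrogate_slope(1.0, advantage=1.0))
    # Far above the range with a positive advantage the slope is zero:
    # the token no longer influences the update.
    print(surrogate_slope(1.5, advantage=1.0))
```

Off-policy tokens supplied by an external agent are the ones most likely to have ratios far from 1, which is presumably why a modified coefficient and clipping rule are needed for them.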

What would settle it

If a model trained with AIPO, tested without any agents on new reasoning problems, showed no improvement over a standard RLVR baseline, that would indicate the boundary expansion did not occur.
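One way such a comparison could be scored is with pass@k, a common proxy for a model's capability boundary. The sketch below uses the widely used unbiased pass@k estimator from the code-generation evaluation literature; whether the paper uses this exact estimator is an assumption, and `boundary_gap` and its inputs are hypothetical names introduced for illustration.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def boundary_gap(
    aipo_correct: Sequence[int],
    baseline_correct: Sequence[int],
    n: int,
    k: int,
) -> float:
    """Mean pass@k of the AIPO-trained model minus that of the baseline,
    where each sequence holds per-problem counts of correct generations."""
    if len(aipo_correct) != len(baseline_correct) or not aipo_correct:
        raise ValueError("need matching, non-empty per-problem counts")
    aipo = sum(pass_at_k(n, c, k) for c in aipo_correct) / len(aipo_correct)
    base = sum(pass_at_k(n, c, k) for c in baseline_correct) / len(baseline_correct)
    return aipo - base
```

A gap near zero at large k, with both models sampled without agents, would be the outcome that counts against the boundary-expansion claim; a real evaluation would also need confidence intervals over problems and seeds.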

Figures

Figures reproduced from arXiv: 2605.08401 by Gholamreza Haffari, Junnan Liu, Linhao Luo, Thuy-Trang Vu.

**Figure 2.** Illustration of AIPO. In the AIPO framework, during each rollout, the policy model engages in active interactions with collaborators. We then compute the reward and optimize the policy model using losses derived from both internal (on-policy) and external (off-policy) tokens. Additionally, we propose an amended importance sampling coefficient and clipping strategy to mitigate off-policy errors and the vani…

**Figure 3.** Ablation study of the collaborators in AIPO. Each bar indicates the average performance of all benchmarks in this domain. (Adjacent plot: Pass@n versus Training Step; series: Our, GRPO, LUFFY.)

**Figure 5.** Training dynamics of AIPO and baselines on Qwen2.5-7B-Instruct with the same model as collaborators. … initiated by the policy model per batch (Batch Interactions). Under AIPO, the interaction frequency initially rises, then declines, and eventually stabilizes. This pattern suggests that the policy model queries external collaborators frequently in the early stages of training because of its limited initial …

Original abstract

Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose $\textbf{AIPO}$, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, $\textit{Verify Agent}$, $\textit{Knowledge Agent}$, and $\textit{Reasoning Agent}$, when encountering reasoning bottlenecks, thereby receiving fine-grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off-policy bias and gradient vanishing issues that arise when learning from agent-provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA-Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AIPO, an enhanced RLVR framework in which the policy model proactively consults three specialized agents (Verify Agent, Knowledge Agent, and Reasoning Agent) upon encountering reasoning bottlenecks during exploration. A custom importance sampling coefficient combined with clipping is introduced to address off-policy bias and gradient vanishing arising from agent-generated feedback. After training, the policy reasons independently without the agents. Experiments on AIME, MATH500, GPQA-Diamond, and LiveCodeBench report consistent gains and generalization across base models and RLVR backbones.

Significance. If the off-policy correction is shown to be effective, AIPO would offer a dynamic, fine-grained alternative to static expert trajectories for expanding exploration boundaries in RLVR, potentially improving sample efficiency and post-training independence in LLM reasoning systems.

major comments (2)

[§3.2] §3.2 (Importance Sampling Coefficient): The manuscript introduces a tailored importance sampling coefficient and clipping to mitigate off-policy bias when the policy learns from agent-provided feedback, yet provides neither a derivation showing that the coefficient correctly reweights advantages to the current policy distribution nor empirical diagnostics (e.g., effective sample size or KL divergence between agent and policy trajectories). Without this, the central claim that observed gains reflect genuine capability expansion rather than imitation of the helpers does not follow.
[§5] §5 (Experiments): The reported improvements and cross-model generalization are presented without ablations that isolate the contribution of the importance sampling/clipping strategy (e.g., performance when the coefficient is replaced by standard PPO importance sampling). This omission leaves open whether the benchmark gains are driven by the active interaction mechanism or by the bias-correction component itself.
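The diagnostics named in the first major comment can be computed from per-token log-probabilities. The sketch below is a generic illustration rather than anything taken from the manuscript: it assumes log-probabilities are available under both the current policy and the distribution that produced the agent tokens, which may not hold when agents are queried as black boxes. The function names are hypothetical.

```python
import math
from typing import Sequence


def effective_sample_size(log_weights: Sequence[float]) -> float:
    """Kish effective sample size (sum w)^2 / sum(w^2) for importance weights
    given in log space, e.g. log pi_theta(token) - log mu(token)."""
    if not log_weights:
        raise ValueError("need at least one weight")
    shift = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def mc_kl_estimate(logp: Sequence[float], logq: Sequence[float]) -> float:
    """Monte Carlo estimate of KL(p || q) from samples drawn from p,
    given each sample's log-probability under p and under q."""
    if len(logp) != len(logq) or not logp:
        raise ValueError("need matching, non-empty log-probability lists")
    return sum(lp - lq for lp, lq in zip(logp, logq)) / len(logp)
```

An effective sample size close to the number of agent tokens would suggest the reweighting is well behaved; a value near 1 would mean a handful of tokens dominate the off-policy term.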

minor comments (2)

[Abstract] The abstract states performance gains but supplies no numerical deltas, standard deviations, or baseline comparisons, which would allow readers to assess effect size immediately.
[Figure 2] Figure 2 or the interaction protocol description would benefit from explicit pseudocode showing the exact conditions under which each agent is consulted and how their outputs are incorporated into the trajectory.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our presentation of the AIPO framework, particularly regarding the importance sampling component and experimental validation. Below, we address each major comment point by point.


Referee: [§3.2] §3.2 (Importance Sampling Coefficient): The manuscript introduces a tailored importance sampling coefficient and clipping to mitigate off-policy bias when the policy learns from agent-provided feedback, yet provides neither a derivation showing that the coefficient correctly reweights advantages to the current policy distribution nor empirical diagnostics (e.g., effective sample size or KL divergence between agent and policy trajectories). Without this, the central claim that observed gains reflect genuine capability expansion rather than imitation of the helpers does not follow.

Authors: We agree that providing a formal derivation and supporting diagnostics would better substantiate the effectiveness of our custom importance sampling approach. In the revised manuscript, we will include a step-by-step derivation demonstrating how the tailored coefficient reweights the advantages under the current policy distribution, accounting for the agent-generated feedback. Furthermore, we will add empirical analyses including effective sample size calculations and KL divergence measurements between the agent trajectories and the policy's distribution to show that the correction mitigates bias effectively and that performance gains stem from expanded reasoning capabilities rather than mere imitation. revision: yes
Referee: [§5] §5 (Experiments): The reported improvements and cross-model generalization are presented without ablations that isolate the contribution of the importance sampling/clipping strategy (e.g., performance when the coefficient is replaced by standard PPO importance sampling). This omission leaves open whether the benchmark gains are driven by the active interaction mechanism or by the bias-correction component itself.

Authors: We acknowledge that isolating the impact of the importance sampling and clipping strategy through targeted ablations would provide clearer evidence of its contribution. In the revised version, we will include additional ablation studies comparing the full AIPO (with custom coefficient and clipping) against a variant that uses standard PPO importance sampling while retaining the active multi-agent interaction. This will help demonstrate whether the observed gains on benchmarks like AIME and MATH500 are attributable to the bias-correction mechanism or primarily to the agent consultation process itself. revision: yes

Circularity Check

0 steps flagged

No significant circularity in AIPO derivation chain


The paper proposes an extension to RLVR by introducing proactive multi-agent consultation (Verify, Knowledge, Reasoning Agents) during exploration plus a custom importance sampling coefficient with clipping to handle off-policy feedback. The central claim of genuine capability expansion that persists post-training is supported by empirical results on AIME, MATH500, GPQA-Diamond and LiveCodeBench across multiple base models and RLVR backbones, rather than reducing by construction to fitted inputs, self-citations, or renamed prior patterns. No load-bearing step equates the reported gains to quantities defined from the method's own equations or prior author work; the importance sampling is presented as a design choice whose effectiveness is validated externally via benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Review limited to abstract; no explicit free parameters, axioms, or invented entities are quantified, but the design implicitly relies on the agents providing useful guidance and the sampling fix working as intended.

free parameters (1)

importance sampling coefficient
Introduced to mitigate off-policy bias; value and tuning procedure not specified in abstract.

axioms (1)

domain assumption: Agent feedback can be incorporated via importance sampling without introducing uncorrectable bias or dependency after training.
Central to the claim that post-training independent reasoning succeeds.

invented entities (1)

Verify Agent, Knowledge Agent, Reasoning Agent · no independent evidence
purpose: Supply fine-grained, on-demand guidance during exploration bottlenecks.
Three new functional roles introduced to expand the policy's capability boundary.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

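$J'(\theta)=\mathbb{E}_{\pi_{\theta_{\mathrm{old}}}}\big[\sum_{\tau^{\iota}} \pi_\theta(\tau_t \mid \tau_{<t})\,\tilde{A}_t\big] + \mathbb{E}_{\pi_{\theta_{\mathrm{old}}}}\big[\sum_{\tau^{\epsilon}} \pi_\theta(\tau_t \mid \tau_{<t})\,\tilde{A}_t\big]$ … $\operatorname{clip}\big(\pi_\theta,\ \omega/\operatorname{sg}(\pi_\theta)\cdot\pi_\theta,\ \infty\big)\,\tilde{A}_t$ (Eqs. 5, 6, 8)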
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

AIPO enables the policy model to proactively consult three functional collaborative agents … after training the policy model performs reasoning independently

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 34 internal anchors

[1]

Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. InACL (1), pp. 12248–12267. Association for Computational Linguistics, 2024. 2

work page 2024
[2]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large language models.CoRR, abs/2108.07732, 2021. 1, 4.1, B.6

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Reflect, retry, reward: Self-improving llms via reinforcement learning.CoRR, abs/2505.24726, 2025

Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mo- zolevskyi, Muayad Ali, and Waseem AlShikh. Reflect, retry, reward: Self-improving llms via reinforcement learning.CoRR, abs/2505.24726, 2025. D.2

work page arXiv 2025
[4]

Introduction to techniques used in seed1.6

ByteDance Seed. Introduction to techniques used in seed1.6. https://seed.bytedance.com/ en/seed1_6, 2025. 5

work page 2025
[5]

Nudging the boundaries of LLM reasoning

Justin Chih-Yao Chen, Becky Xiangyu Peng, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Mohit Bansal, and Chien-Sheng Wu. Nudging the boundaries of LLM reasoning. CoRR, abs/2509.25666, 2025. 1

work page arXiv 2025
[6]

Beyond two-stage training: Cooperative sft and rl for llm reasoning.arXiv preprint arXiv:2509.06948, 2025

Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative SFT and RL for LLM reasoning.CoRR, abs/2509.06948, 2025. 1, 2, 5, D.1

work page internal anchor Pith review arXiv 2025
[7]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.CoRR, abs/2503.09567, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

arXiv preprint arXiv:2510.23595 , year=

Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Pat- wary, and Jiaxuan You. Multi-agent evolve: LLM self-improve through co-evolution.CoRR, abs/2510.23595, 2025. 1

work page arXiv 2025
[9]

Reasoning with Exploration: An Entropy Perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective.CoRR, abs/2506.14758, 2025. A

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit S. Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan- Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Process Reinforcement through Implicit Rewards

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards.CoRR, abs/2502.01456, 2025. 4.1 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Weight ensembling improves reasoning in language models, 2025

Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, and Aditi Raghunathan. Weight ensembling improves reasoning in language models.CoRR, abs/2504.10478, 2025. 1

work page arXiv 2025
[13]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Supervised reinforcement learning: From expert trajectories to step-wise reasoning.CoRR, abs/2510.25992, 2025

Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, and Chen-Yu Lee. Supervised reinforcement learning: From expert trajectories to step-wise reasoning.CoRR, abs/2510.25992, 2025. 1

work page arXiv 2025
[15]

Re-rest: Reflection-reinforced self-training for language agents

Zi-Yi Dou, Cheng-Fu Yang, Xueqing Wu, Kai-Wei Chang, and Nanyun Peng. Re-rest: Reflection-reinforced self-training for language agents. InEMNLP, pp. 15394–15411. As- sociation for Computational Linguistics, 2024. D.1

work page 2024
[16]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

arXiv preprint arXiv:2506.19767 , year=

Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. SRFT: A single-stage method with super- vised and reinforcement fine-tuning for reasoning.CoRR, abs/2506.19767, 2025. 1, 1, 2, 5, D.1

work page arXiv 2025
[18]

Flowreasoner: Reinforcing query-level meta-agents.arXiv preprint arXiv:2504.15257, 2025

Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, and Tianyu Pang. Flowreasoner: Reinforcing query-level meta-agents.CoRR, abs/2504.15257,

work page arXiv
[19]

Etash Kumar Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Ji...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Rewarding the unlikely: Lifting GRPO beyond distribution sharpening

Andre Wang He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting GRPO beyond distribution sharpening. InEMNLP, pp. 25548–25560. Association for Computational Linguistics, 2025. 1

work page 2025
[21]

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.CoRR, abs/2504.11456, 2025. 4.4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InNeurIPS Datasets and Benchmarks, 2021. 1, 4.1, B.6

work page 2021
[23]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling.CoRR, abs/2010.14701,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[24]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory F. Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.CoRR, abs/1712.00409, 2017. E

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu. REINFORCE++: A simple and efficient approach for aligning large language models. CoRR, abs/2501.03262, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.CoRR, abs/2503.24290, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Trans

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Trans. Inf. Syst., 43(2):42:1–42:55, 2025. D.2

work page 2025
[28]

Yichen Huang and Lin F. Yang. Gemini 2.5 pro capable of winning gold at IMO 2025.CoRR, abs/2507.15855, 2025. 1

work page arXiv 2025
[29]

O1 replication journey - part 2: Surpassing o1- preview through simple distillation, big progress or bitter lesson?CoRR, abs/2411.16489, 2024

Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey - part 2: Surpassing o1- preview through simple distillation, big progress or bitter lesson?CoRR, abs/2411.16489, 2024. 3.1

work page arXiv 2024
[30]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InICLR. OpenReview.net, 2025. 1, 4.1, B.6

work page 2025
[31]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search- r1: Training llms to reason and leverage search engines with reinforcement learning.CoRR, abs/2503.09516, 2025. 3.2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Kimi-Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D. Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M. Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal M. P. Behbahani, and Aleksandra Faust. Training language models to self-correct via reinforcement learning. InICLR....

work page 2025
[34]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InSOSP, pp. 611–626. ACM, 2023. 4.1, B.1

work page 2023
[35]

Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models

Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models. InICML. OpenReview.net, 2024. 2

work page 2024
[36]

Marft: Multi-agent reinforcement fine-tuning, 2025

Junwei Liao, Muning Wen, Jun Wang, and Weinan Zhang. MARFT: multi-agent reinforcement fine-tuning.CoRR, abs/2504.16129, 2025. 3.1

work page internal anchor Pith review arXiv 2025
[37]

Enhancing efficiency and exploration in reinforcement learning for llms

Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, and Huaiyu Wan. Enhancing efficiency and exploration in reinforcement learning for llms. In EMNLP, pp. 1451–1463. Association for Computational Linguistics, 2025. 1

work page 2025
[38]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InICLR. OpenReview.net, 2024. B.6

work page 2024
[39]

Interactive Learning for LLM Reasoning

Hehai Lin, Shilei Cao, Sudong Wang, Haotian Wu, Minzhi Li, Linyi Yang, Juepeng Zheng, and Chengwei Qin. Interactive learning for LLM reasoning.CoRR, abs/2509.26306, 2025. 1, 5, D.1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Are your llms capable of stable reasoning?arXiv preprint arXiv:2412.13147, 2024

Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, and Kai Chen. Are your llms capable of stable reasoning?CoRR, abs/2412.13147, 2024. 1, 4.1, B.6

work page arXiv 2024
[41]

Situatedthinker: Grounding LLM reasoning with real-world through situated thinking.CoRR, abs/2505.19300, 2025

Junnan Liu, Linhao Luo, Thuy-Trang Vu, and Gholamreza Haffari. Situatedthinker: Grounding LLM reasoning with real-world through situated thinking.CoRR, abs/2505.19300, 2025. 3.2

work page arXiv 2025
[42]

Learn to reason efficiently with adaptive length-based reward shaping.arXiv preprint arXiv:2505.15612, 2025

Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, and Junxian He. Learn to reason efficiently with adaptive length-based reward shaping.CoRR, abs/2505.15612, 2025. 4.1

work page arXiv 2025
[43]

Exploratory memory- augmented llm agent via hybrid on- and off-policy optimization.CoRR, abs/2602.23008, 2026

Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, and Yuqing Yang. Exploratory memory- augmented llm agent via hybrid on- and off-policy optimization.CoRR, abs/2602.23008, 2026. 1

work page arXiv 2026
[44]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.CoRR, abs/2503.20783,

work page internal anchor Pith review Pith/arXiv arXiv
[45]

Towards a unified view of large language model post-training.arXiv preprint arXiv:2509.04419,

Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, and Bowen Zhou. Towards a unified view of large language model post-training.CoRR, abs/2509.04419, 2025. 1, 2, 5, D.1 13

work page arXiv 2025
[46]

Learning what reinforcement learning can't: Interleaved online fine-tuning for hardest questions

Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, and Wentao Zhang. Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions.CoRR, abs/2506.07527,

work page arXiv
[47]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel J. Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling.CoRR, abs/2501.19393, 2025. 3.1

[48]

Learning to reason with llms

OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/, 2024. Accessed: 2024-09. 1, 2, 5

[49]

Introducing openai o3 and o4-mini

OpenAI. Introducing openai o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/, 2024. Accessed: 2024-12. 2

[50]

Gpt-5 and the new era of work

OpenAI. Gpt-5 and the new era of work. https://openai.com/index/gpt-5-new-era-of-work/, 2025. Accessed: 2025-08. 1, 5

[51]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

[52]

O1 replication journey: A strategic progress report - part 1

Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, and Pengfei Liu. O1 replication journey: A strategic progress report - part 1. CoRR, abs/2410.18982, 2024. 3.1

[53]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. CoRR, abs/2311.12022, 2023. 1, 4.1, B.6

[54]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. 1, 2, 3.2

[55]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300, 2024. 1, 1, 2, 4.1

[56]

Hybridflow: A flexible and efficient RLHF framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. In EuroSys, pp. 1279–1297. ACM, 2025. 4.1, B.1

[57]

Reflexion: language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In NeurIPS, 2023. 5, D.1

[58]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. CoRR, abs/2408.03314, 2024. 1

[59]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. CoRR, abs/2503.05592, 2025. 3.2

[60]

Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards

Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas Köpf. REASONING GYM: reasoning environments for reinforcement learning with verifiable rewards. CoRR, abs/2505.24760, 2025. 1, 4.1, B.6

[61]

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching. CoRR, abs/2505.04588, 2025. 3.1

[62]

Improving data efficiency for llm reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay

Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, and Huan Zhang. Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay. CoRR, abs/2506.05316, 2025. 5, D.1

[63]

Qwq-32b: Embracing the power of reinforcement learning

Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/. 5

[64]

Rema: Learning to meta-think for llms with multi-agent reinforcement learning

Ziyu Wan, Yunxiang Li, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, and Ying Wen. Rema: Learning to meta-think for llms with multi-agent reinforcement learning. CoRR, abs/2503.09501, 2025. 3.1

[65]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. CoRR, abs/2506.01939, 2025. A

[66]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. 3.1

[67]

Truthrl: Incentivizing truthful llms via reinforcement learning

Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, and Xin Luna Dong. Truthrl: Incentivizing truthful llms via reinforcement learning. CoRR, abs/2509.25760, 2025. D.2

[68]

Grok 4

xAI. Grok 4. https://x.ai/news/grok-4/, 2025. Accessed: 2025-07. 1

[69]

Kdrl: Post-training reasoning llms via unified knowledge distillation and reinforcement learning

Hongling Xu, Qi Zhu, Heyuan Deng, Jinpeng Li, Lu Hou, Yasheng Wang, Lifeng Shang, Ruifeng Xu, and Fei Mi. KDRL: post-training reasoning llms via unified knowledge distillation and reinforcement learning. CoRR, abs/2506.02208, 2025. 5

[70]

Comas: Co-evolving multi-agent systems via interaction rewards

Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, and Lei Bai. Comas: Co-evolving multi-agent systems via interaction rewards. CoRR, abs/2510.08529, 2025. 1

[71]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

[72]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

[73]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

[74]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? CoRR, abs/2504.13837, 2025. 1, 5

[75]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Cheng-Xiang Wang, Tiantian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, and Lin Yan. VAPO: efficient and reliab...

[76]

Incentivizing llms to self-verify their answers

Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, and Bo An. Incentivizing llms to self-verify their answers. CoRR, abs/2506.01369, 2025. D.2

[77]

On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting

Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting. CoRR, abs/2508.11408, 2025. 1, 2, 2, 5, D.1

[78]

Learning to reason under off-policy guidance

Yue Zhang, Yafu Li, Ganqu Cui, Yu Cheng, Zhi Wang, Xiaoye Qu, Jianhao Yan, and Zican Hu. Learning to reason under off-policy guidance. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 1, 2, 2, 3.2, 5, D.1

[79]

Echo chamber: Rl post-training amplifies behaviors learned in pretraining

Rosie Zhao, Alexandru Meterez, Sham M. Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: RL post-training amplifies behaviors learned in pretraining. CoRR, abs/2504.07912, 2025. 1, 5

[80]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. CoRR, abs/2601.18734, 2026. 1, 4.1, 5, D.1


[42] [42]

Learn to reason efficiently with adaptive length-based reward shaping.arXiv preprint arXiv:2505.15612, 2025

Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, and Junxian He. Learn to reason efficiently with adaptive length-based reward shaping.CoRR, abs/2505.15612, 2025. 4.1

work page arXiv 2025

[43] [43]

Exploratory memory- augmented llm agent via hybrid on- and off-policy optimization.CoRR, abs/2602.23008, 2026

Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, and Yuqing Yang. Exploratory memory- augmented llm agent via hybrid on- and off-policy optimization.CoRR, abs/2602.23008, 2026. 1

work page arXiv 2026

[44] [44]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.CoRR, abs/2503.20783,

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

Towards a unified view of large language model post-training.arXiv preprint arXiv:2509.04419,

Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, and Bowen Zhou. Towards a unified view of large language model post-training.CoRR, abs/2509.04419, 2025. 1, 2, 5, D.1 13

work page arXiv 2025

[46] [46]

Learning what reinforcement learning can't: Interleaved online fine-tuning for hardest questions

Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, and Wentao Zhang. Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions.CoRR, abs/2506.07527,

work page arXiv

[47] [47]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel J. Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling.CoRR, abs/2501.19393, 2025. 3.1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

Learning to reason with llms

OpenAI. Learning to reason with llms. https://openai.com/index/learning-to- reason-with-llms/, 2024. Accessed: 2024-09. 1, 2, 5

work page 2024

[49] [49]

Introducing openai o3 and o4-mini

OpenAI. Introducing openai o3 and o4-mini. https://openai.com/index/introducing- o3-and-o4-mini/, 2024. Accessed: 2024-12. 2

work page 2024

[50] [50]

Gpt-5 and the new era of work

OpenAI. Gpt-5 and the new era of work. https://openai.com/index/gpt-5-new-era- of-work/, 2025. Accessed: 2025-08. 1, 5

work page 2025

[51] [51]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

work page 2022

[52] [52]

Ram´e, A.; Ferret, J.; Vieillard, N.; Dadashi, R.; Hussenot, L.; Cedoz, P.-L.; Sessa, P

Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, and Pengfei Liu. O1 replication journey: A strategic progress report - part 1.CoRR, abs/2410.18982, 2024. 3.1

work page arXiv 2024

[53] [53]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark.CoRR, abs/2311.12022, 2023. 1, 4.1, B.6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[54] [54]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.CoRR, abs/1707.06347, 2017. 1, 2, 3.2

work page internal anchor Pith review Pith/arXiv arXiv 2017

[55] [55]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.CoRR, abs/2402.03300, 2024. 1, 1, 2, 4.1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[56] [56]

Hybridflow: A flexible and efficient RLHF framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. In EuroSys, pp. 1279–1297. ACM, 2025. 4.1, B.1

work page 2025

[57] [57]

Reflexion: language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. InNeurIPS, 2023. 5, D.1

work page 2023

[58] [58]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters.CoRR, abs/2408.03314, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[59] [59]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.CoRR, abs/2503.05592, 2025. 3.2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] [60]

Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards.arXiv preprint arXiv:2505.24760,

Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas Köpf. REASONING GYM: reasoning environments for reinforcement learning with verifiable rewards.CoRR, abs/2505.24760, 2025. 1, 4.1, B.6

work page arXiv 2025

[61]

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. ZeroSearch: Incentivize the search capability of LLMs without searching. CoRR, abs/2505.04588, 2025. 3.1

[62]

Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay

Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, and Huan Zhang. Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay. CoRR, abs/2506.05316, 2025. 5, D.1

[63]

QwQ-32B: Embracing the power of reinforcement learning

Qwen Team. QwQ-32B: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/. 5

[64]

ReMA: Learning to meta-think for LLMs with multi-agent reinforcement learning

Ziyu Wan, Yunxiang Li, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, and Ying Wen. ReMA: Learning to meta-think for LLMs with multi-agent reinforcement learning. CoRR, abs/2503.09501, 2025. 3.1

[65]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. CoRR, abs/2506.01939, 2025. A

[66]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. 3.1

[67]

TruthRL: Incentivizing truthful LLMs via reinforcement learning

Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, and Xin Luna Dong. TruthRL: Incentivizing truthful LLMs via reinforcement learning. CoRR, abs/2509.25760, 2025. D.2

[68]

Grok 4

xAI. Grok 4. https://x.ai/news/grok-4/, 2025. Accessed: 2025-07. 1

[69]

KDRL: Post-training reasoning LLMs via unified knowledge distillation and reinforcement learning

Hongling Xu, Qi Zhu, Heyuan Deng, Jinpeng Li, Lu Hou, Yasheng Wang, Lifeng Shang, Ruifeng Xu, and Fei Mi. KDRL: Post-training reasoning LLMs via unified knowledge distillation and reinforcement learning. CoRR, abs/2506.02208, 2025. 5

[70]

CoMAS: Co-evolving multi-agent systems via interaction rewards

Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, and Lei Bai. CoMAS: Co-evolving multi-agent systems via interaction rewards. CoRR, abs/2510.08529, 2025. 1

[71]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...


[72]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...


[73]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...


[74]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? CoRR, abs/2504.13837, 2025. 1, 5

[75]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Cheng-Xiang Wang, Tiantian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, and Lin Yan. VAPO: efficient and reliab...

[76]

Incentivizing LLMs to self-verify their answers

Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, and Bo An. Incentivizing LLMs to self-verify their answers. CoRR, abs/2506.01369, 2025. D.2

[77]

On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting

Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting. CoRR, abs/2508.11408, 2025. 1, 2, 2, 5, D.1

[78]

Learning to reason under off-policy guidance

Yue Zhang, Yafu Li, Ganqu Cui, Yu Cheng, Zhi Wang, Xiaoye Qu, Jianhao Yan, and Zican Hu. Learning to reason under off-policy guidance. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 1, 2, 2, 3.2, 5, D.1

[79]

Echo chamber: RL post-training amplifies behaviors learned in pretraining

Rosie Zhao, Alexandru Meterez, Sham M. Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: RL post-training amplifies behaviors learned in pretraining. CoRR, abs/2504.07912, 2025. 1, 5

[80]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. CoRR, abs/2601.18734, 2026. 1, 4.1, 5, D.1