CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

Kani Chen; Zibin Meng

arxiv: 2606.29476 · v1 · pith:IIOXBUCHnew · submitted 2026-06-28 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-30 07:45 UTC · model grok-4.3

keywords counterfactual credit assignment · self-distilled reinforcement learning · agentic RL · token-level distillation · GRPO · signed advantage estimate · KL penalty polarization

The pith

CRAFT assigns signed per-token credits in self-distilled agentic RL by importance-weighting already-sampled sibling rollouts to estimate counterfactual advantage changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CRAFT, a method that addresses two limitations in how token-level distillation losses are gated in agentic reinforcement learning. The standard approach relies on a single scalar, the teacher-student log-probability gap, which scores only the rollout that actually occurred and never indicates whether a teacher-preferred action would have improved or harmed the outcome. CRAFT reuses the G-1 sibling rollouts that GRPO already generates, importance-weights them by the log-probability gap, and produces a self-normalised estimate of the group-level change in advantage that would result from up-weighting teacher-preferred actions. This signed credit is obtained at near-zero extra compute and drives the two remaining pillars: an asymmetric controller that trades off distillation weight against reference-KL weight, and a token-wise switch between mode-seeking and mode-covering KL updates. The paper proves consistency of the estimator, gives a variance bound, and provides independent per-pillar switches that, when disabled, render the loss and gradient byte-identical to the baseline.

Core claim

CRAFT is a three-pillar credit-assignment scheme. Pillar 1 (Counterfactual Token Importance) reuses the G-1 sibling rollouts already sampled by GRPO and importance-weights them by the log-probability gap to form a self-normalised estimate of the group-level counterfactual change in advantage from up-weighting teacher-preferred actions at each step. Pillar 2 is an asymmetric controller that raises the distillation weight while lowering the reference-KL weight along an exponential moving average of gate activity. Pillar 3 polarises the KL penalty token by token, switching between mode-seeking and mode-covering updates according to the sign of the credit. Each pillar has an independent switch t

What carries the argument

Counterfactual Token Importance, which reuses G-1 sibling rollouts and importance-weights them by the teacher-student log-probability gap to estimate the counterfactual group-level change in advantage.
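As a rough illustration of the Pillar 1 idea, the sketch below computes a self-normalised importance-weighted shift in mean advantage over a group of sibling rollouts. It is not the paper's estimator. The function name `snis_advantage_shift`, the input layout, the use of exp(gap) as the importance weight, and the choice of the unweighted sibling mean as the reference point are all assumptions made for this example.

```python
import math


def snis_advantage_shift(advantages, log_gaps):
    """Self-normalised importance-weighted shift in mean advantage.

    advantages: one advantage per sibling rollout (hypothetical layout).
    log_gaps:   teacher-minus-student log-probability gap per sibling.
    Returns the weighted mean advantage minus the unweighted mean.
    """
    if not advantages or len(advantages) != len(log_gaps):
        raise ValueError("need one log gap per sibling advantage")
    # Subtract the max gap before exponentiating for numerical stability;
    # the constant cancels after self-normalisation.
    m = max(log_gaps)
    weights = [math.exp(g - m) for g in log_gaps]
    total = sum(weights)
    tilted = sum(w * a for w, a in zip(weights, advantages)) / total
    baseline = sum(advantages) / len(advantages)
    return tilted - baseline
```

With advantages `[1.0, -1.0]` and gaps `[math.log(3.0), 0.0]`, the weights normalise to 0.75 and 0.25, so the function returns approximately 0.5. A positive value means the teacher-favoured siblings did better than the group average; a negative value means they did worse.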

If this is right

The estimator is consistent and admits a variance bound.
Disabling any pillar renders the loss and gradient byte-identical to the baseline in IEEE-754 arithmetic.
Performance gains can be isolated from Adaptive-CRINGE, which shares only Pillar 2.
The method was evaluated across three agentic environments, four model scales, and five end-to-end methods.
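The byte-identity claim above is easier to appreciate with a small floating-point example. The sketch below contrasts two ways of disabling an extra loss term on scalar Python floats: skipping it with a branch versus multiplying it by a zero weight. The function names are hypothetical and this is not the paper's switch implementation; it only shows why a zero weight does not, in general, reproduce the baseline bit for bit.

```python
import math
import struct


def bits(x):
    """Big-endian IEEE-754 double encoding of x, for bitwise comparison."""
    return struct.pack(">d", x)


def loss_with_switch(base_loss, extra_term, enabled):
    # Disabled path never touches extra_term, so base_loss passes through.
    if not enabled:
        return base_loss
    return base_loss + extra_term


def loss_with_zero_weight(base_loss, extra_term, weight):
    # Always performs the multiply and the add, even when weight is 0.0.
    return base_loss + weight * extra_term


# -0.0 + 0.0 rounds to +0.0, so the sign bit of the baseline is lost.
same = bits(loss_with_switch(-0.0, 1.5, False)) == bits(-0.0)
drift = bits(loss_with_zero_weight(-0.0, 1.5, 0.0)) != bits(-0.0)
# 0.0 * inf is NaN, which then contaminates the sum.
poisoned = math.isnan(loss_with_zero_weight(1.0, float("inf"), 0.0))
```

All three flags evaluate to `True`. A branch-style switch is one simple way to obtain bit-exact fallback; whether the paper uses this mechanism is not stated in the abstract.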

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reuse of GRPO siblings suggests the technique could be applied to other multi-rollout RL settings where additional sampling is costly.
Token-level sign information may allow more precise control of exploration versus exploitation in distillation objectives.
The bit-exact reproducibility switches provide a template for isolating algorithmic contributions in future RL ablations.

Load-bearing premise

The G-1 sibling rollouts already sampled by GRPO suffice to form a self-normalised estimate of the group-level counterfactual change in advantage from up-weighting teacher-preferred actions at each step.

What would settle it

A direct measurement, on held-out trajectories, showing that the importance-weighted sibling estimate fails to predict the actual change in trajectory advantage when teacher-preferred actions receive higher weight, or that the observed variance exceeds the paper's stated bound.

Figures

Figures reproduced from arXiv: 2606.29476 by Kani Chen, Zibin Meng.

**Figure 1.** The three pillars of CRAFT at a glance. Left: on the (sign A(i), sign ∆t) plane, the prior single-gate baseline collapses two of the four quadrants by always distilling toward the teacher (…

Original abstract

Self-distilled agentic reinforcement learning augments trajectory-level reward with a token-level distillation loss, using as its teacher the same policy conditioned on privileged context. The prevailing recipe gates this loss by a single scalar, the teacher-student log-probability gap. This signal is doubly limited: it is retrospective, scoring only the realised rollout and never the counterfactual ones, and it is sign-blind, never signalling when a teacher-preferred action would have harmed the trajectory. We introduce CRAFT, a three-pillar credit-assignment scheme that addresses both limitations. Pillar 1, Counterfactual Token Importance, reuses the G-1 sibling rollouts that GRPO already samples and importance-weights them by the log-probability gap to form a self-normalised estimate of the group-level counterfactual change in advantage from up-weighting teacher-preferred actions at each step; this yields a signed per-token credit at near-zero extra compute. Pillar 2 is an asymmetric controller that raises the distillation weight as it lowers the reference-KL weight along an exponential moving average of gate activity, and conversely. Pillar 3 polarises the KL penalty token by token, switching between a mode-seeking and a mode-covering update according to the sign of the credit. Each pillar has an independent switch that, when disabled, renders the loss and gradient byte-identical to the baseline in IEEE-754 arithmetic, so any measured gain is attributable to algorithmic change rather than implementation drift. We prove the estimator's consistency and a variance bound, give structural and bit-exact reproducibility guarantees, and evaluate CRAFT across three agentic environments, four model scales, and five end-to-end methods, plus two tabulated prior-work baselines. Among these is Adaptive-CRINGE, a comparator sharing Pillar 2 with CRAFT, isolating the counterfactual contribution.
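The Pillar 2 controller can be pictured with a toy update rule. The sketch below tracks an exponential moving average of gate activity and moves the two weights in opposite directions. Every specific here is an assumption made for illustration: the class name `AsymmetricController`, the default values, the target level, the multiplicative step, and in particular which direction of gate activity raises the distillation weight. The abstract does not give the paper's actual rule.

```python
from dataclasses import dataclass


@dataclass
class AsymmetricController:
    distill_weight: float = 1.0   # hypothetical starting value
    ref_kl_weight: float = 0.1    # hypothetical starting value
    ema: float = 0.0              # running average of gate activity
    decay: float = 0.9            # EMA decay factor (assumed)
    target: float = 0.5           # activity level separating the two regimes (assumed)
    step: float = 0.05            # relative change per update (assumed)

    def update(self, gate_activity: float) -> None:
        """Fold in the new gate activity, then move the weights in opposite directions."""
        self.ema = self.decay * self.ema + (1.0 - self.decay) * gate_activity
        # Assumed convention: high activity raises distillation and lowers reference KL.
        direction = 1.0 if self.ema > self.target else -1.0
        self.distill_weight *= 1.0 + self.step * direction
        self.ref_kl_weight *= 1.0 - self.step * direction
```

Starting from `ema=0.9` and calling `update(1.0)` leaves the average above the target, so the distillation weight grows while the reference-KL weight shrinks; starting from `ema=0.0` and calling `update(0.0)` does the reverse.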

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CRAFT reuses GRPO siblings for signed per-token credits via self-normalized importance weighting, with reproducibility switches and broad evals, but the finite-sample variance of that estimator is the part that still needs checking.

The new piece is Pillar 1: it takes the G-1 sibling rollouts that GRPO already produces, importance-weights them by the teacher-student log-prob gap, and forms a self-normalized estimate of how much the group advantage would change if teacher-preferred tokens were up-weighted. That gives a signed credit at essentially no extra sampling cost. Pillars 2 and 3 then adjust the distillation weight dynamically and flip the KL between mode-seeking and mode-covering per token. The paper also adds independent on/off switches: with a pillar disabled, the loss and gradient are byte-identical to the baseline, so any measured gain can be attributed to the algorithmic change rather than to implementation drift.

They claim a consistency proof and a variance bound for the estimator, plus structural and bit-exact reproducibility guarantees. The evaluation covers three agentic environments, four model scales, five end-to-end methods, and a direct comparator (Adaptive-CRINGE) that shares Pillar 2, which helps isolate the counterfactual contribution.

The main soft spot is exactly the one the stress-test flags. Self-normalized importance sampling on small G can produce unstable weights when log-prob gaps are large, and the bound is likely asymptotic. The abstract does not show realized variance, error bars, or how extreme weights are clipped, so it is not yet clear whether the signed credits stay informative rather than noise-dominated in the regimes they actually run. If the full derivations and tables address this, the claim strengthens; otherwise the empirical gains rest on unverified stability.
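One standard way to check whether self-normalised weights have collapsed onto a few siblings is the Kish effective sample size, 1 / Σ w̃², computed on the normalised weights. The sketch below is a generic diagnostic under the assumption that the weights are exp(log-probability gap); the function names are hypothetical and nothing here is taken from the paper.

```python
import math


def normalised_weights(log_gaps):
    """Self-normalised weights proportional to exp(log gap)."""
    m = max(log_gaps)  # subtract the max for numerical stability
    raw = [math.exp(g - m) for g in log_gaps]
    total = sum(raw)
    return [w / total for w in raw]


def effective_sample_size(log_gaps):
    """Kish effective sample size: 1 / sum of squared normalised weights."""
    weights = normalised_weights(log_gaps)
    return 1.0 / sum(w * w for w in weights)
```

With four equal gaps the effective sample size is 4; with gaps `[10.0, 0.0, 0.0, 0.0]` it falls to just above 1, meaning a single sibling determines almost the entire estimate. At small group sizes this is the regime in which a signed credit is most likely to be dominated by noise.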

This is for people working on token-level credit assignment inside agentic RL loops for language models. Anyone already using GRPO-style group sampling will see the practical angle immediately. The formal claims plus the breadth of experiments are enough to justify sending it to a serious referee, even if the variance question will probably require extra work in revision.

Referee Report

2 major / 1 minor

Summary. The paper introduces CRAFT, a three-pillar credit-assignment method for self-distilled agentic RL. Pillar 1 reuses the G-1 GRPO sibling rollouts to form a self-normalised importance-weighted estimate of the group-level counterfactual advantage change when up-weighting teacher-preferred tokens, yielding signed per-token credit at near-zero cost. Pillar 2 is an asymmetric controller that trades off distillation weight against reference-KL weight via an EMA of gate activity. Pillar 3 polarises the KL penalty token-by-token according to the sign of the credit. The manuscript claims a consistency proof and variance bound for the estimator, structural and bit-exact reproducibility guarantees via independent switches, and empirical gains across three agentic environments, four model scales, five end-to-end methods, and two prior-work baselines.
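For readers less familiar with the mode-seeking versus mode-covering distinction in Pillar 3, the sketch below selects between reverse KL (student against teacher, mode-seeking) and forward KL (teacher against student, mode-covering) from the sign of a per-token credit. The mapping of positive credit to the mode-seeking direction is an assumption for illustration, as are the function names; the abstract does not say which sign selects which divergence.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def polarised_kl(student, teacher, credit):
    """Pick the KL direction for one token from the sign of its credit."""
    if credit > 0:
        # Reverse KL, KL(student || teacher): mode-seeking.
        return kl(student, teacher)
    # Forward KL, KL(teacher || student): mode-covering (also used when credit == 0).
    return kl(teacher, student)
```

For `student = [0.5, 0.5]` and `teacher = [0.9, 0.1]`, the reverse direction gives about 0.511 and the forward direction about 0.368, so the two branches produce different penalties for the same pair of distributions.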

Significance. If the consistency proof is valid and the finite-sample variance of the self-normalised estimator remains controlled, CRAFT would supply an efficient mechanism for signed token-level credit that is absent from the prevailing retrospective, sign-blind distillation gate. The explicit reproducibility guarantees (each pillar can be disabled to recover the baseline loss and gradient in IEEE-754 arithmetic) and the isolation of the counterfactual contribution via the Adaptive-CRINGE comparator are concrete strengths that would lend weight to any positive result.

major comments (2)

[Abstract (Pillar 1)] The consistency and variance bound for the self-normalised importance-weighted estimator are asserted, yet the bound is described only as holding in the limit; no explicit dependence on group size G or on the magnitude of log-probability gaps is supplied. Because the central claim requires that the G-1 siblings already sampled by GRPO produce reliable signed credits rather than noise-dominated estimates, the absence of a non-asymptotic guarantee or a demonstration that extreme weights do not induce finite-sample bias is load-bearing.
[Abstract (evaluation)] The manuscript states empirical gains across three environments, four scales and five methods but supplies neither error bars on the reported improvements nor any quantification of the realised variance of the Pillar-1 estimator. Without these diagnostics it is impossible to verify that the counterfactual credits remain useful at the moderate G values actually employed.

minor comments (1)

[Abstract] The abstract refers to 'structural and bit-exact reproducibility guarantees' but does not enumerate the exact switch settings that recover the baseline; a short table or enumerated list would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the theoretical claims and empirical reporting. We address each major comment below and outline the revisions we will make.

Referee: [Abstract (Pillar 1)] The consistency and variance bound for the self-normalised importance-weighted estimator are asserted, yet the bound is described only as holding in the limit; no explicit dependence on group size G or on the magnitude of log-probability gaps is supplied. Because the central claim requires that the G-1 siblings already sampled by GRPO produce reliable signed credits rather than noise-dominated estimates, the absence of a non-asymptotic guarantee or a demonstration that extreme weights do not induce finite-sample bias is load-bearing.

Authors: We acknowledge that the abstract presents the guarantees in asymptotic terms without explicit finite-sample dependence on G or log-probability gap magnitude. The appendix contains the consistency proof (as G → ∞ under standard importance-sampling assumptions) and a variance bound that holds when gaps are bounded; however, we agree these details are not foregrounded in the abstract. We will revise the abstract to state the asymptotic character explicitly and add a short clause referencing the appendix conditions. We do not currently possess a tight non-asymptotic bound that would cover arbitrary gap sizes without further assumptions on the policy class; deriving one would constitute new theoretical work beyond the present scope. In the revision we will instead include a brief empirical diagnostic (already computed during experiments) showing that the self-normalised weights remained moderate for the G values used (4–8) and that the resulting credits correlated with downstream performance gains. revision: partial
Referee: [Abstract (evaluation)] The manuscript states empirical gains across three environments, four scales and five methods but supplies neither error bars on the reported improvements nor any quantification of the realised variance of the Pillar-1 estimator. Without these diagnostics it is impossible to verify that the counterfactual credits remain useful at the moderate G values actually employed.

Authors: We agree that the absence of error bars and estimator-variance diagnostics weakens the empirical claim. The reported numbers are means over multiple random seeds; standard deviations were computed and can be added. In the revised manuscript we will include error bars (or confidence intervals) on all tables and figures that report performance deltas, and we will add a supplementary table or figure that reports the empirical variance of the Pillar-1 importance-weighted estimator across the three environments and the G values actually used. These additions will directly address whether the signed credits remain informative rather than noise-dominated at moderate group sizes. revision: yes
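An estimator-variance diagnostic of the kind discussed in this exchange could take many forms. The sketch below shows one generic option, a nonparametric bootstrap over the siblings of a single group, applied to a simple self-normalised importance-weighted estimate. Both functions, their names, and the resampling scheme are assumptions for illustration and do not describe what the authors computed.

```python
import math
import random
import statistics


def snis_shift(advantages, log_gaps):
    """Self-normalised weighted mean advantage minus the plain mean (assumed form)."""
    m = max(log_gaps)
    weights = [math.exp(g - m) for g in log_gaps]
    tilted = sum(w * a for w, a in zip(weights, advantages)) / sum(weights)
    return tilted - sum(advantages) / len(advantages)


def bootstrap_variance(advantages, log_gaps, n_boot=500, seed=0):
    """Variance of snis_shift under resampling siblings with replacement."""
    rng = random.Random(seed)  # fixed seed keeps the diagnostic reproducible
    n = len(advantages)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            snis_shift([advantages[i] for i in idx], [log_gaps[i] for i in idx])
        )
    return statistics.pvariance(estimates)
```

Reporting this quantity alongside the magnitude of the credit itself, per environment and per group size, would indicate whether the sign of the credit is stable under resampling. With only a handful of siblings the bootstrap is itself coarse, so it is best read as a rough indicator rather than a calibrated error bar.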

Circularity Check

0 steps flagged

No circularity: estimator defined from external GRPO samples with independent consistency proof

The paper's Pillar 1 estimator is constructed directly from the G-1 sibling rollouts already produced by GRPO, using importance weighting by the log-probability gap to produce a self-normalised counterfactual advantage change; the authors then state a separate consistency proof and variance bound for this construction. No equation reduces the output credit to a fitted parameter renamed as a prediction, nor does any load-bearing premise rest on a self-citation whose content is itself unverified or defined in terms of the target result. Reproducibility switches and cross-method evaluations supply external grounding. The Adaptive-CRINGE comparator is invoked only to isolate the contribution of the new pillar, not to justify the estimator itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that sibling rollouts already generated by GRPO suffice for consistent counterfactual estimation; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: The G-1 sibling rollouts already sampled by GRPO are sufficient to form a self-normalised estimate of the group-level counterfactual change in advantage from up-weighting teacher-preferred actions at each step.
Invoked directly in the description of Pillar 1.

[1] [1]

Proximal Policy Optimization Algorithms

Schulman, John and Wolski, Filip and Dhariwal, Prafulla and Radford, Alec and Klimov, Oleg , title =. arXiv preprint arXiv:1707.06347 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe...

[3] [3]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Yong-Kang and Wu, Yu and Guo, Daya , title =. arXiv preprint arXiv:2402.03300 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

International Conference on Learning Representations (ICLR) , year =

Welleck, Sean and Kulikov, Ilia and Roller, Stephen and Dinan, Emily and Cho, Kyunghyun and Weston, Jason , title =. International Conference on Learning Representations (ICLR) , year =

[5] [5]

Annual Meeting of the Association for Computational Linguistics (ACL) , year =

Adolphs, Leonard and Gao, Tianyu and Xu, Jing and Shuster, Kurt and Sukhbaatar, Sainbayar and Weston, Jason , title =. Annual Meeting of the Association for Computational Linguistics (ACL) , year =

[6] [6]

and Ermon, Stefano and Finn, Chelsea , title =

Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[7] [7]

NIPS Deep Learning and Representation Learning Workshop , year =

Hinton, Geoffrey and Vinyals, Oriol and Dean, Jeff , title =. NIPS Deep Learning and Representation Learning Workshop , year =

[8] [8]

Microsoft Research Technical Report MSR-TR-2005-173 , year =

Minka, Thomas , title =. Microsoft Research Technical Report MSR-TR-2005-173 , year =

2005

[9] [9]

Self-Distilled Agentic Reinforcement Learning

Lu, Zhengxi and Yao, Zhiyuan and Han, Zhuowen and Wang, Zi-Han and Wu, Jinyang and Gu, Qi and Cai, Xunliang and Lu, Weiming and Xiao, Jun and Zhuang, Yueting and Shen, Yongliang , title =. arXiv preprint arXiv:2605.15155 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Self-Distilled RLVR

Yang, Chenxu and Qin, Chuanyu and Si, Qingyi and Chen, Minghui and Gu, Naibin and Yao, Dingyu and Lin, Zheng and Wang, Weiping and Wang, Jiaqi and Duan, Nan , title =. arXiv preprint arXiv:2604.03128 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

International Conference on Learning Representations (ICLR) , year =

Shridhar, Mohit and Yuan, Xingdi and C. International Conference on Learning Representations (ICLR) , year =

[12] [12]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Yao, Shunyu and Chen, Howard and Yang, John and Narasimhan, Karthik , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[13] [13]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Jin, Bowen and Wang, Xinyu and Wang, Han and Pan, Lin and Ji, Yuxiao and Sui, Zhifang and Han, Jiawei , title =. arXiv preprint arXiv:2503.09516 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

arXiv preprint arXiv:2412.15115 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

arXiv preprint arXiv:2505.09388 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[16]

Horvitz, Daniel G. and Thompson, Donovan J. Journal of the American Statistical Association.

[17]

Precup, Doina and Sutton, Richard S. and Singh, Satinder. International Conference on Machine Learning (ICML).

[18]

Pearl, Judea. Cambridge University Press, Second Edition.

[19]

Buesing, Lars and Weber, Theophane and Zwols, Yori and Heess, Nicolas and Racaniere, Sebastien and Guez, Arthur and Lespiau, Jean-Baptiste. International Conference on Learning Representations (ICLR).

[20]

Williams, Ronald J. Machine Learning.

[21]

Sutton, Richard S. and McAllester, David and Singh, Satinder and Mansour, Yishay. Advances in Neural Information Processing Systems (NeurIPS).

[22]

Schulman, John and Levine, Sergey and Moritz, Philipp and Jordan, Michael I. and Abbeel, Pieter. International Conference on Machine Learning (ICML).

[23]

Schulman, John. Blog post, joschu.net.

[24]

Xu, Hanyang and Lu, Shudong and Hao, Yan and Chen, Wenhu and Chen, Min. arXiv preprint.

[25]

Wang, Hao and Wang, Guozhi and Xiao, Han and Zhou, Yufeng and Pan, Yue and Wang, Jichao and Xu, Ke and Wen, Yafei and Ruan, Xiaohu and Chen, Xiaoxin and Qi, Honggang. Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents. arXiv preprint arXiv:2604.10674.

[26]

Liu, Xiao and Yu, Hao and Zhang, Hanchen and Xu, Yifan and Lei, Xuanyu and Lai, Hanyu and Gu, Yu and Ding, Hangliang and Men, Kaiwen and Yang, Kejuan and Zhang, Shudan and Deng, Xiang and Zeng, Aohan and Du, Zhengxiao and Zhang, Chenhui and Shen, Sheng and Zhang, Tianjun and Su, Yu and Sun, Huan and Huang, Minlie and Dong, Yuxiao and Tang, Jie. International Conference on Learning Representations (ICLR).

[27]

Zelikman, Eric and Wu, Yuhuai and Mu, Jesse and Goodman, Noah D. Advances in Neural Information Processing Systems (NeurIPS).

[28]

Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

[29]

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob. NeurIPS Track on Datasets and Benchmarks.

[30]

Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan. International Conference on Learning Representations (ICLR).

[31]

Shinn, Noah and Cassano, Federico and Berman, Edward and Gopinath, Ashwin and Narasimhan, Karthik and Yao, Shunyu. Advances in Neural Information Processing Systems (NeurIPS).

[32]

Schick, Timo and Dwivedi-Yu, Jane and Dess. Advances in Neural Information Processing Systems (NeurIPS).

[33]

Zelikman, Eric and Harik, Georges and Shao, Yijia and Jayasiri, Varuna and Haber, Nick and Goodman, Noah D. Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking. arXiv preprint arXiv:2403.09629.

[34]

Gulcehre, Caglar and Le Paine, Tom and Srinivasan, Srivatsan and Konyushkova, Ksenia and Weerts, Lotte and Sharma, Abhishek and Siddhant, Aditya and Ahern, Alex and Wang, Miaosen and Gu, Chenjie and Macherey, Wolfgang and Doucet, Arnaud and Firat, Orhan and de Freitas, Nando. Reinforced Self-Training (ReST) for Language Modeling. arXiv preprint arXiv:2308.08998.

[35]

Chen, Zixiang and Deng, Yihe and Yuan, Huizhuo and Ji, Kaixuan and Gu, Quanquan. International Conference on Machine Learning (ICML).

[36]

Yuan, Weizhe and Pang, Richard Yuanzhe and Cho, Kyunghyun and Li, Xian and Sukhbaatar, Sainbayar and Xu, Jing and Weston, Jason. International Conference on Machine Learning (ICML).

[37]

Agarwal, Rishabh and Vieillard, Nino and Zhou, Yongchao and Stanczyk, Piotr and Ramos, Sabela and Geist, Matthieu and Bachem, Olivier. International Conference on Learning Representations (ICLR).

[38]

Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and others. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.

[39]

arXiv preprint arXiv:2412.19437.

[40]

Swaminathan, Adith and Joachims, Thorsten. International Conference on Machine Learning (ICML).

[41]

Wang, Ruoyao and Jansen, Peter and C. Conference on Empirical Methods in Natural Language Processing (EMNLP).

[42]

Chevalier-Boisvert, Maxime and Bahdanau, Dzmitry and Lahlou, Salem and Willems, Lucas and Saharia, Chitwan and Nguyen, Thien Huu and Bengio, Yoshua. International Conference on Learning Representations (ICLR).

[43]

Pang, Richard Yuanzhe and Yuan, Weizhe and Cho, Kyunghyun and He, He and Sukhbaatar, Sainbayar and Weston, Jason. Advances in Neural Information Processing Systems (NeurIPS).

[44]

Ahmadian, Arash and Cremer, Chris and Gall. Back to Basics: Revisiting. Annual Meeting of the Association for Computational Linguistics (ACL).

[45]

Kazemnejad, Amirhossein and Aghajohari, Milad and Portelance, Eva and Sordoni, Alessandro and Reddy, Siva and Courville, Aaron and Le Roux, Nicolas. VinePPO: Refining Credit Assignment in RL Training of LLMs. arXiv preprint arXiv:2410.01679, 2025.

[46]

Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl. International Conference on Learning Representations (ICLR).

[47]

Wang, Peiyi and Li, Lei and Shao, Zhihong and Xu, Runxin and Dai, Damai and Li, Yifei and Chen, Deli and Wu, Yu and Sui, Zhifang. Annual Meeting of the Association for Computational Linguistics (ACL).