CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning
Pith reviewed 2026-06-30 07:45 UTC · model grok-4.3
The pith
CRAFT assigns signed per-token credits in self-distilled agentic RL by importance-weighting already-sampled sibling rollouts to estimate counterfactual advantage changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CRAFT is a three-pillar credit-assignment scheme. Pillar 1 (Counterfactual Token Importance) reuses the G-1 sibling rollouts already sampled by GRPO and importance-weights them by the log-probability gap to form a self-normalised estimate of the group-level counterfactual change in advantage from up-weighting teacher-preferred actions at each step. Pillar 2 is an asymmetric controller that raises the distillation weight while lowering the reference-KL weight along an exponential moving average of gate activity, and conversely. Pillar 3 polarises the KL penalty token by token, switching between mode-seeking and mode-covering updates according to the sign of the credit. Each pillar has an independent switch that, when disabled, renders the loss and gradient byte-identical to the baseline in IEEE-754 arithmetic.
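To make the Pillar 2 description concrete, here is a minimal sketch of a controller that moves two weights in opposite directions along an exponential moving average. Everything in it is an assumption for illustration: the class name `AsymmetricController`, the decay `beta`, the weight ranges, the linear map from the average to the weights, and the choice that higher gate activity raises the distillation weight. The summary above only states that the two weights move oppositely; it does not give the update rule, and this sketch does not try to reproduce whatever asymmetry the paper specifies.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class AsymmetricController:
    """Hypothetical controller; names, ranges and the linear map are assumptions."""

    beta: float = 0.9         # EMA decay (assumed value)
    distill_min: float = 0.0  # assumed range of the distillation weight
    distill_max: float = 1.0
    kl_min: float = 0.0       # assumed range of the reference-KL weight
    kl_max: float = 0.1
    ema: float = 0.0          # running average of gate activity

    def update(self, gate_activity: float) -> Tuple[float, float]:
        """Fold one gate-activity reading in [0, 1] into the EMA.

        Returns (distillation_weight, reference_kl_weight). The two weights
        always move in opposite directions as the EMA changes.
        """
        if not 0.0 <= gate_activity <= 1.0:
            raise ValueError("gate_activity must lie in [0, 1]")
        self.ema = self.beta * self.ema + (1.0 - self.beta) * gate_activity
        distill = self.distill_min + (self.distill_max - self.distill_min) * self.ema
        kl = self.kl_max - (self.kl_max - self.kl_min) * self.ema
        return distill, kl
```

Feeding the controller a run of high gate activity pushes the distillation weight up and the reference-KL weight down; a run of low activity reverses both, which is the "and conversely" behaviour the abstract describes.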
What carries the argument
Counterfactual Token Importance, which reuses G-1 sibling rollouts and importance-weights them by the teacher-student log-probability gap to estimate the counterfactual group-level change in advantage.
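A minimal sketch of a self-normalised importance-weighted estimate of this kind follows. It assumes a specific form that the summary does not confirm: each sibling's weight is the exponential of its teacher-student log-probability gap at the step in question, and the credit is the reweighted mean sibling advantage minus the unweighted mean. The function name `snis_credit` and both argument names are hypothetical.

```python
import math
from typing import Sequence


def snis_credit(sibling_advantages: Sequence[float],
                log_gaps: Sequence[float]) -> float:
    """Signed credit from a self-normalised importance-weighted average.

    sibling_advantages: advantages of the G-1 sibling rollouts (assumed input).
    log_gaps: per-sibling teacher-minus-student log-probability gaps at one
        step (assumed input); exp(gap) is used as the unnormalised weight.

    Returns the reweighted mean advantage minus the plain mean. A positive
    value suggests that shifting mass toward teacher-preferred actions would
    raise the group advantage; a negative value suggests it would lower it.
    """
    if not sibling_advantages or len(sibling_advantages) != len(log_gaps):
        raise ValueError("need equally sized, non-empty inputs")
    shift = max(log_gaps)  # subtract the max so exp() cannot overflow
    weights = [math.exp(g - shift) for g in log_gaps]
    total = sum(weights)
    reweighted = sum(w * a for w, a in zip(weights, sibling_advantages)) / total
    baseline = sum(sibling_advantages) / len(sibling_advantages)
    return reweighted - baseline
```

With two siblings of advantage +1 and -1, equal gaps give a credit of zero, while a gap of log 3 on the first sibling gives a credit of about +0.5. Swapping the advantages flips the sign, which is the sign information a single scalar gate cannot carry.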
If this is right
- The estimator is consistent and admits a variance bound.
- Disabling any pillar renders the loss and gradient byte-identical to the baseline in IEEE-754 arithmetic.
- Performance gains can be isolated from Adaptive-CRINGE, which shares only Pillar 2.
- The method was evaluated across three agentic environments, four model scales, and five end-to-end methods.
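The byte-identical claim is stricter than it may look. The sketch below illustrates one reason, under the assumption that a pillar contributes an additive term to a scalar loss: skipping the term with a branch preserves the baseline bits, whereas multiplying the term by a zero weight need not. The function names are hypothetical and this is not the paper's implementation.

```python
import struct


def to_bytes(x: float) -> bytes:
    """Little-endian IEEE-754 binary64 encoding of a Python float."""
    return struct.pack("<d", x)


def loss_with_switch(base_loss: float, extra_term: float, enabled: bool) -> float:
    """Branch-style switch: when disabled, the baseline value is returned untouched."""
    if not enabled:
        return base_loss
    return base_loss + extra_term


def loss_with_zero_weight(base_loss: float, extra_term: float, weight: float) -> float:
    """Weight-style 'switch': the extra term is always added, scaled by weight."""
    return base_loss + weight * extra_term
```

Two cases show the difference. With `base_loss = -0.0`, the zero-weight version returns `+0.0`, whose encoding differs in the sign bit. With a non-finite `extra_term`, `0.0 * inf` is NaN and contaminates the sum. The branch version returns the baseline bits in both cases.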
Where Pith is reading between the lines
- The reuse of GRPO siblings suggests the technique could be applied to other multi-rollout RL settings where additional sampling is costly.
- Token-level sign information may allow more precise control of exploration versus exploitation in distillation objectives.
- The bit-exact reproducibility switches provide a template for isolating algorithmic contributions in future RL ablations.
Load-bearing premise
The G-1 sibling rollouts already sampled by GRPO suffice to form a self-normalised estimate of the group-level counterfactual change in advantage from up-weighting teacher-preferred actions at each step.
What would settle it
A direct measurement, on held-out trajectories, showing that the importance-weighted sibling estimate fails to predict the actual change in trajectory advantage when teacher-preferred actions receive higher weight, or that the observed variance exceeds the paper's stated bound.
Original abstract
Self-distilled agentic reinforcement learning augments trajectory-level reward with a token-level distillation loss, using as its teacher the same policy conditioned on privileged context. The prevailing recipe gates this loss by a single scalar, the teacher-student log-probability gap. This signal is doubly limited: it is retrospective, scoring only the realised rollout and never the counterfactual ones, and it is sign-blind, never signalling when a teacher-preferred action would have harmed the trajectory. We introduce CRAFT, a three-pillar credit-assignment scheme that addresses both limitations. Pillar 1, Counterfactual Token Importance, reuses the G-1 sibling rollouts that GRPO already samples and importance-weights them by the log-probability gap to form a self-normalised estimate of the group-level counterfactual change in advantage from up-weighting teacher-preferred actions at each step; this yields a signed per-token credit at near-zero extra compute. Pillar 2 is an asymmetric controller that raises the distillation weight as it lowers the reference-KL weight along an exponential moving average of gate activity, and conversely. Pillar 3 polarises the KL penalty token by token, switching between a mode-seeking and a mode-covering update according to the sign of the credit. Each pillar has an independent switch that, when disabled, renders the loss and gradient byte-identical to the baseline in IEEE-754 arithmetic, so any measured gain is attributable to algorithmic change rather than implementation drift. We prove the estimator's consistency and a variance bound, give structural and bit-exact reproducibility guarantees, and evaluate CRAFT across three agentic environments, four model scales, and five end-to-end methods, plus two tabulated prior-work baselines. Among these is Adaptive-CRINGE, a comparator sharing Pillar 2 with CRAFT, isolating the counterfactual contribution.
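The Pillar 3 switch can be sketched for a single token with a categorical distribution. The sketch assumes that "mode-seeking" refers to the reverse direction KL(student || target) and "mode-covering" to the forward direction KL(target || student), which is the usual convention when minimising over the student. The mapping from credit sign to direction is a guess for illustration only; the abstract does not say which sign selects which update. The names `kl` and `polarised_kl` are hypothetical.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two categorical distributions over the same support.

    Terms with p_i == 0 contribute zero; q_i must be positive wherever p_i is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def polarised_kl(student: Sequence[float],
                 target: Sequence[float],
                 credit: float) -> float:
    """Pick the KL direction for one token from the sign of its credit.

    Assumed mapping (not confirmed by the source):
      credit >= 0 -> reverse KL(student || target), mode-seeking
      credit <  0 -> forward KL(target || student), mode-covering
    """
    if credit >= 0.0:
        return kl(student, target)
    return kl(target, student)
```

For a student of (0.5, 0.5) and a target of (0.9, 0.1), the two directions give different penalties, roughly 0.51 for the reverse direction and 0.37 for the forward one, so the sign of the credit changes both the value and the gradient of the per-token penalty.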
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CRAFT, a three-pillar credit-assignment method for self-distilled agentic RL. Pillar 1 reuses the G-1 GRPO sibling rollouts to form a self-normalised importance-weighted estimate of the group-level counterfactual advantage change when up-weighting teacher-preferred tokens, yielding signed per-token credit at near-zero cost. Pillar 2 is an asymmetric controller that trades off distillation weight against reference-KL weight via an EMA of gate activity. Pillar 3 polarises the KL penalty token-by-token according to the sign of the credit. The manuscript claims a consistency proof and variance bound for the estimator, structural and bit-exact reproducibility guarantees via independent switches, and empirical gains across three agentic environments, four model scales, five end-to-end methods, and two prior-work baselines.
Significance. If the consistency proof is valid and the finite-sample variance of the self-normalised estimator remains controlled, CRAFT would supply an efficient mechanism for signed token-level credit that is absent from the prevailing retrospective, sign-blind distillation gate. The explicit reproducibility guarantees (each pillar can be disabled to recover the baseline loss and gradient in IEEE-754 arithmetic) and the isolation of the counterfactual contribution via the Adaptive-CRINGE comparator are concrete strengths that would strengthen any positive result.
Major comments (2)
- [Abstract (Pillar 1)] Abstract (Pillar 1 description): the consistency and variance bound for the self-normalised importance-weighted estimator are asserted, yet the bound is described only as holding in the limit; no explicit dependence on group size G or on the magnitude of log-probability gaps is supplied. Because the central claim requires that the G-1 siblings already sampled by GRPO produce reliable signed credits rather than noise-dominated estimates, the absence of a non-asymptotic guarantee or a demonstration that extreme weights do not induce finite-sample bias is load-bearing.
- [Abstract (evaluation)] Abstract (evaluation paragraph): the manuscript states empirical gains across three environments, four scales and five methods but supplies neither error bars on the reported improvements nor any quantification of the realised variance of the Pillar-1 estimator. Without these diagnostics it is impossible to verify that the counterfactual credits remain useful at the moderate G values actually employed.
Minor comments (1)
- [Abstract] The abstract refers to 'structural and bit-exact reproducibility guarantees' but does not enumerate the exact switch settings that recover the baseline; a short table or enumerated list would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the theoretical claims and empirical reporting. We address each major comment below and outline the revisions we will make.
- Referee: [Abstract (Pillar 1)] Abstract (Pillar 1 description): the consistency and variance bound for the self-normalised importance-weighted estimator are asserted, yet the bound is described only as holding in the limit; no explicit dependence on group size G or on the magnitude of log-probability gaps is supplied. Because the central claim requires that the G-1 siblings already sampled by GRPO produce reliable signed credits rather than noise-dominated estimates, the absence of a non-asymptotic guarantee or a demonstration that extreme weights do not induce finite-sample bias is load-bearing.
Authors: We acknowledge that the abstract presents the guarantees in asymptotic terms without explicit finite-sample dependence on G or log-probability gap magnitude. The appendix contains the consistency proof (as G → ∞ under standard importance-sampling assumptions) and a variance bound that holds when gaps are bounded; however, we agree these details are not foregrounded in the abstract. We will revise the abstract to state the asymptotic character explicitly and add a short clause referencing the appendix conditions. We do not currently possess a tight non-asymptotic bound that would cover arbitrary gap sizes without further assumptions on the policy class; deriving one would constitute new theoretical work beyond the present scope. In the revision we will instead include a brief empirical diagnostic (already computed during experiments) showing that the self-normalised weights remained moderate for the G values used (4–8) and that the resulting credits correlated with downstream performance gains. revision: partial
- Referee: [Abstract (evaluation)] Abstract (evaluation paragraph): the manuscript states empirical gains across three environments, four scales and five methods but supplies neither error bars on the reported improvements nor any quantification of the realised variance of the Pillar-1 estimator. Without these diagnostics it is impossible to verify that the counterfactual credits remain useful at the moderate G values actually employed.
Authors: We agree that the absence of error bars and estimator-variance diagnostics weakens the empirical claim. The reported numbers are means over multiple random seeds, but standard deviations were computed and can be added. In the revised manuscript we will include error bars (or confidence intervals) on all tables and figures that report performance deltas, and we will add a supplementary table or figure that reports the empirical variance of the Pillar-1 importance-weighted estimator across the three environments and the G values actually used. These additions will directly address whether the signed credits remain informative rather than noise-dominated at moderate group sizes. revision: yes
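One standard way to quantify whether self-normalised weights "remained moderate" is Kish's effective sample size, together with the largest normalised weight. The rebuttal does not say which diagnostic the authors computed, so the sketch below is only an example of what such a check could look like. It assumes the unnormalised weight of each sibling is the exponential of its log-probability gap; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def normalised_weights(log_gaps: Sequence[float]) -> List[float]:
    """Self-normalised weights proportional to exp(log_gap), computed stably."""
    if not log_gaps:
        raise ValueError("need at least one sibling")
    shift = max(log_gaps)  # subtract the max so exp() cannot overflow
    raw = [math.exp(g - shift) for g in log_gaps]
    total = sum(raw)
    return [r / total for r in raw]


def effective_sample_size(log_gaps: Sequence[float]) -> float:
    """Kish effective sample size, 1 / sum(w_i^2) for normalised weights w_i.

    Equals the number of siblings when all weights are equal and approaches 1
    when a single sibling dominates the estimate.
    """
    weights = normalised_weights(log_gaps)
    return 1.0 / sum(w * w for w in weights)


def max_weight(log_gaps: Sequence[float]) -> float:
    """Largest normalised weight; values near 1 flag a degenerate estimate."""
    return max(normalised_weights(log_gaps))
```

With G between 4 and 8 there are only 3 to 7 siblings, so an effective sample size close to 1 at a given step would mean the signed credit there rests on essentially one rollout.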
Circularity Check
No circularity: estimator defined from external GRPO samples with independent consistency proof
The paper's Pillar 1 estimator is constructed directly from the G-1 sibling rollouts already produced by GRPO, using importance weighting by the log-probability gap to produce a self-normalised counterfactual advantage change; the authors then state a separate consistency proof and variance bound for this construction. No equation reduces the output credit to a fitted parameter renamed as a prediction, nor does any load-bearing premise rest on a self-citation whose content is itself unverified or defined in terms of the target result. Reproducibility switches and cross-method evaluations supply external grounding. The Adaptive-CRINGE comparator is invoked only to isolate the contribution of the new pillar, not to justify the estimator itself.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the G-1 sibling rollouts already sampled by GRPO are sufficient to form a self-normalised estimate of the group-level counterfactual change in advantage from up-weighting teacher-preferred actions at each step.