
arxiv: 2605.19447 · v1 · pith:LQW767VYnew · submitted 2026-05-19 · 💻 cs.AI

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Pith reviewed 2026-05-20 05:38 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-turn agents · hindsight distillation · environment feedback · credit assignment · LLM agents · ALFWorld · WebShop · reinforcement learning

The pith

Selective use of environment feedback in hindsight distillation improves credit assignment for long-horizon LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how multi-turn agents trained with reinforcement learning struggle to assign credit across many actions when given only a final success or failure signal. It tests five different sources of per-step environmental feedback and two ways of inserting that feedback into the training process. The central proposal is SERL, a framework that lets the task reward decide the direction of each parameter update while environment signals control where and how strongly to apply the update. This selective approach focuses learning on critical actions rather than treating all steps equally. Experiments on ALFWorld and WebShop show higher success rates than both standard RL methods and non-selective distillation baselines.

Core claim

SERL determines update direction from the task reward and uses environment feedback to set the timing and magnitude of updates, thereby concentrating learning on the actions that matter most for solving multi-turn tasks.
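
The split between direction and magnitude can be illustrated with a minimal sketch. It assumes a single trajectory-level advantage derived from the task reward and a list of non-negative per-step weights derived from environment feedback; the function name and this exact multiplicative form are illustrative assumptions, not the paper's stated formulation.

```python
from typing import List


def reweighted_step_advantages(task_advantage: float,
                               step_weights: List[float]) -> List[float]:
    """Scale one trajectory-level advantage by per-step weights.

    task_advantage: signed scalar from the task reward (sets update direction).
    step_weights:   non-negative per-step weights from environment feedback
                    (set where and how strongly the update applies).
    Both the name and the multiplicative form are hypothetical.
    """
    if any(w < 0 for w in step_weights):
        # Negative weights would let feedback flip the direction,
        # which the description above reserves for the task reward.
        raise ValueError("step weights must be non-negative")
    return [task_advantage * w for w in step_weights]
```

Because the weights are constrained to be non-negative, every step's signal keeps the sign of the task advantage, and a weight of zero leaves a step out of the update entirely.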

What carries the argument

SERL, a selective environment-reweighted learning framework that reweights the magnitude and placement of hindsight updates according to per-step environment signals while keeping the task reward as the sole source of update direction.

If this is right

  • Grounded, action-relevant feedback inserted at selected points outperforms both trajectory-level rewards and indiscriminate use of longer context.
  • The same framework reaches 90.0% success on ALFWorld and 80.1% on WebShop, above the strong RL and distillation baselines it is compared against.
  • Feedback sources such as error messages, page changes, observations, and reference trajectories are not equally useful; grounded feedback tied to the current action is what consistently helps.
  • The method remains compatible with existing RL and distillation pipelines because it only changes how feedback modulates update strength and timing.
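
To make the idea of action-relevant feedback concrete, the sketch below maps per-step feedback records to step weights. The record fields, the rule order, and the weight values (2.0, 1.0, 0.0) are all hypothetical choices for illustration; the abstract does not say how SERL converts feedback into weights.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepFeedback:
    # Hypothetical fields; real environments expose feedback differently.
    error_message: bool   # the environment rejected or flagged the action
    state_changed: bool   # the page or observation changed after the action


def step_weight(fb: StepFeedback) -> float:
    """Assign a larger weight to steps whose feedback is tied to the action."""
    if fb.error_message:
        return 2.0        # assumed: explicit errors mark a critical step
    if fb.state_changed:
        return 1.0        # assumed: a visible effect marks a relevant step
    return 0.0            # no action-relevant signal, so skip the step


def trajectory_weights(feedback: List[StepFeedback]) -> List[float]:
    """Compute one weight per step of a trajectory."""
    return [step_weight(fb) for fb in feedback]
```

A mapping like this only changes how strongly each step is updated, which is why it can sit in front of an existing RL or distillation loss without altering the loss itself.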

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selective-reweighting logic could be applied to other sparse-reward domains such as robotics or game playing where intermediate observations are available but rarely used for credit assignment.
  • If environment feedback can be generated cheaply at inference time, the approach may reduce the need for expensive human demonstrations in multi-turn agent training.

Load-bearing premise

Environment feedback can be parsed reliably enough to mark truly critical actions without adding noise or selection bias that would weaken the task reward signal.

What would settle it

On a held-out multi-turn task, applying the same total amount of feedback uniformly across all steps produces success rates equal to or higher than the selective placement used in SERL.
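
The control described here needs a uniform weighting that spends the same total amount of feedback as the selective one. A minimal sketch of building that control, assuming feedback is expressed as per-step weights (the function name is illustrative):

```python
from typing import List


def uniform_control(selective_weights: List[float]) -> List[float]:
    """Spread the total selective weight evenly over all steps.

    The result has the same length and the same sum as the input, so any
    difference in outcome comes from placement, not from total weight.
    """
    n = len(selective_weights)
    if n == 0:
        return []
    return [sum(selective_weights) / n] * n
```

Training once with the selective weights and once with `uniform_control(...)` of the same weights would isolate the effect of placement that the test above targets.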

Original abstract

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SERL, a selective environment-reweighted learning framework for training multi-turn LLM agents from sparse task rewards. SERL uses the task reward to set update direction while per-step environmental feedback (error messages, page changes, observations, reference trajectories) determines placement and magnitude, targeting critical actions. The authors systematically examine five feedback sources and two insertion granularities, reporting 90.0% success on ALFWorld and 80.1% on WebShop, outperforming strong RL and distillation baselines. Analysis indicates that grounded, action-relevant feedback at meaningful points outperforms indiscriminate use of richer context.

Significance. If the results hold under scrutiny, the work provides a concrete method for improving credit assignment in long-horizon agent RL by selectively leveraging available environmental signals rather than relying solely on trajectory-level or proxy rewards. The empirical comparison across feedback types offers actionable guidance for multi-turn settings and could influence practical agent training pipelines.

major comments (2)
  1. Abstract: the headline success rates (90.0% ALFWorld, 80.1% WebShop) are reported without error bars, variance estimates, or statistical significance tests against the RL and distillation baselines. This omission is load-bearing for the central empirical claim and prevents assessment of whether the gains are reliable or could be explained by run-to-run variability.
  2. Abstract: the systematic study of five feedback sources and two granularities is described, yet no quantitative check is provided that the selected action subsets remain unbiased with respect to the sparse task reward. Without such a check (e.g., reward distribution comparison between reweighted and non-reweighted trajectories), the reweighting step risks amplifying spurious correlations from deterministic environment signals rather than improving true credit assignment.
minor comments (2)
  1. The abstract mentions 'strong RL and distillation baselines' but does not name them; adding the specific method names and citations would improve readability.
  2. Clarify the precise meaning of 'insertion granularities' at first use, as the term is not standard and affects interpretation of the ablation.
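
The first major comment asks for variance estimates and significance tests. One way such a check could be run is sketched below, assuming each method has per-seed success rates obtained with matched seeds. The function names are illustrative, and the exact sign-flip permutation test is one reasonable choice, not a test the paper is known to use.

```python
import itertools
import statistics
from typing import List, Tuple


def mean_and_std(rates: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed success rates."""
    return statistics.mean(rates), statistics.stdev(rates)


def paired_sign_flip_p_value(method: List[float],
                             baseline: List[float]) -> float:
    """Two-sided exact sign-flip test on paired per-seed differences."""
    if len(method) != len(baseline):
        raise ValueError("method and baseline need one value per shared seed")
    diffs = [m - b for m, b in zip(method, baseline)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to have either sign, so enumerate every sign assignment.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:   # tolerance for float rounding
            extreme += 1
    return extreme / total
```

With n paired seeds the smallest two-sided p-value this test can return is 2/2^n, so three seeds cannot go below 0.25; the number of seeds therefore matters as much as the size of the gap.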

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of empirical rigor and potential biases in our reweighting approach. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: Abstract: the headline success rates (90.0% ALFWorld, 80.1% WebShop) are reported without error bars, variance estimates, or statistical significance tests against the RL and distillation baselines. This omission is load-bearing for the central empirical claim and prevents assessment of whether the gains are reliable or could be explained by run-to-run variability.

    Authors: We agree that reporting error bars, variance estimates, and statistical significance tests is essential for assessing the reliability of the reported gains. The full manuscript includes results from multiple random seeds, but these details were omitted from the abstract for brevity. In the revised version, we will expand the abstract to include mean success rates with standard deviations and note the results of paired statistical tests against the baselines. revision: yes

  2. Referee: Abstract: the systematic study of five feedback sources and two granularities is described, yet no quantitative check is provided that the selected action subsets remain unbiased with respect to the sparse task reward. Without such a check (e.g., reward distribution comparison between reweighted and non-reweighted trajectories), the reweighting step risks amplifying spurious correlations from deterministic environment signals rather than improving true credit assignment.

    Authors: This is a valid concern regarding potential bias in the selective reweighting. While our analysis in Section 4.3 demonstrates that SERL improves credit assignment through targeted use of grounded feedback, we did not include an explicit comparison of reward distributions between reweighted and original trajectories. We will add this quantitative check (e.g., histograms or statistical comparison of task rewards) to the revised manuscript to confirm that the reweighting preserves the original reward signal while focusing on critical steps. revision: yes
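
The reward-distribution check discussed in the second exchange could look roughly like the sketch below. It assumes binary task rewards (1 for success, 0 for failure) and treats a trajectory as reweighted when any of its steps has a positive weight; both assumptions and the function names are illustrative, not taken from the paper.

```python
from typing import List, Sequence


def success_rate(rewards: Sequence[int]) -> float:
    """Fraction of trajectories with task reward 1."""
    return sum(rewards) / len(rewards)


def selection_reward_gap(rewards: List[int],
                         step_weights: List[List[float]]) -> float:
    """Success rate of reweighted trajectories minus non-reweighted ones.

    A gap far from zero suggests that the feedback-based selection is
    correlated with the task outcome and may bias the update.
    """
    reweighted = [r for r, w in zip(rewards, step_weights)
                  if any(x > 0 for x in w)]
    untouched = [r for r, w in zip(rewards, step_weights)
                 if not any(x > 0 for x in w)]
    if not reweighted or not untouched:
        raise ValueError("need at least one trajectory in each group")
    return success_rate(reweighted) - success_rate(untouched)
```

A gap near zero would support the claim that reweighting preserves the original reward signal, while a large gap would point to the selection bias the referee describes.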

Circularity Check

0 steps flagged

No circularity: empirical framework with external benchmark validation

full rationale

The paper introduces SERL as an empirical selective environment-reweighted learning method that uses task rewards to set update direction and per-step environment feedback (error messages, observations, etc.) to adjust placement and magnitude. No equations, derivations, or first-principles claims appear in the provided abstract or described results. Success rates (90.0% ALFWorld, 80.1% WebShop) are reported against external RL and distillation baselines on standard benchmarks; these outcomes are not shown to reduce by construction to any fitted parameter or self-defined quantity within the method. Systematic study of five feedback sources is presented as experimental design rather than a mathematical reduction. The central claims therefore remain self-contained against external evaluations and do not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; SERL is introduced as a framework relying on standard RL concepts and benchmark environments.

pith-pipeline@v0.9.0 · 5716 in / 1080 out tokens · 39680 ms · 2026-05-20T05:38:00.206272+00:00 · methodology


