What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
Pith reviewed 2026-05-20 05:38 UTC · model grok-4.3
The pith
Selective use of environment feedback in hindsight distillation improves credit assignment for long-horizon LLM agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SERL determines update direction from the task reward and uses environment feedback to set the timing and magnitude of updates, thereby concentrating learning on the actions that matter most for solving multi-turn tasks.
What carries the argument
SERL, a selective environment-reweighted learning framework, reweights the magnitude and placement of hindsight updates according to per-step environment signals while keeping the task reward as the sole source of update direction.
If this is right
- Grounded, action-relevant feedback inserted at selected points outperforms both trajectory-level rewards and indiscriminate use of longer context.
- The same framework reaches 90.0% success on ALFWorld and 80.1% on WebShop, outperforming strong RL and distillation baselines.
- Different feedback sources (error messages, page changes, observations, reference trajectories) are not equally useful; only those tied to the current action improve results.
- The method remains compatible with existing RL and distillation pipelines because it only changes how feedback modulates update strength and timing.
Where Pith is reading between the lines
- The same selective-reweighting logic could be applied to other sparse-reward domains such as robotics or game playing where intermediate observations are available but rarely used for credit assignment.
- If environment feedback can be generated cheaply at inference time, the approach may reduce the need for expensive human demonstrations in multi-turn agent training.
Load-bearing premise
Environment feedback can be parsed reliably enough to mark truly critical actions without adding noise or selection bias that would weaken the task reward signal.
What would settle it
On a held-out multi-turn task, applying the same total amount of feedback uniformly across all steps produces success rates equal to or higher than the selective placement used in SERL.
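One way to set up the uniform control described above is sketched here. It assumes that "total amount of feedback" can be read as the sum of per-step feedback weights, which is an interpretation and not something the abstract specifies; the function name `uniform_control_weights` is hypothetical.

```python
from typing import List


def uniform_control_weights(selective_weights: List[float]) -> List[float]:
    """Spread the same total weight evenly over every step.

    Assumption (not from the paper): the feedback budget of a trajectory
    is the sum of its per-step weights, so the control keeps that sum
    fixed and only removes the selective placement.
    """
    n_steps = len(selective_weights)
    if n_steps == 0:
        raise ValueError("trajectory must contain at least one step")
    per_step = sum(selective_weights) / n_steps
    return [per_step] * n_steps
```

Training once with the selective weights and once with the output of this function would isolate the effect of placement from the effect of the overall feedback budget.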
Original abstract
Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SERL, a selective environment-reweighted learning framework for training multi-turn LLM agents from sparse task rewards. SERL uses the task reward to set update direction while per-step environmental feedback (error messages, page changes, observations, reference trajectories) determines placement and magnitude, targeting critical actions. The authors systematically examine five feedback sources and two insertion granularities, reporting 90.0% success on ALFWorld and 80.1% on WebShop, outperforming strong RL and distillation baselines. Analysis indicates that grounded, action-relevant feedback at meaningful points outperforms indiscriminate use of richer context.
Significance. If the results hold under scrutiny, the work provides a concrete method for improving credit assignment in long-horizon agent RL by selectively leveraging available environmental signals rather than relying solely on trajectory-level or proxy rewards. The empirical comparison across feedback types offers actionable guidance for multi-turn settings and could influence practical agent training pipelines.
major comments (2)
- Abstract: the headline success rates (90.0% ALFWorld, 80.1% WebShop) are reported without error bars, variance estimates, or statistical significance tests against the RL and distillation baselines. This omission is load-bearing for the central empirical claim and prevents assessment of whether the gains are reliable or could be explained by run-to-run variability.
- Abstract: the systematic study of five feedback sources and two granularities is described, yet no quantitative check is provided that the selected action subsets remain unbiased with respect to the sparse task reward. Without such a check (e.g., reward distribution comparison between reweighted and non-reweighted trajectories), the reweighting step risks amplifying spurious correlations from deterministic environment signals rather than improving true credit assignment.
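The reporting requested in the first major comment could look roughly like the sketch below. It assumes that the method and each baseline are run with the same set of random seeds, so that runs can be paired by seed; the function name and dictionary keys are hypothetical, and only the Python standard library is used. It returns the paired t statistic and degrees of freedom and leaves the p-value lookup to a statistics package.

```python
import math
import statistics
from typing import Dict, List


def summarize_paired_runs(method: List[float], baseline: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and paired t statistic over seeds.

    Assumption: method[i] and baseline[i] are success rates obtained
    with the same seed i.
    """
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need at least two seed-matched runs per system")
    diffs = [m - b for m, b in zip(method, baseline)]
    diff_std = statistics.stdev(diffs)
    if diff_std == 0:
        raise ValueError("identical per-seed differences; t statistic is undefined")
    return {
        "method_mean": statistics.mean(method),
        "method_std": statistics.stdev(method),
        "baseline_mean": statistics.mean(baseline),
        "baseline_std": statistics.stdev(baseline),
        "mean_diff": statistics.mean(diffs),
        "t_statistic": statistics.mean(diffs) / (diff_std / math.sqrt(len(diffs))),
        "dof": len(diffs) - 1,
    }
```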
minor comments (2)
- The abstract mentions 'strong RL and distillation baselines' but does not name them; adding the specific method names and citations would improve readability.
- Clarify the precise meaning of 'insertion granularities' at first use, as the term is not standard and affects interpretation of the ablation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of empirical rigor and potential biases in our reweighting approach. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: Abstract: the headline success rates (90.0% ALFWorld, 80.1% WebShop) are reported without error bars, variance estimates, or statistical significance tests against the RL and distillation baselines. This omission is load-bearing for the central empirical claim and prevents assessment of whether the gains are reliable or could be explained by run-to-run variability.
  Authors: We agree that reporting error bars, variance estimates, and statistical significance tests is essential for assessing the reliability of the reported gains. The full manuscript includes results from multiple random seeds, but these details were omitted from the abstract for brevity. In the revised version, we will expand the abstract to include mean success rates with standard deviations and note the results of paired statistical tests against the baselines.
  Revision: yes
- Referee: Abstract: the systematic study of five feedback sources and two granularities is described, yet no quantitative check is provided that the selected action subsets remain unbiased with respect to the sparse task reward. Without such a check (e.g., reward distribution comparison between reweighted and non-reweighted trajectories), the reweighting step risks amplifying spurious correlations from deterministic environment signals rather than improving true credit assignment.
  Authors: This is a valid concern regarding potential bias in the selective reweighting. While our analysis in Section 4.3 demonstrates that SERL improves credit assignment through targeted use of grounded feedback, we did not include an explicit comparison of reward distributions between reweighted and original trajectories. We will add this quantitative check (e.g., histograms or statistical comparison of task rewards) to the revised manuscript to confirm that the reweighting preserves the original reward signal while focusing on critical steps.
  Revision: yes
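The reward-distribution comparison discussed in the second exchange could be approximated with a two-sample permutation test on mean task reward. This is a sketch under assumptions, not the check the authors commit to: it assumes each trajectory carries a scalar task reward and a flag saying whether reweighting was applied, and the function name is hypothetical.

```python
import random
import statistics
from typing import Sequence


def permutation_test_mean_diff(
    rewards_a: Sequence[float],
    rewards_b: Sequence[float],
    n_permutations: int = 10000,
    seed: int = 0,
) -> float:
    """Two-sided permutation p-value for a difference in mean reward."""
    if not rewards_a or not rewards_b:
        raise ValueError("both groups must be non-empty")
    rng = random.Random(seed)
    observed = abs(statistics.fmean(rewards_a) - statistics.fmean(rewards_b))
    pooled = list(rewards_a) + list(rewards_b)
    n_a = len(rewards_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the gap.
        rng.shuffle(pooled)
        gap = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if gap >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

A small p-value would indicate that reweighted and non-reweighted trajectories differ in task reward, which is the kind of selection effect the referee is worried about. A large p-value does not by itself prove the absence of bias.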
Circularity Check
No circularity: empirical framework with external benchmark validation
Full rationale
The paper introduces SERL as an empirical selective environment-reweighted learning method that uses task rewards to set update direction and per-step environment feedback (error messages, observations, etc.) to adjust placement and magnitude. No equations, derivations, or first-principles claims appear in the provided abstract or described results. Success rates (90.0% ALFWorld, 80.1% WebShop) are reported against external RL and distillation baselines on standard benchmarks; these outcomes are not shown to reduce by construction to any fitted parameter or self-defined quantity within the method. Systematic study of five feedback sources is presented as experimental design rather than a mathematical reduction. The central claims therefore remain self-contained against external evaluations and do not trigger any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Cost.FunctionalEquation, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: SERL uses an environment-conditioned teacher to selectively sharpen the RL objective on the actions that matter... the task reward determines the update direction, while environment feedback adjusts only the placement and magnitude of that update.
- Foundation.AlphaCoordinateFixation, J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: The final token advantage is Ã_{t,i} = A_t ((1−α_k) + α_k w̄_{t,i})
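The quoted token-advantage formula can be written out as a short sketch. The formula itself comes from the passage; the reading of A_t as a step-level advantage, α_k as a mixing coefficient in [0, 1], and w̄_{t,i} as a non-negative per-token weight are assumptions made here for illustration, as are all function and parameter names.

```python
from typing import List


def reweighted_token_advantages(
    step_advantage: float,
    token_weights: List[float],
    alpha: float,
) -> List[float]:
    """Compute A_t * ((1 - alpha) + alpha * w_bar) for each token weight.

    Assumptions (not stated in the quoted passage): alpha lies in [0, 1]
    and every weight is non-negative.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    if any(w < 0 for w in token_weights):
        raise ValueError("token weights are assumed to be non-negative")
    return [step_advantage * ((1.0 - alpha) + alpha * w) for w in token_weights]
```

Under these assumptions the scaling factor is never negative, so each token advantage keeps the sign of A_t or becomes zero. That matches the claim that the task reward alone sets the update direction while feedback changes only its size. With `alpha = 0` the weights drop out and every token receives A_t unchanged.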
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.