HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

Sung Ju Hwang; Taekyung Ki; Woongyeng Yeo; Yumin Choi

arXiv: 2605.17873 · v1 · pith:CGXGEVAInew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-20 12:13 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords long-horizon agents · self-distillation · hindsight · LLM agents · reinforcement learning · sparse rewards · feedback distillation · agent training

The pith

Full-trajectory hindsight selects failure-relevant actions for targeted self-distillation, improving long-horizon agent performance by up to 18.80% over a dense per-turn feedback baseline with 2.26× lower time per training step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that long-horizon LLM agents trained with reinforcement learning benefit when feedback is distilled only onto the precise action spans identified as causing failure through analysis of the entire completed trajectory. This matters to a sympathetic reader because sparse outcome rewards give no guidance on which steps to fix, while dense per-turn feedback wastes effort on successful or neutral turns and often arrives at the wrong moment. If the claim holds, agents would reach higher task completion rates on complex benchmarks while requiring less computation per training step.

Core claim

HINT-SD is a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on those targeted action spans rather than every turn. On BFCL v3 and AppWorld this yields up to 18.80 percent higher performance than the dense per-turn feedback baseline together with 2.26 times lower time per training step.

What carries the argument

The central mechanism is hindsight-based selection of failure-relevant action spans followed by feedback-conditioned self-distillation restricted to those spans.
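The mechanism above can be illustrated with a minimal sketch of a span-restricted distillation loss. This page does not give the paper's exact loss, so the per-token KL divergence, the function names, and the `[start, end)` span convention are all assumptions made for illustration; the only part taken from the source is that the loss is applied solely on the selected action spans.

```python
import math

def span_mask(num_tokens, spans):
    """Return a 0/1 mask with ones on tokens inside any [start, end) span."""
    mask = [0] * num_tokens
    for start, end in spans:
        for i in range(max(0, start), min(num_tokens, end)):
            mask[i] = 1
    return mask

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def targeted_distillation_loss(teacher_probs, student_probs, spans):
    """Mean per-token KL(teacher || student), restricted to the selected spans.

    teacher_probs: per-token distributions from a feedback-conditioned pass (assumed).
    student_probs: per-token distributions from the plain policy pass (assumed).
    spans: hypothetical list of (start, end) token index pairs chosen by hindsight.
    """
    mask = span_mask(len(student_probs), spans)
    selected = [
        kl_divergence(t, s)
        for t, s, m in zip(teacher_probs, student_probs, mask)
        if m
    ]
    return sum(selected) / len(selected) if selected else 0.0

# Toy example: four tokens, two-way vocabulary; only token 1 is targeted.
teacher = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
student = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
loss_targeted = targeted_distillation_loss(teacher, student, spans=[(1, 2)])
loss_dense = targeted_distillation_loss(teacher, student, spans=[(0, 4)])
```

In this toy setup the dense variant averages over every token, including those where teacher and student already agree, while the targeted variant only sees the selected span.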

If this is right

Long-horizon agents reach higher task success rates when supervision focuses only on the actions that actually contributed to failure.
Training steps become faster because distillation is skipped on turns that were already successful or neutral.
Precise alignment between feedback and causal actions matters more than the volume of feedback supplied at every turn.
The same selective approach scales better to longer sequences than uniform dense feedback methods.
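To make the efficiency point in the list above concrete, the reported 2.26× factor can be converted into a fraction of the baseline step time. The snippet below is plain arithmetic on the reported factor; the function name is illustrative, and nothing here says how much of the saving comes from skipped turns.

```python
def time_fraction(speedup):
    """Fraction of the baseline step time that remains after a given speedup factor."""
    return 1.0 / speedup

speedup = 2.26  # factor reported for time per training step
fraction = time_fraction(speedup)
reduction = 1.0 - fraction
print(f"{fraction:.3f} {reduction:.3f}")  # prints "0.442 0.558"
```

A step therefore takes roughly 44% of the dense-baseline time, a reduction of about 56% per step.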

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hindsight-targeting idea could be tested in other sparse-reward sequential decision tasks outside LLM agents.
Pairing this method with cheaper ways to generate the initial feedback might further reduce overall supervision cost.
Adding an independent check on whether the selected spans truly caused the failure could make the gains more robust.

Load-bearing premise

Hindsight review of the completed trajectory can reliably locate the exact action spans that caused failure without introducing new errors or overlooking important causal links.

What would settle it

An experiment that replaces the hindsight selection step with random choice of action spans and still obtains the same performance gains would show that targeted selection is not required for the reported improvements.
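A minimal sketch of such a control, assuming selection happens at the turn level: for each failed trajectory, draw as many random turns as hindsight selected, so that the number of supervised turns stays matched. The function name and the seeding scheme are hypothetical, and matching is by turn count only, not by token count.

```python
import random

def random_turn_control(num_turns, selected_turns, seed=0):
    """Pick as many random turns as hindsight selected, uniformly without replacement.

    num_turns: length of the trajectory in turns.
    selected_turns: turn indices chosen by the hindsight analyzer (assumed input).
    seed: fixed seed so the control is reproducible.
    """
    rng = random.Random(seed)
    k = min(len(set(selected_turns)), num_turns)
    return sorted(rng.sample(range(num_turns), k))

# Example: a 12-turn trajectory where hindsight selected three turns.
control = random_turn_control(12, [2, 5, 7], seed=0)
```

If training on `control` turns reproduced the gains obtained with the hindsight-selected turns, the improvement would be attributable to reduced distillation volume rather than to targeting.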

Figures

Figures reproduced from arXiv: 2605.17873 by Sung Ju Hwang, Taekyung Ki, Woongyeng Yeo, Yumin Choi.

**Figure 1.** (Left) Per-epoch accuracy scores on the BFCL v3 eval split. (Middle) Time per training step. (Right) Peak GPU memory usage during the first epoch of training.

[Plot: Target Turn Distribution. x-axis: Epoch; y-axis: Frequency; legend: Target Turns 1-3, 4-8, 9+.]

**Figure 3.** Figure 3 aggregates selected hindsight target turns, complementing the epoch-wise regions in …

**Figure 4.** Prompt template for multi-step hindsight feedback generation in HINT-SD-Multi. Given a complete failed AppWorld trajectory, the analyzer selects up to {max_steps} failure-relevant steps and returns localized corrective feedback for each selected step. SYSTEM: You analyze failed AppWorld tool-use trajectories. Identify the FIRST step where the agent made a mistake. Write the feedback in less than three sent…

**Figure 5.** Prompt template for single-step hindsight feedback generation in HINT-SD-Single. Given a complete failed AppWorld trajectory, the analyzer identifies the earliest failure-relevant step and returns a concise correction for the step.

**Figure 6.** Qualitative example of selected hindsight target turns from an AppWorld training rollout. The abbreviated trajectory context shows that the analyzer localizes feedback to the actions where the agent loses the authenticated Spotify state, rather than applying the same feedback globally at the beginning of the trajectory.

**Figure 7.** Qualitative comparison on a BFCL task. The task-matched rollouts share the same failure pattern: the booking is never created, so later booking-dependent tool calls fail. Global hindsight gives one episode-level correction, while HINT-SD attaches the same root cause to the concrete turns where it first appears and then propagates.

**Figure 8.** Qualitative comparison on an AppWorld task. The trajectory excerpt is from the selected-target run. The selected-turn feedback exposes early actionable errors in API use and authentication, instead of only summarizing a later episode-level failure.

Original abstract

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80 percent while achieving 2.26× lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HINT-SD shows practical efficiency gains from targeting self-distillation via hindsight span selection, but the accuracy of that selection step lacks direct checks.

The headline point here is that HINT-SD uses full-trajectory hindsight to pick out the specific action spans that likely caused failures and then does feedback-conditioned distillation only on those. This gives up to 18.8% better results and 2.26 times lower time per step than dense per-turn feedback on BFCL v3 and AppWorld. The approach avoids wasting effort on neutral or successful turns while trying to align supervision with the actual failure points.

What stands out as new is the shift from applying distillation everywhere or at fixed points to a targeted selection based on hindsight analysis of the complete run. The abstract positions this as addressing inefficiency in generating feedback for neutral or successful turns and misalignment in supervision. It builds on prior work in feedback-conditioned distillation but refines it by making the application selective rather than uniform.

The paper does a solid job laying out the problem with sparse rewards in long-horizon agents and showing empirical gains on two relevant benchmarks. The efficiency claim is particularly useful for practical training pipelines where compute is a bottleneck and data collection is expensive.

That said, the soft spot is the lack of direct validation for the hindsight selection step. The improvements rest on the assumption that the method accurately identifies the failure-relevant actions without missing key links or picking wrong ones. Without ablations comparing hindsight spans to random ones or metrics on selection accuracy like precision against oracle points, it's hard to rule out that the gains come simply from distilling less data rather than better targeted supervision. The abstract doesn't mention any such checks, and if the full paper has them, they would help a lot.

This paper is aimed at researchers working on LLM-based agents and self-distillation techniques for reinforcement learning in complex tasks. Readers interested in efficient supervision for long sequences would get value from the targeted approach, especially if they are dealing with expensive training runs.

I'd recommend sending it for peer review. The core idea is practical and the results are reported with specific numbers, so referees can dig into the methods and suggest improvements on the validation side. It seems like a reasonable incremental step that could influence how people apply supervision in agent training.

Referee Report

1 major / 1 minor

Summary. The paper proposes HINT-SD, a targeted self-distillation framework for long-horizon LLM agents. It uses full-trajectory hindsight to identify failure-relevant action spans and applies feedback-conditioned distillation only on those spans rather than dense per-turn feedback. Experiments on BFCL v3 and AppWorld report up to 18.80% improvement over the dense baseline and 2.26× lower time per training step.

Significance. If the hindsight selection step is reliable, the method offers a practical way to improve both effectiveness and efficiency in sparse-reward agent training by concentrating supervision on causally relevant actions. The reported speed-up is a notable strength for scaling to longer trajectories.

major comments (1)

[Method] Method section (as described in the abstract and method overview): The central claim that hindsight reliably isolates the precise action spans responsible for failure lacks any quantitative validation, such as precision/recall against oracle failure points, inter-annotator agreement, or ablation comparing hindsight spans to random spans. This is load-bearing because inaccurate selection would make targeted distillation no more informative than dense feedback, and the observed gains could arise from reduced distillation volume rather than better supervision.
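The precision/recall check requested here could look like the following sketch, assuming oracle failure turns are available as a set of turn indices (for example from human annotation). The function name and the input format are hypothetical; this page does not describe any such annotation.

```python
def selection_precision_recall(selected, oracle):
    """Precision and recall of hindsight-selected turns against oracle failure turns.

    selected: turn indices chosen by the hindsight analyzer (assumed input).
    oracle: turn indices marked as true failure points (assumed annotation).
    """
    selected, oracle = set(selected), set(oracle)
    hits = len(selected & oracle)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(oracle) if oracle else 0.0
    return precision, recall

# Toy example: three selected turns, two oracle failure turns, one overlap.
precision, recall = selection_precision_recall([2, 5, 7], [2, 3])
```

Here one of three selected turns is an oracle failure turn (precision 1/3), and one of the two oracle turns was found (recall 1/2).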

minor comments (1)

[Abstract] Abstract: The concrete percentage improvements and speed-up factor are presented without details on the number of runs, seed variance, or statistical tests, which would help assess robustness.
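One way to report the run-to-run variation asked for here is the mean and sample standard deviation over seeds. The sketch below uses Python's standard `statistics` module; the scores are placeholder values chosen for the example, not results from the paper.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Placeholder accuracies for three hypothetical seeds (not from the paper).
placeholder_scores = [0.40, 0.44, 0.42]
mean_score, std_score = summarize_runs(placeholder_scores)
```

Reporting each headline number as mean ± standard deviation over a stated number of seeds would let readers judge whether the gap to the dense baseline exceeds seed noise.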

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on validation of the hindsight selection mechanism below.

Referee: [Method] Method section (as described in the abstract and method overview): The central claim that hindsight reliably isolates the precise action spans responsible for failure lacks any quantitative validation, such as precision/recall against oracle failure points, inter-annotator agreement, or ablation comparing hindsight spans to random spans. This is load-bearing because inaccurate selection would make targeted distillation no more informative than dense feedback, and the observed gains could arise from reduced distillation volume rather than better supervision.

Authors: We agree that direct quantitative validation of the hindsight span selection would strengthen the central claim. The current manuscript reports end-to-end gains (up to 18.80% over dense feedback) and efficiency improvements (2.26× lower time per step) on BFCL v3 and AppWorld, which are consistent with effective targeting, but does not include precision/recall against oracles, inter-annotator agreement, or an explicit random-span ablation. In the revised version we will add an ablation that applies feedback-conditioned distillation to random spans of matched average length, together with statistics on selected span lengths and a more detailed description of the hindsight identification procedure. These additions will help isolate whether gains derive from targeted supervision rather than reduced distillation volume.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical gains reported independently of method internals

The paper defines HINT-SD as a targeted hindsight self-distillation procedure that selects failure-relevant action spans from full trajectories and applies feedback-conditioned distillation only to those spans. All performance numbers (up to 18.80% improvement, 2.26× lower time per step) are presented strictly as measured experimental outcomes on BFCL v3 and AppWorld against a dense per-turn baseline. No equations, fitted parameters, or self-citations are invoked to derive these quantities algebraically from the method definition itself. The central claim therefore remains an empirical observation rather than a self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that full-trajectory hindsight can be used to accurately localize failure causes and that the resulting feedback is suitable for distillation. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: Full-trajectory hindsight can reliably identify failure-relevant action spans without introducing selection bias or missing causal steps.
This premise is required for the targeted distillation to be more effective than uniform per-turn feedback.

pith-pipeline@v0.9.0 · 5729 in / 1425 out tokens · 37184 ms · 2026-05-20T12:13:06.515614+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "HINT-SD analyzes the full rollout to produce a sparse set of failure-relevant steps together with corrective feedback... applies a distillation loss only to the selected action spans"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Paper passage: "relevance-sparsity problem: in a failed trajectory, only a small subset of actions may require correction"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

