AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Pith reviewed 2026-05-15 06:46 UTC · model grok-4.3
The pith
AgentHER adapts hindsight experience replay to turn failed LLM agent trajectories into useful training data by relabeling them with alternative achievable goals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentHER recovers this lost signal by adapting Hindsight Experience Replay to natural-language agent trajectories: a trajectory that fails goal A is often a correct demonstration for an achievable alternative goal B. The four-stage pipeline of failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging converts discarded failures into SFT, DPO, and ShareGPT training data, delivering 7.6-11.4% gains over success-only SFT and 2x sample efficiency across model families while beating the strongest baseline by 3.0-6.2%.
What carries the argument
Four-stage pipeline that classifies failures, extracts outcomes, performs LLM-guided relabeling with confidence gating and cross-model verification, then packages the results as training data.
If this is right
- Agents trained on the expanded dataset reach higher success rates on held-out WebArena and ToolBench tasks than those trained only on successes.
- The method achieves roughly twice the sample efficiency of success-only training.
- Performance exceeds the strongest prior experience-centric baseline by 3-6 percentage points across four model families.
- Noise reduction techniques bring label error down to 2.9% and human-rated precision above 96%.
Where Pith is reading between the lines
- The same relabeling idea could be applied to other sequential tasks where partial progress toward one goal constitutes success at another.
- Integrating the generated data into online reinforcement learning loops might further reduce the need for human demonstrations.
- The low reported per-trajectory cost suggests the pipeline can scale to much larger trajectory collections without prohibitive overhead.
Load-bearing premise
LLM-guided relabeling with confidence gating and cross-model verification produces alternative goals that are both valid and useful for downstream training without systematic bias or noise that would degrade model performance.
What would settle it
Training agents on the relabeled dataset and observing no improvement or a drop in success rate on the strict task-disjoint held-out sets relative to success-only training.
read the original abstract
LLM-agent training pipelines routinely discard failed trajectories even though GPT-4o achieves only 14-20% on WebArena and below 55% pass@1 on ToolBench; even specialised systems at 50-65% leave the majority of trajectories unused. We introduce AgentHER, which recovers this lost signal by adapting Hindsight Experience Replay (HER) to natural-language agent trajectories: a trajectory that fails goal A is often a correct demonstration for an achievable alternative goal B. AgentHER realises this through a four-stage pipeline (failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging) that converts discarded failures into SFT, DPO, and ShareGPT training data. On WebArena and ToolBench under a strict task-disjoint held-out protocol, AgentHER improves over success-only SFT by +7.6-11.4% across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), achieves 2x sample efficiency, and beats the strongest experience-centric baseline (Agent Workflow Memory) by +3.0-6.2%. Two robustness mechanisms, failure-severity weighting and cross-model multi-judge verification (gpt-4o-mini paired with Qwen2.5-72B-Instruct), reduce label noise from 5.9% to 2.9% and raise human-rated relabeling precision to 97.1% on WebArena and 96.0% on ToolBench. A full system-cost audit shows the entire relabeling pipeline costs 2.98 and 26 wall-clock minutes for 3,000 trajectories, i.e. 1.4 x 10^-3 per accepted pair. Code: https://github.com/alphadl/AgentHER
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AgentHER, adapting Hindsight Experience Replay to LLM agent trajectories via a four-stage pipeline (failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging) that converts failed trajectories into SFT/DPO/ShareGPT data. Under task-disjoint held-out splits on WebArena and ToolBench, it reports +7.6-11.4% gains over success-only SFT across four model families, 2x sample efficiency, and +3.0-6.2% over the strongest baseline, with robustness mechanisms (failure-severity weighting, cross-model verification) reducing noise from 5.9% to 2.9% and achieving 96-97% human precision at low cost.
Significance. If the relabeling produces valid and unbiased demonstrations, this could substantially improve data efficiency for LLM agent training by salvaging the majority of trajectories that are failures. The concrete gains, sample-efficiency results, cost audit, and robustness checks (noise reduction, human precision) are strengths that would make the work practically relevant if the central assumption holds.
major comments (2)
- [Relabeling pipeline and experimental results] The cross-model verification and human precision metrics (96-97%) reduce reported label noise to 2.9%, but do not rule out systematic distributional bias in the LLM-relabeled goals (e.g., preference for simpler or more frequent alternatives). This is load-bearing for the central claim, as such bias could produce the observed gains on held-out tasks without reflecting true hindsight benefits; an explicit check (e.g., goal-distribution comparison or ablation on biased vs. unbiased relabeling) is needed.
- [Experimental evaluation] Statistical significance tests for the reported improvements (+7.6-11.4% and +3.0-6.2%) are absent, and full experimental details plus exact relabeling prompts are not provided. This weakens confidence in the robustness of the gains under the task-disjoint protocol.
minor comments (2)
- [Methods] Clarify the exact implementation of failure-severity weighting within the data-packaging stage to make the robustness mechanisms fully reproducible.
- [Appendix or main text] The code link is provided, but including a small example of a relabeled trajectory pair in the main text or appendix would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's practical relevance. We address each major comment below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: The cross-model verification and human precision metrics (96-97%) reduce reported label noise to 2.9%, but do not rule out systematic distributional bias in the LLM-relabeled goals (e.g., preference for simpler or more frequent alternatives). This is load-bearing for the central claim, as such bias could produce the observed gains on held-out tasks without reflecting true hindsight benefits; an explicit check (e.g., goal-distribution comparison or ablation on biased vs. unbiased relabeling) is needed.
Authors: We agree that the existing noise metrics do not fully address potential systematic bias in goal distributions. In the revised manuscript, we will add (1) a comparison of goal distributions (e.g., length, frequency, complexity metrics) between original and relabeled goals on both datasets, and (2) an ablation comparing performance with and without filtering for potentially biased relabelings (e.g., via length or rarity thresholds). These will be included in Section 4.3 with supporting figures. revision: yes
-
Referee: Statistical significance tests for the reported improvements (+7.6-11.4% and +3.0-6.2%) are absent, and full experimental details plus exact relabeling prompts are not provided. This weakens confidence in the robustness of the gains under the task-disjoint protocol.
Authors: We acknowledge the omission. In the revision, we will add statistical significance tests (paired t-tests and bootstrap 95% confidence intervals) for all reported gains. Full experimental details (hyperparameters, training protocols, task-disjoint split construction) and the exact relabeling prompts (including the four-stage pipeline templates) will be provided in a new Appendix C. The task-disjoint protocol description will also be expanded for clarity. revision: yes
Circularity Check
No circularity: empirical pipeline evaluated on held-out data
full rationale
The paper describes a four-stage empirical pipeline (failure classification, outcome extraction, LLM-guided relabeling with gating, data packaging) for converting failed trajectories into training data. All reported gains (+7.6-11.4% over success-only SFT, 2x efficiency, +3.0-6.2% over baselines) are measured on strict task-disjoint held-out splits of WebArena and ToolBench across multiple model families. No equations, fitted parameters, or derivations are presented that reduce the claimed improvements to quantities constructed from the same data used for evaluation. External baselines and human precision metrics (96-97%) provide independent verification. The method adapts standard HER ideas without self-referential definitions or load-bearing self-citations that collapse the result to its inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- confidence threshold for gating
axioms (1)
- domain assumption A trajectory that fails goal A is often a correct demonstration for an achievable alternative goal B
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
AgentHER realises this through a four-stage pipeline (failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.