arxiv: 2603.21357 · v4 · submitted 2026-03-22 · 💻 cs.AI · cs.CL

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

Pith reviewed 2026-05-15 06:46 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords LLM agents · hindsight experience replay · trajectory relabeling · WebArena · ToolBench · supervised fine-tuning · preference optimization

The pith

AgentHER adapts hindsight experience replay to turn failed LLM agent trajectories into useful training data by relabeling them with alternative achievable goals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that failed agent trajectories often contain recoverable signal: a trajectory that misses its intended goal frequently succeeds at some other feasible goal. By converting those failures into demonstrations for the alternative goals, the method expands the usable training set beyond success-only data. The authors report gains in final agent performance and sample efficiency on held-out tasks from WebArena and ToolBench. The approach relies on a staged pipeline that classifies failures, extracts outcomes, and applies confidence-gated LLM relabeling to keep the added examples clean.

Core claim

AgentHER recovers the signal lost when failed trajectories are discarded by adapting Hindsight Experience Replay to natural-language agent trajectories: a trajectory that fails goal A is often a correct demonstration for an achievable alternative goal B. A four-stage pipeline (failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging) converts discarded failures into SFT, DPO, and ShareGPT training data. The paper reports gains of 7.6-11.4% over success-only SFT, 2x sample efficiency across model families, and a 3.0-6.2% margin over the strongest baseline.
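The four stages named above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Trajectory` type, the failure categories, the SFT record layout, and the `toy_relabeler` function (standing in for the LLM relabeler) are all hypothetical, and the 0.8 confidence threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trajectory:
    goal: str            # the goal the agent was asked to achieve
    steps: list[str]     # actions taken, in order
    final_state: str     # description of the environment at the end
    success: bool        # whether the stated goal was met


# Hypothetical relabeler signature: (outcome, steps) -> (new goal, confidence)
Relabeler = Callable[[str, list[str]], tuple[str, float]]


def classify_failure(traj: Trajectory) -> str:
    """Stage 1 (assumed categories): is this failure worth relabeling?"""
    if traj.success:
        return "success"
    if not traj.steps:
        return "unrecoverable"  # nothing was done, so nothing to relabel
    return "recoverable"


def extract_outcome(traj: Trajectory) -> str:
    """Stage 2: summarise what the trajectory actually accomplished."""
    return traj.final_state


def relabel(traj: Trajectory, propose_goal: Relabeler,
            threshold: float = 0.8) -> Optional[Trajectory]:
    """Stage 3: propose a hindsight goal; keep it only above the threshold."""
    new_goal, confidence = propose_goal(extract_outcome(traj), traj.steps)
    if confidence < threshold:
        return None
    return Trajectory(new_goal, traj.steps, traj.final_state, success=True)


def package_sft(traj: Trajectory) -> dict:
    """Stage 4: emit a prompt/completion record (assumed layout)."""
    return {"prompt": traj.goal, "completion": "\n".join(traj.steps)}


def run_pipeline(trajs: list[Trajectory], propose_goal: Relabeler,
                 threshold: float = 0.8) -> list[dict]:
    records = []
    for traj in trajs:
        kind = classify_failure(traj)
        if kind == "success":
            records.append(package_sft(traj))
        elif kind == "recoverable":
            relabeled = relabel(traj, propose_goal, threshold)
            if relabeled is not None:
                records.append(package_sft(relabeled))
    return records


def toy_relabeler(outcome: str, steps: list[str]) -> tuple[str, float]:
    # Stand-in for an LLM call: longer trajectories get higher confidence.
    confidence = 0.9 if len(steps) >= 2 else 0.5
    return f"Reach the state: {outcome}", confidence


failed = Trajectory(
    goal="Buy the cheapest red mug",
    steps=["search 'mug'", "open item 3", "add to cart"],
    final_state="blue mug in cart",
    success=False,
)
too_short = Trajectory(
    goal="Find the return policy",
    steps=["open homepage"],
    final_state="homepage shown",
    success=False,
)
records = run_pipeline([failed, too_short], toy_relabeler)
print(records)
```

In this toy run the first failure is relabeled and kept, while the second is rejected by the confidence gate, so only one record is produced.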

What carries the argument

Four-stage pipeline that classifies failures, extracts outcomes, performs LLM-guided relabeling with confidence gating and cross-model verification, then packages the results as training data.
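One way to read "confidence gating and cross-model verification" is as an agreement rule: a relabeled goal is kept only when every judge independently accepts it with enough confidence. The sketch below assumes that reading. The `Verdict` type, the unanimity rule, and the 0.8 threshold are illustrative assumptions, and the two judges are plain functions standing in for calls to two different models.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Verdict:
    valid: bool          # does the judge think the trajectory achieves the goal?
    confidence: float    # judge's self-reported confidence in [0, 1]


Judge = Callable[[str, Sequence[str]], Verdict]


def accept_relabel(goal: str, steps: Sequence[str],
                   judges: Sequence[Judge], threshold: float = 0.8) -> bool:
    """Keep a relabeled goal only if every judge accepts it confidently."""
    verdicts = [judge(goal, steps) for judge in judges]
    return all(v.valid and v.confidence >= threshold for v in verdicts)


def lenient_judge(goal: str, steps: Sequence[str]) -> Verdict:
    # Toy stand-in for a first model: accepts everything.
    return Verdict(valid=True, confidence=0.95)


def overlap_judge(goal: str, steps: Sequence[str]) -> Verdict:
    # Toy stand-in for a second model: at least half of the goal's words
    # must appear in the final step for the goal to count as supported.
    goal_words = goal.lower().split()
    last_step_words = set(steps[-1].lower().split())
    shared = sum(1 for word in goal_words if word in last_step_words)
    return Verdict(valid=shared * 2 >= len(goal_words), confidence=0.9)


steps = ["search 'mug'", "add blue mug to cart"]
judges = [lenient_judge, overlap_judge]
kept = accept_relabel("Add a blue mug to the cart", steps, judges)
dropped = accept_relabel("Book a flight", steps, judges)
print(kept, dropped)
```

Requiring unanimity is the strictest possible rule; a majority vote would keep more data at the cost of more label noise.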

If this is right

  • Agents trained on the expanded dataset reach higher success rates on held-out WebArena and ToolBench tasks than those trained only on successes.
  • The method achieves roughly twice the sample efficiency of success-only training.
  • Performance exceeds the strongest prior experience-centric baseline (Agent Workflow Memory) by 3.0-6.2% across four model families.
  • The two robustness mechanisms cut label noise from 5.9% to 2.9% and raise human-rated relabeling precision to 96.0-97.1%.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same relabeling idea could be applied to other sequential tasks where partial progress toward one goal constitutes success at another.
  • Integrating the generated data into online reinforcement learning loops might further reduce the need for human demonstrations.
  • The low reported per-trajectory cost suggests the pipeline can scale to much larger trajectory collections without prohibitive overhead.

Load-bearing premise

LLM-guided relabeling with confidence gating and cross-model verification produces alternative goals that are both valid and useful for downstream training without systematic bias or noise that would degrade model performance.

What would settle it

Training agents on the relabeled dataset and observing no improvement or a drop in success rate on the strict task-disjoint held-out sets relative to success-only training.

Original abstract

LLM-agent training pipelines routinely discard failed trajectories even though GPT-4o achieves only 14-20% on WebArena and below 55% pass@1 on ToolBench; even specialised systems at 50-65% leave the majority of trajectories unused. We introduce AgentHER, which recovers this lost signal by adapting Hindsight Experience Replay (HER) to natural-language agent trajectories: a trajectory that fails goal A is often a correct demonstration for an achievable alternative goal B. AgentHER realises this through a four-stage pipeline (failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging) that converts discarded failures into SFT, DPO, and ShareGPT training data. On WebArena and ToolBench under a strict task-disjoint held-out protocol, AgentHER improves over success-only SFT by +7.6-11.4% across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), achieves 2x sample efficiency, and beats the strongest experience-centric baseline (Agent Workflow Memory) by +3.0-6.2%. Two robustness mechanisms, failure-severity weighting and cross-model multi-judge verification (gpt-4o-mini paired with Qwen2.5-72B-Instruct), reduce label noise from 5.9% to 2.9% and raise human-rated relabeling precision to 97.1% on WebArena and 96.0% on ToolBench. A full system-cost audit shows the entire relabeling pipeline costs $2.98 and 26 wall-clock minutes for 3,000 trajectories, i.e. $1.4 x 10^-3 per accepted pair. Code: https://github.com/alphadl/AgentHER
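The cost figures in the abstract can be cross-checked with a little arithmetic. The snippet below uses only the numbers quoted above (a total cost of 2.98, 26 wall-clock minutes, 3,000 trajectories, and 1.4 x 10^-3 per accepted pair). The implied number of accepted pairs is derived here rather than reported in the abstract. It is approximate because the per-pair figure is rounded, and reading it as an acceptance rate assumes at most one accepted pair per input trajectory.

```python
total_cost = 2.98               # total pipeline cost, as quoted
wall_clock_minutes = 26         # total wall-clock time, as quoted
n_trajectories = 3_000          # trajectories processed, as quoted
cost_per_accepted_pair = 1.4e-3 # cost per accepted pair, as quoted (rounded)

# Average cost and time per processed trajectory.
cost_per_trajectory = total_cost / n_trajectories
seconds_per_trajectory = wall_clock_minutes * 60 / n_trajectories

# Derived, not reported: how many accepted pairs the two cost figures imply.
implied_accepted_pairs = total_cost / cost_per_accepted_pair
implied_acceptance_rate = implied_accepted_pairs / n_trajectories

print(f"cost per trajectory:     {cost_per_trajectory:.6f}")
print(f"seconds per trajectory:  {seconds_per_trajectory:.2f}")
print(f"implied accepted pairs:  {implied_accepted_pairs:.0f}")
print(f"implied acceptance rate: {implied_acceptance_rate:.2f}")
```

Under these assumptions the two cost figures are mutually consistent if roughly 2,100 of the 3,000 trajectories yield an accepted pair.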

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgentHER, adapting Hindsight Experience Replay to LLM agent trajectories via a four-stage pipeline (failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging) that converts failed trajectories into SFT/DPO/ShareGPT data. Under task-disjoint held-out splits on WebArena and ToolBench, it reports +7.6-11.4% gains over success-only SFT across four model families, 2x sample efficiency, and +3.0-6.2% over the strongest baseline, with robustness mechanisms (failure-severity weighting, cross-model verification) reducing noise from 5.9% to 2.9% and achieving 96-97% human precision at low cost.

Significance. If the relabeling produces valid and unbiased demonstrations, this could substantially improve data efficiency for LLM agent training by salvaging the majority of trajectories that are failures. The concrete gains, sample-efficiency results, cost audit, and robustness checks (noise reduction, human precision) are strengths that would make the work practically relevant if the central assumption holds.

major comments (2)
  1. [Relabeling pipeline and experimental results] The cross-model verification and human precision metrics (96-97%) reduce reported label noise to 2.9%, but do not rule out systematic distributional bias in the LLM-relabeled goals (e.g., preference for simpler or more frequent alternatives). This is load-bearing for the central claim, as such bias could produce the observed gains on held-out tasks without reflecting true hindsight benefits; an explicit check (e.g., goal-distribution comparison or ablation on biased vs. unbiased relabeling) is needed.
  2. [Experimental evaluation] Statistical significance tests for the reported improvements (+7.6-11.4% and +3.0-6.2%) are absent, and full experimental details plus exact relabeling prompts are not provided. This weakens confidence in the robustness of the gains under the task-disjoint protocol.
minor comments (2)
  1. [Methods] Clarify the exact implementation of failure-severity weighting within the data-packaging stage to make the robustness mechanisms fully reproducible.
  2. [Appendix or main text] The code link is provided, but including a small example of a relabeled trajectory pair in the main text or appendix would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's practical relevance. We address each major comment below and will incorporate revisions to strengthen the manuscript.

  1. Referee: The cross-model verification and human precision metrics (96-97%) reduce reported label noise to 2.9%, but do not rule out systematic distributional bias in the LLM-relabeled goals (e.g., preference for simpler or more frequent alternatives). This is load-bearing for the central claim, as such bias could produce the observed gains on held-out tasks without reflecting true hindsight benefits; an explicit check (e.g., goal-distribution comparison or ablation on biased vs. unbiased relabeling) is needed.

    Authors: We agree that the existing noise metrics do not fully address potential systematic bias in goal distributions. In the revised manuscript, we will add (1) a comparison of goal distributions (e.g., length, frequency, complexity metrics) between original and relabeled goals on both datasets, and (2) an ablation comparing performance with and without filtering for potentially biased relabelings (e.g., via length or rarity thresholds). These will be included in Section 4.3 with supporting figures. revision: yes
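A goal-distribution comparison of the kind promised here could start from something as simple as the sketch below, which bins goals by word count and reports the total variation distance between the original and relabeled histograms. The binning, the choice of total variation distance, and the toy goals are assumptions made for illustration; none of them come from the paper.

```python
from collections import Counter


def length_histogram(goals: list[str], bin_width: int = 5) -> dict[int, float]:
    """Normalised histogram of goal lengths (in words), bucketed by bin_width."""
    counts = Counter(len(goal.split()) // bin_width for goal in goals)
    total = sum(counts.values())
    return {bucket: count / total for bucket, count in counts.items()}


def total_variation(p: dict[int, float], q: dict[int, float]) -> float:
    """Total variation distance between two discrete distributions (0 to 1)."""
    buckets = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in buckets)


# Toy goals, invented for this example.
original_goals = [
    "Find the cheapest red ceramic mug and add it to the cart",  # 12 words
    "Update the shipping address on my most recent order",       # 9 words
]
relabeled_goals = [
    "Add a blue mug to the cart",   # 7 words
    "Open the order history page",  # 5 words
]

tv = total_variation(
    length_histogram(original_goals),
    length_histogram(relabeled_goals),
)
print(f"total variation distance on goal length: {tv:.2f}")
```

A distance near 0 would suggest relabeled goals are about as long as the originals, while a large value would be a first hint of the drift toward simpler goals that the referee raises. Word count is only a crude proxy for complexity.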

  2. Referee: Statistical significance tests for the reported improvements (+7.6-11.4% and +3.0-6.2%) are absent, and full experimental details plus exact relabeling prompts are not provided. This weakens confidence in the robustness of the gains under the task-disjoint protocol.

    Authors: We acknowledge the omission. In the revision, we will add statistical significance tests (paired t-tests and bootstrap 95% confidence intervals) for all reported gains. Full experimental details (hyperparameters, training protocols, task-disjoint split construction) and the exact relabeling prompts (including the four-stage pipeline templates) will be provided in a new Appendix C. The task-disjoint protocol description will also be expanded for clarity. revision: yes
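For readers unfamiliar with the promised analysis, a paired bootstrap confidence interval over per-task outcomes can be sketched as follows. The per-task 0/1 outcomes are toy values, not results from the paper, and the function name, resample count, and seed are arbitrary choices.

```python
import random


def paired_bootstrap_ci(baseline: list[int], treatment: list[int],
                        n_resamples: int = 10_000, alpha: float = 0.05,
                        seed: int = 0) -> tuple[float, float, float]:
    """Mean per-task gain with a percentile bootstrap interval.

    Tasks are resampled with replacement, and each resampled task keeps
    its (baseline, treatment) pair together, which is what makes it paired.
    """
    if len(baseline) != len(treatment):
        raise ValueError("both systems must be scored on the same tasks")
    rng = random.Random(seed)
    n = len(baseline)
    diffs = [t - b for b, t in zip(baseline, treatment)]
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Toy per-task success indicators on the same 20 held-out tasks.
baseline = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
treatment = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1]

mean_gain, lower, upper = paired_bootstrap_ci(baseline, treatment)
print(f"mean gain {mean_gain:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

If the interval excludes zero, the gain is unlikely to be an artifact of which tasks happened to land in the held-out set.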

Circularity Check

0 steps flagged

No circularity: empirical pipeline evaluated on held-out data

The paper describes a four-stage empirical pipeline (failure classification, outcome extraction, LLM-guided relabeling with gating, data packaging) for converting failed trajectories into training data. All reported gains (+7.6-11.4% over success-only SFT, 2x efficiency, +3.0-6.2% over baselines) are measured on strict task-disjoint held-out splits of WebArena and ToolBench across multiple model families. No equations, fitted parameters, or derivations are presented that reduce the claimed improvements to quantities constructed from the same data used for evaluation. External baselines and human precision metrics (96-97%) provide independent verification. The method adapts standard HER ideas without self-referential definitions or load-bearing self-citations that collapse the result to its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that failed trajectories contain recoverable signal for alternative goals and that LLM relabeling can extract it reliably. The abstract implies one free parameter, the confidence-gating threshold, without giving its value, and introduces no invented entities.

free parameters (1)
  • confidence threshold for gating
    Used to accept or reject relabeled goals; exact value and fitting procedure not stated in abstract.
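Because the gating threshold trades data volume against label quality, one natural way to choose it is to sweep candidate values on a small human-checked sample. The sketch below assumes each relabeled example carries a relabeler confidence and a human correctness label. The sample values and candidate thresholds are invented for illustration and say nothing about the value used in the paper.

```python
from typing import Optional


def sweep_thresholds(
    scored: list[tuple[float, bool]],
    thresholds: list[float],
) -> list[tuple[float, int, Optional[float]]]:
    """For each threshold, report how many examples survive and their precision.

    `scored` holds (relabeler confidence, human says the relabel is correct).
    Precision is None when no example passes the gate.
    """
    rows = []
    for threshold in thresholds:
        kept = [correct for conf, correct in scored if conf >= threshold]
        precision = sum(kept) / len(kept) if kept else None
        rows.append((threshold, len(kept), precision))
    return rows


# Hypothetical human-checked sample: (confidence, is_correct).
scored = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False),
    (0.70, True), (0.60, False), (0.40, False),
]

rows = sweep_thresholds(scored, [0.5, 0.75, 0.9])
for threshold, n_kept, precision in rows:
    print(threshold, n_kept, precision)
```

In this toy sample, raising the threshold from 0.5 to 0.9 shrinks the kept set from six examples to two while precision rises, which is the trade-off the threshold controls.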
axioms (1)
  • domain assumption: A trajectory that fails goal A is often a correct demonstration for an achievable alternative goal B
    Core premise of the HER adaptation for natural-language trajectories.
