Archer: Training language model agents via hierarchical multi-turn rl

Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, Aviral Kumar · 2024 · arXiv 2402.19446

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Unlocking Proactivity in Task-Oriented Dialogue

cs.AI · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Introduces a Cognitive User Simulator modeling stratified personas with hidden concerns and Simulator-Induced Asymmetric-View Policy Optimization to unlock proactive behavior in task-oriented dialogue agents.

When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

ECPO improves GiGPO by shrinking low-count action advantages and suppressing noisy anchor states, yielding +5.2/+7.3 success gains on ALFWorld/WebShop with Qwen2.5-1.5B models at negligible extra cost.

ECHO: Terminal Agents Learn World Models for Free

cs.LG · 2026-05-23 · unverdicted · novelty 6.0

ECHO is a hybrid RL objective that trains agents to predict environment observation tokens from their actions, doubling GRPO pass@1 on TerminalBench-2.0 while improving dynamics prediction on held-out trajectories.

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.

SOD: Step-wise On-policy Distillation for Small Language Model Agents

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

From History to State: Constant-Context Skill Learning for LLM Agents

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% on SciWorld with Qwen3-8B while reducing prompt tokens 2-7x.

From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails

cs.AI · 2025-10-15 · unverdicted · novelty 6.0

Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

cs.LG · 2026-06-20 · unverdicted · novelty 5.0

Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

cs.AI · 2026-06-01 · unverdicted · novelty 5.0

SIRI trains LLM agents to discover, validate, and internalize reusable skills from their own rollouts without external generators or inference-time skill banks, yielding gains on ALFWorld and WebShop.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

cs.CL · 2026-05-07 · unverdicted · novelty 5.0

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

cs.AI · 2025-01-16 · unverdicted · novelty 3.0

The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

citing papers explorer

Showing 13 of 13 citing papers.

Unlocking Proactivity in Task-Oriented Dialogue cs.AI · 2026-05-21 · unverdicted · none · ref 21 · 2 links
Introduces a Cognitive User Simulator modeling stratified personas with hidden concerns and Simulator-Induced Asymmetric-View Policy Optimization to unlock proactive behavior in task-oriented dialogue agents.
When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training cs.LG · 2026-06-04 · unverdicted · none · ref 23
ECPO improves GiGPO by shrinking low-count action advantages and suppressing noisy anchor states, yielding +5.2/+7.3 success gains on ALFWorld/WebShop with Qwen2.5-1.5B models at negligible extra cost.
ECHO: Terminal Agents Learn World Models for Free cs.LG · 2026-05-23 · unverdicted · none · ref 23
ECHO is a hybrid RL objective that trains agents to predict environment observation tokens from their actions, doubling GRPO pass@1 on TerminalBench-2.0 while improving dynamics prediction on held-out trajectories.
Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents cs.CL · 2026-05-19 · unverdicted · none · ref 45
ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.
SOD: Step-wise On-policy Distillation for Small Language Model Agents cs.CL · 2026-05-08 · unverdicted · none · ref 50
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
From History to State: Constant-Context Skill Learning for LLM Agents cs.AI · 2026-05-06 · unverdicted · none · ref 45
Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% on SciWorld with Qwen3-8B while reducing prompt tokens 2-7x.
From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails cs.AI · 2025-10-15 · unverdicted · none · ref 61
Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.
Training Language Models to Self-Correct via Reinforcement Learning cs.LG · 2024-09-19 · unverdicted · none · ref 45
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning cs.LG · 2026-06-20 · unverdicted · none · ref 279
Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.
SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training cs.AI · 2026-06-01 · unverdicted · none · ref 38
SIRI trains LLM agents to discover, validate, and internalize reusable skills from their own rollouts without external generators or inference-time skill banks, yielding gains on ALFWorld and WebShop.
Trust Region On-Policy Distillation cs.LG · 2026-05-31 · unverdicted · none · ref 146
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction cs.CL · 2026-05-07 · unverdicted · none · ref 53
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models cs.AI · 2025-01-16 · unverdicted · none · ref 200
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

Archer: Training language model agents via hierarchical multi-turn rl

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer