hub Mixed citations

Group-in-Group Policy Optimization for LLM Agent Training

Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An · 2025 · cs.LG · arXiv 2505.10978

Mixed citation behavior. Most common role is background (68%).

76 Pith papers citing it

Background 68% of classified citations

open full Pith review browse 76 citing papers arXiv PDF

abstract

Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks, using Qwen2.5-1.5B/3B/7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals, achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over GRPO, and obtains superior performance on QA tasks (42.1% on 3B and 47.2% on 7B): all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 18 baseline 3 method 1

citation-polarity summary

background 15 baseline 3 unclear 3 use method 1

claims ledger

abstract Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preservin

co-cited works

representative citing papers

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

ReSum trains LLMs via RLVR to self-summarize reasoning trajectories, yielding 4% average performance gains and 18.6% shorter rollouts through contrastive rollout branches.

Co-Evolving Skill Generation and Policy Optimization

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

Reasoning models naturally compress context via thinking traces, with reward-constrained optimization yielding 17-23% gains over baselines on long-context QA at high compression ratios.

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

cs.CL · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

cs.AI · 2026-04-15 · unverdicted · novelty 7.0

RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

cs.AI · 2026-04-08 · unverdicted · novelty 7.0

PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Gen-Searcher: Reinforcing Agentic Search for Image Generation

cs.CV · 2026-03-30 · unverdicted · novelty 7.0 · 2 refs

Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.

Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

cs.LG · 2026-07-01 · unverdicted · novelty 6.0 · 2 refs

A single middle transformer layer trained in isolation recovers most RL post-training gains in LLMs, with gains concentrated in middle layers across models, algorithms, and tasks.

ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents

cs.AI · 2026-06-26 · unverdicted · novelty 6.0

ATOD anneals from on-policy distillation to RL with turn-level reweighting to improve multi-turn agent success rates on ALFWorld, WebShop, and Search-QA.

APPO: Agentic Procedural Policy Optimization

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

APPO refines branching and credit assignment in agentic RL via a Branching Score and procedure-level scaling, improving baselines by nearly 4 points on 13 benchmarks.

SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

SALT is a subspace-adaptive plug-in for GRPO that decomposes group-relative coefficients into shared and residual channels using mini-batch Gram geometry and amplifies residuals to mitigate signed cancellation in RLVR.

Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

cs.AI · 2026-06-02 · unverdicted · novelty 6.0 · 2 refs

TBS is an interval-based multi-agent LLM simulation framework that separates structured internal evaluative states from public utterance generation and shows these states vary systematically with turn-allocation, silence, and memory conditions.

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

COMAP co-evolves textual world models and agent policies for LLMs through on-policy self-distillation, yielding up to 16.75% relative gains on embodied planning, web navigation, and tool-use tasks.

AgentOdyssey: Open-Ended Long-Horizon Text Game Generation for Test-Time Continual Learning Agents

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

Introduces AgentOdyssey, a procedural generator of open-ended long-horizon text games, to evaluate test-time continual learning agents and diagnose limits in exploration, memory, and planning.

Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

AKBE uses dual-path (with-tool and no-tool) rollouts during agentic RL training to categorize trajectories and supply targeted signals that raise average QA accuracy by 1.85 while cutting tool calls 18% and raising tool productivity 25%.

RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

RICE-PO is a policy optimization framework that converts retrieval interactions into credit signals for latent reasoning steps in agents by selecting high-uncertainty actions as anchors and propagating credit based on influence strength and residual stability, outperforming baselines on BRIGHT and B

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.

citing papers explorer

Showing 19 of 69 citing papers after filters.

From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification cs.AI · 2026-06-11 · unverdicted · none · ref 3 · internal anchor
ProFact applies agentic RL with process-aware rewards to jointly optimize claim decomposition, evidence retrieval, and verdict prediction in multi-stage fact verification.
Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution cs.AI · 2026-06-09 · unverdicted · none · ref 1 · internal anchor
Role-Agent bootstraps LLM agents via dual roles (World-In-Agent for process rewards from state prediction and Agent-In-World for failure-driven task retrieval), reporting over 4% average gains on benchmarks.
SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training cs.AI · 2026-06-01 · unverdicted · none · ref 21 · internal anchor
SIRI trains LLM agents to discover, validate, and internalize reusable skills from their own rollouts without external generators or inference-time skill banks, yielding gains on ALFWorld and WebShop.
Trust Region On-Policy Distillation cs.LG · 2026-05-31 · unverdicted · none · ref 117 · internal anchor
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering cs.CL · 2026-05-17 · unverdicted · none · ref 2 · internal anchor
AMATA is an adaptive multi-agent trajectory alignment system that improves factual consistency in knowledge-intensive QA via intra-trajectory preference learning and inter-agent dependency optimization.
Look Before You Leap: Autonomous Exploration for LLM Agents cs.AI · 2026-05-15 · unverdicted · none · ref 8 · internal anchor
LLM agents improve adaptability by first using an interaction budget for systematic exploration measured via Exploration Checkpoint Coverage before executing tasks.
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction cs.CL · 2026-05-07 · unverdicted · none · ref 56 · internal anchor
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length cs.AI · 2026-05-04 · unverdicted · none · ref 52 · internal anchor
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants cs.AI · 2026-04-30 · unverdicted · none · ref 25 · internal anchor
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents cs.AI · 2026-04-25 · unverdicted · none · ref 7 · internal anchor
AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.
Environmental Understanding Vision-Language Model for Embodied Agent cs.CV · 2026-04-21 · unverdicted · none · ref 13 · internal anchor
EUEA fine-tunes VLMs on object perception, task planning, action understanding and goal recognition, with recovery and GRPO, to raise ALFRED success rates by 11.89% over behavior cloning.
Seeing Isn't Believing: Mitigating Belief Inertia via Active Intervention in Embodied Agents cs.CL · 2026-04-19 · unverdicted · none · ref 1 · internal anchor
The Estimate-Verify-Update (EVU) mechanism reduces belief inertia in embodied agents and raises task success rates on three benchmarks.
From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models cs.CL · 2026-04-10 · unverdicted · none · ref 6 · internal anchor
A survey of credit assignment techniques in LLM reinforcement learning that distinguishes maturing methods for reasoning from new approaches needed for agentic settings and provides supporting resources.
StaRPO: Stability-Augmented Reinforcement Policy Optimization cs.AI · 2026-04-10 · unverdicted · none · ref 9 · internal anchor
StaRPO improves LLM reasoning by adding autocorrelation function and path efficiency stability metrics to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.
SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents cs.AI · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
SEARL uses a tool graph memory that integrates planning and execution to densify rewards and improve generalization in self-evolving agents on knowledge and math tasks.
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning cs.RO · 2026-04-09 · unverdicted · none · ref 25 · internal anchor
RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.
Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning cs.LG · 2026-06-08 · unverdicted · none · ref 5 · internal anchor
Claw-R1 provides a Gateway Server and Data Pool to manage step-level agent interaction traces as structured data assets for agentic RL training.
Darwin Mobile Agent: A Roadmap for Self-Evolution cs.AI · 2026-05-26 · unverdicted · none · ref 4 · internal anchor
Introduces an open-source mobile GUI agent training framework and a roadmap for autonomous self-evolution via removal of human priors in three pillars.
StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning cs.CL · 2026-04-20 · unreviewed · ref 9 · internal anchor

Group-in-Group Policy Optimization for LLM Agent Training

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer