{"work":{"id":"bc65d492-e6ba-4522-874c-43d2f4fc5191","openalex_id":null,"doi":null,"arxiv_id":"2505.10978","raw_key":null,"title":"Group-in-Group Policy Optimization for LLM Agent Training","authors":null,"authors_text":"Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An","year":2025,"venue":"cs.LG","abstract":"Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks, using Qwen2.5-1.5B/3B/7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals, achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over GRPO, and obtains superior performance on QA tasks (42.1% on 3B and 47.2% on 7B): all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.","external_url":"https://arxiv.org/abs/2505.10978","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-22T21:42:10.657233+00:00","pith_arxiv_id":"2505.10978","created_at":"2026-05-09T05:45:22.204702+00:00","updated_at":"2026-05-22T21:42:10.657233+00:00","title_quality_ok":true,"display_title":"Group-in-Group Policy Optimization for LLM Agent Training","render_title":"Group-in-Group Policy Optimization for LLM Agent Training"},"hub":{"state":{"work_id":"bc65d492-e6ba-4522-874c-43d2f4fc5191","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":54,"external_cited_by_count":null,"distinct_field_count":6,"first_pith_cited_at":"2025-03-31T18:00:29+00:00","last_pith_cited_at":"2026-05-19T16:19:29+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-23T06:34:06.552621+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":18},{"context_role":"baseline","n":3},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":15},{"context_polarity":"baseline","n":3},{"context_polarity":"unclear","n":3},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T17:48:54.718519+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":25},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":17},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":17},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":13},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":13},{"title":"ALFWorld: Aligning Text and Embodied Environments for Interactive Learning","work_id":"fa436f46-ec0a-4d2e-a0ff-e697def4a7be","shared_citers":12},{"title":"Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning","work_id":"0e0b7549-2bc4-4574-aa7f-588ffa16eaae","shared_citers":9},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":8},{"title":"RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning","work_id":"b96383ee-f8dc-471f-aba4-bc5ce9b0b632","shared_citers":8},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":7},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":7},{"title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","work_id":"87909127-da20-4ccc-8ae3-4a4a20ef81b7","shared_citers":7},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":6},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":6},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":6},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":5},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":5},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":5},{"title":"AgentGym-RL: Training","work_id":"acf16fe3-7cbe-4d5f-a8f8-24f6c14fb557","shared_citers":4},{"title":"Agentic reinforced policy optimization","work_id":"6cb0d241-5e4d-44ed-9476-659819ec0681","shared_citers":4},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":4},{"title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","work_id":"d854765a-e664-41c0-8655-21c4bf2e0cc4","shared_citers":4},{"title":"HybridFlow: A Flexible and Efficient RLHF Framework","work_id":"7eb9c9f4-b322-4bba-8011-09ff8d6ad801","shared_citers":4}],"time_series":[{"n":38,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T17:49:14.258974+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T17:48:43.145932+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Group-in-Group Policy Optimization for LLM Agent Training","claims":[{"claim_text":"Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preservin","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Group-in-Group Policy Optimization for LLM Agent Training because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T17:49:17.508229+00:00"}},"summary":{"title":"Group-in-Group Policy Optimization for LLM Agent Training","claims":[{"claim_text":"Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preservin","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Group-in-Group Policy Optimization for LLM Agent Training because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":25},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":17},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":17},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":13},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":13},{"title":"ALFWorld: Aligning Text and Embodied Environments for Interactive Learning","work_id":"fa436f46-ec0a-4d2e-a0ff-e697def4a7be","shared_citers":12},{"title":"Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning","work_id":"0e0b7549-2bc4-4574-aa7f-588ffa16eaae","shared_citers":9},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":8},{"title":"RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning","work_id":"b96383ee-f8dc-471f-aba4-bc5ce9b0b632","shared_citers":8},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":7},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":7},{"title":"The Landscape of Agentic Reinforcement Learning for LLMs: A Survey","work_id":"87909127-da20-4ccc-8ae3-4a4a20ef81b7","shared_citers":7},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":6},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":6},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":6},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":5},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":5},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":5},{"title":"AgentGym-RL: Training","work_id":"acf16fe3-7cbe-4d5f-a8f8-24f6c14fb557","shared_citers":4},{"title":"Agentic reinforced policy optimization","work_id":"6cb0d241-5e4d-44ed-9476-659819ec0681","shared_citers":4},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":4},{"title":"Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?","work_id":"d854765a-e664-41c0-8655-21c4bf2e0cc4","shared_citers":4},{"title":"HybridFlow: A Flexible and Efficient RLHF Framework","work_id":"7eb9c9f4-b322-4bba-8011-09ff8d6ad801","shared_citers":4}],"time_series":[{"n":38,"year":2026}],"dependency_candidates":[]},"authors":[]}}