Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

Benteng Chen; Chaoda Song; Dinggen Zhang; Qingtao Pan; Qinjian Zhao; Shufei Zhang; Sumon Biswas; Towsif Raiyan; Weida Wang; Yang Ouyang

arxiv: 2510.01833 · v2 · pith:JDC3WCA4new · submitted 2025-10-02 · 💻 cs.AI · cs.CL

Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Chaoda Song, Zhiqiang Gao, Shufei Zhang, Sumon Biswas

keywords: reasoning, guidance, high-level, learning, models, planning, pta-grpo, reinforcement

Large language models (LLMs) demonstrate strong reasoning abilities via Chain-of-Thought (CoT), but their token-level generation encourages local decisions and lacks global planning, often leading to redundant or inaccurate reasoning. Existing methods, such as tree-based search and reinforcement learning (RL), attempt to address this issue but incur high computational costs and still struggle to produce reliable reasoning trajectories. To address these challenges, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to jointly improve high-level planning and fine-grained CoT reasoning. Specifically, in the first stage, a given LLM is responsible for summarizing CoT reasoning into compact high-level guidance, which is then leveraged for supervised fine-tuning. Then, we introduce a guidance-aware reinforcement learning method that jointly optimizes the final output and the quality of guidance, enhancing reasoning effectiveness. We evaluate PTA-GRPO on ten reasoning benchmarks across mathematics and natural sciences, using five diverse base models spanning multiple data modalities. The results show that PTA-GRPO consistently delivers significant improvements across models and tasks, demonstrating strong effectiveness and generalization.
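The guidance-aware RL stage described above can be illustrated with a minimal sketch. The group-relative normalisation is the standard idea behind Group Relative Policy Optimization: each sampled response is scored against the mean and standard deviation of its own group. Everything else here is an assumption for illustration only. The abstract does not state how the final-output reward and the guidance-quality signal are combined, so `combined_reward`, `guidance_weight`, and the weighted-sum form are hypothetical and should not be read as the authors' method.

```python
from statistics import mean, pstdev


def combined_reward(outcome_reward: float,
                    guidance_score: float,
                    guidance_weight: float = 0.5) -> float:
    """Hypothetical reward mixing answer correctness and guidance quality.

    ASSUMPTION: a simple weighted sum. The abstract only says both signals
    are optimized jointly; it does not give the actual formula.
    """
    return outcome_reward + guidance_weight * guidance_score


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    ASSUMPTION: population standard deviation; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: (answer correct?, guidance score).
# The numbers are made up for illustration.
samples = [(1.0, 0.8), (0.0, 0.6), (1.0, 0.2), (0.0, 0.1)]
rewards = [combined_reward(outcome, guidance) for outcome, guidance in samples]
advantages = group_relative_advantages(rewards)
```

Under this sketch, responses that are both correct and well planned receive the largest positive advantage within their group, while a group whose rewards are all identical yields zero advantage for every member.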

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.