Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL

Jinwoo Choi; Sang-Hyun Lee; Seung-Woo Seo

arXiv: 2602.03389 · v2 · pith:WLPR73BW · submitted 2026-02-03 · cs.LG · cs.AI

keywords: CoGHP, hierarchical, long-horizon, latent, offline, tasks, action, chain-of-goals

abstract

Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, leaving several structural limitations in long-horizon decision-making. To address these limitations, we draw inspiration from chain-of-thought reasoning and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this efficiently, we introduce an MLP-Mixer backbone, which supports cross-token communication and captures structural relationships among state, goal, latent subgoals, and action. Across challenging navigation and manipulation benchmarks, CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks. Project page: https://wlsdn9350.github.io/projects/coghp/
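The generation order described in the abstract (state and goal, then a chain of latent subgoals, then the primitive action, all processed by one MLP-Mixer-style network) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class names, the fixed number of subgoals, the assumption that goals live in the state space, the lower-triangular (causal) token-mixing matrix, the zero-placeholder decoding scheme, and the random untrained weights are all hypothetical choices made only to show the idea.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token (row) over its channels.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class CausalMixerBlock:
    """Hypothetical MLP-Mixer-style block. Token mixing uses a lower-triangular
    matrix so that token t only sees tokens <= t (an assumption for autoregression)."""

    def __init__(self, n_tokens, dim, hidden, rng):
        self.mask = np.tril(np.ones((n_tokens, n_tokens)))
        self.w_tok = rng.normal(0.0, 0.1, (n_tokens, n_tokens))
        self.w_c1 = rng.normal(0.0, 0.1, (dim, hidden))
        self.w_c2 = rng.normal(0.0, 0.1, (hidden, dim))

    def __call__(self, x):  # x: (n_tokens, dim)
        # Token mixing: each channel is mixed across the current and earlier tokens.
        x = x + (self.w_tok * self.mask) @ layer_norm(x)
        # Channel mixing: an MLP applied to each token independently.
        x = x + gelu(layer_norm(x) @ self.w_c1) @ self.w_c2
        return x

class ChainOfGoalsPolicy:
    """Hypothetical single network over tokens [state, goal, z_1..z_K, action]."""

    def __init__(self, state_dim, action_dim, n_subgoals=3, dim=32, n_blocks=2, seed=0):
        rng = np.random.default_rng(seed)
        self.k = n_subgoals
        self.n_tokens = 2 + n_subgoals + 1
        self.dim = dim
        self.embed_s = rng.normal(0.0, 0.1, (state_dim, dim))
        self.embed_g = rng.normal(0.0, 0.1, (state_dim, dim))  # assumes goal lives in state space
        self.blocks = [CausalMixerBlock(self.n_tokens, dim, 2 * dim, rng) for _ in range(n_blocks)]
        self.action_head = rng.normal(0.0, 0.1, (dim, action_dim))

    def forward(self, tokens):
        for block in self.blocks:
            tokens = block(tokens)
        return tokens

    def act(self, state, goal):
        # Unfilled slots start as zero placeholders; thanks to the causal mask they
        # never influence earlier slots.
        tokens = np.zeros((self.n_tokens, self.dim))
        tokens[0] = state @ self.embed_s
        tokens[1] = goal @ self.embed_g
        subgoals = []
        for k in range(self.k):
            pos = 2 + k
            # The output at slot pos, computed from state, goal and z_<k, becomes z_k.
            z = self.forward(tokens)[pos]
            tokens[pos] = z
            subgoals.append(z)
        # The action is read from the final slot, conditioned on the whole chain.
        action = np.tanh(self.forward(tokens)[-1] @ self.action_head)
        return np.stack(subgoals), action
```

For example, `ChainOfGoalsPolicy(state_dim=4, action_dim=2).act(np.ones(4), -np.ones(4))` returns three 32-dimensional latent subgoals and a 2-dimensional action in [-1, 1]. Masking the token-mixing matrix is only one simple way to make a mixer respect an autoregressive order; the paper may handle this differently, and a real policy would be trained on offline data rather than use random weights.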

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

Switching successor measures extend classical successor measures to enable hierarchical zero-shot RL via the FB π-Switch algorithm that extracts subgoal-selection and control policies from forward-backward representations.

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks
cs.RO 2026-06 unverdicted novelty 5.0

WorldDP combines a high-level object-centric world model for subgoal planning with a low-level diffusion policy for execution, claiming better performance than baselines on multi-stage robotic manipulation benchmarks.