HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Hierarchical reinforcement learning (HRL) enables agents to solve complex, long-horizon tasks by decomposing them into manageable sub-tasks. However, HRL methods face two fundamental challenges: (i) non-stationarity caused by the evolving lower-level policy during training, which destabilizes higher-level learning, and (ii) the generation of infeasible subgoals that lower-level policies cannot achieve. To address these challenges, we introduce DIPPER, a novel HRL framework that formulates goal-conditioned HRL as a bi-level optimization problem and leverages direct preference optimization (DPO) to train the higher-level policy. By learning from stationary preference comparisons over subgoal sequences rather than rewards that depend on the evolving lower-level policy, DIPPER mitigates the impact of non-stationarity on hierarchical learning. To address infeasible subgoals, DIPPER incorporates lower-level value function regularization that encourages the higher-level policy to propose achievable subgoals. We also introduce two novel metrics to quantitatively verify that DIPPER mitigates non-stationarity and infeasible subgoal generation issues in HRL. We perform empirical evaluations on challenging robotic navigation and manipulation benchmarks and show that DIPPER achieves upto 40% improvements over state-of-the-art baselines, demonstrating that preference-based methods can effectively alleviate persistent challenges in hierarchical
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.