Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach

2024 · cs.LG · arXiv 2411.00361

1 Pith paper cites this work.

abstract

Hierarchical reinforcement learning (HRL) enables agents to solve complex, long-horizon tasks by decomposing them into manageable sub-tasks. However, HRL methods face two fundamental challenges: (i) non-stationarity caused by the evolving lower-level policy during training, which destabilizes higher-level learning, and (ii) the generation of infeasible subgoals that lower-level policies cannot achieve. To address these challenges, we introduce DIPPER, a novel HRL framework that formulates goal-conditioned HRL as a bi-level optimization problem and leverages direct preference optimization (DPO) to train the higher-level policy. By learning from stationary preference comparisons over subgoal sequences rather than rewards that depend on the evolving lower-level policy, DIPPER mitigates the impact of non-stationarity on hierarchical learning. To address infeasible subgoals, DIPPER incorporates lower-level value function regularization that encourages the higher-level policy to propose achievable subgoals. We also introduce two novel metrics to quantitatively verify that DIPPER mitigates non-stationarity and infeasible subgoal generation issues in HRL. We perform empirical evaluations on challenging robotic navigation and manipulation benchmarks and show that DIPPER achieves up to 40% improvements over state-of-the-art baselines, demonstrating that preference-based methods can effectively alleviate persistent challenges in hierarchical

representative citing papers

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.

citing papers explorer

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs cs.AI · 2026-04-22 · unverdicted · none · ref 13 · internal anchor
