Human preference scaling with demonstrations for deep reinforcement learning
2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: 2 unverdicted
Citing papers
- Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach
  DIPPER uses bi-level optimization and DPO to train the higher-level policy from stationary preference comparisons and value regularization, claiming up to 40% gains on robotic navigation and manipulation tasks while introducing metrics for non-stationarity and infeasible subgoals.
- Themis: An explainable AI-enabled framework for Reinforcement Learning with Human Feedback
  Themis is an XAI-enabled framework for RL from human feedback that supports 200+ environments and includes a scalable cloud platform for collecting human preferences.