DIBS decouples learning the task policy (via RL) from learning the evolution function (via behavioral cloning). The paper reports more stable training and better generalization than prior RL and meta-RL methods for inductive generalization from specifications.
3 Pith papers cite this work, all from 2026.

Representative citing papers:
- Learning Gait-Aware Quadruped Locomotion with Temporal Logic Specifications: a framework that uses parameterized Signal Temporal Logic (STL) specifications to shape rewards for PPO-based RL, yielding tighter velocity tracking and more stable training than hand-crafted rewards on the Barkour quadruped in MuJoCo simulation.
- Iteratively refining unknown MDP parameters allows PAC conditions to be satisfied repeatedly, yielding asymptotic optimality for reachability specifications in RL.