Bandit algorithms can be adapted to Tree MDPs by treating policies as arms with shared-data confidence bounds, achieving polynomial memory and instance-dependent bounds on sample complexity and regret that depend on terminal-state gaps rather than all policies.
3 Pith papers cite this work.
Citing papers
- On-line Learning in Tree MDPs by Treating Policies as Bandit Arms
  Bandit algorithms can be adapted to Tree MDPs by treating policies as arms with shared-data confidence bounds, achieving polynomial memory and instance-dependent bounds on sample complexity and regret that depend on terminal-state gaps rather than all policies.
- SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
  SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
- Reinforcement Learning for Robotic Time-optimal Path Tracking Using Prior Knowledge
  An improved Q-learning algorithm with a modified action-value function and reward-penalty scheme generates time-optimal robot trajectories that respect velocity-dependent piecewise-linear torque constraints.