'Indifference' methods for managing agent rewards

2017 · cs.AI · arXiv 1712.06365

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

'Indifference' refers to a class of methods used to control reward-based agents. Indifference techniques aim to achieve one or more of three distinct goals: rewards dependent on certain events (without the agent being motivated to manipulate the probability of those events), effective disbelief (where agents behave as if particular events could never happen), and seamless transition from one reward function to another (with the agent acting as if this change is unanticipated). This paper presents several methods for achieving these goals in the POMDP setting, establishing their uses, strengths, and requirements. These methods of control work even when the implications of the agent's reward are otherwise not fully understood.

representative citing papers

Towards Shutdownable Agents via Stochastic Choice

cs.AI · 2024-06-30 · conditional · novelty 6.0

Gridworld agents trained with DReST reward functions learn to be USEFUL at tasks conditional on trajectory length and NEUTRAL across lengths, supplying initial evidence that the method could produce shutdownable agents.
