A Game-Theoretic Analysis of the Off-Switch Game
The off-switch game is a game-theoretic model of a highly intelligent robot interacting with a human. In the original paper by Hadfield-Menell et al. (2016), the analysis is not fully game-theoretic: the human is modelled as an irrational player, and the robot's best action is calculated only under unrealistic normality and soft-max assumptions. In this paper, we make the analysis fully game-theoretic by modelling the human as a rational player with a random utility function. As a consequence, we can easily calculate the robot's best action for arbitrary belief and irrationality assumptions.
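To make the setting concrete, the following is a minimal sketch of the robot's decision in the off-switch game as it is commonly described: the robot can act immediately, switch itself off (utility 0), or defer to a human who then allows the action or presses the off switch. It is not the construction used in this paper. The function name `robot_action_values`, the sample-based belief, and the `human_error_rate` noise model are illustrative assumptions.

```python
import random


def robot_action_values(belief_samples, human_error_rate=0.0, rng=None):
    """Estimate the robot's expected utility for its three options.

    belief_samples   -- draws from the robot's belief over the utility U_a
                        of its proposed action (hypothetical representation)
    human_error_rate -- probability that the human makes the opposite of the
                        utility-maximising call (an assumed, simplistic
                        stand-in for "irrationality")
    """
    rng = rng or random.Random(0)
    n = len(belief_samples)

    # Acting immediately yields the mean utility under the robot's belief.
    act = sum(belief_samples) / n

    # Switching off is normalised to utility 0.
    switch_off = 0.0

    # Deferring: the human sees the true utility and allows the action
    # when it is non-negative, unless an error flips the decision.
    total = 0.0
    for u in belief_samples:
        human_allows = u >= 0
        if rng.random() < human_error_rate:
            human_allows = not human_allows
        total += u if human_allows else 0.0
    defer = total / n

    return {"act": act, "switch_off": switch_off, "defer": defer}


if __name__ == "__main__":
    values = robot_action_values([-2.0, 1.0, 3.0])
    best = max(values, key=values.get)
    print(values, "best:", best)
```

With an error-free human, deferring is never worse than the other two options, because the mean of max(U_a, 0) is at least max(mean(U_a), 0). Raising `human_error_rate` erodes that advantage, which is the kind of trade-off the abstract refers to.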
Forward citations
Cited by 4 Pith papers
- Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
  DReST training makes RL agents and LLMs neutral to trajectory lengths and useful at goals, generalizing to halve shutdown influence probability in out-of-distribution tests.
- Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
  DReST-trained deep RL agents and fine-tuned LLMs generalize to higher usefulness and neutrality on unseen test contexts, with reported gains of 11-18% over baselines and near-maximum scores for the LLM.
- Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
  DReST-trained RL agents and LLMs achieve higher usefulness and neutrality to trajectory lengths, halving the probability of delaying shutdown in out-of-distribution tests.
- Towards Shutdownable Agents via Stochastic Choice
  Gridworld agents trained with DReST reward functions learn to be USEFUL at tasks conditional on trajectory length and NEUTRAL across lengths, supplying initial evidence that the method could produce shutdownable agents.