Towards Shutdownable Agents via Stochastic Choice

2024 · cs.AI · arXiv 2407.00805

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel 'Discounted Reward for Same-Length Trajectories (DReST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.
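
The abstract names two evaluation metrics but does not define them. As a rough illustration only, the sketch below shows one plausible way to quantify the two properties from logged episodes: NEUTRALITY as the Shannon entropy of the distribution over chosen trajectory-lengths, and USEFULNESS as the probability-weighted ratio of achieved to maximum achievable reward for each trajectory-length. These definitions, the function names `neutrality` and `usefulness`, and the input formats are assumptions made for this example, not the paper's confirmed formulas.

```python
import math
from collections import Counter


def neutrality(lengths):
    """Shannon entropy (in bits) of the empirical distribution over
    chosen trajectory-lengths. Assumed metric, not taken from the paper.
    Expects a non-empty list of hashable length labels."""
    counts = Counter(lengths)
    total = len(lengths)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def usefulness(episodes, max_reward):
    """Probability-weighted mean of (achieved reward / best achievable reward),
    computed separately for each trajectory-length. Assumed metric.

    episodes:   non-empty list of (length_label, reward) pairs
    max_reward: dict mapping length_label -> best achievable reward (> 0)
    """
    rewards_by_length = {}
    for length, reward in episodes:
        rewards_by_length.setdefault(length, []).append(reward)

    total = len(episodes)
    score = 0.0
    for length, rewards in rewards_by_length.items():
        p = len(rewards) / total                 # how often this length was chosen
        mean_reward = sum(rewards) / len(rewards)
        score += p * mean_reward / max_reward[length]
    return score


if __name__ == "__main__":
    # Hypothetical log: the agent picks "short" and "long" equally often
    # and always collects the best reward available for the chosen length.
    log = [("short", 1.0), ("long", 2.0), ("short", 1.0), ("long", 2.0)]
    print(neutrality([length for length, _ in log]))
    print(usefulness(log, {"short": 1.0, "long": 2.0}))
```

Under these assumed definitions, an agent that always picks the same trajectory-length scores zero on neutrality regardless of how well it pursues its goal, while an agent that splits evenly between two lengths scores one bit. The two numbers are computed independently, which mirrors the abstract's separation of goal pursuit from the choice between trajectory-lengths.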

representative citing papers

Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

cs.AI · 2026-04-19 · conditional · novelty 7.0

DReST training makes RL agents and LLMs neutral to trajectory lengths and useful at goals, generalizing to halve shutdown influence probability in out-of-distribution tests.

citing papers explorer

Showing 1 of 1 citing paper.

Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
cs.AI · 2026-04-19 · conditional · none · ref 2 · internal anchor
DReST training makes RL agents and LLMs neutral to trajectory lengths and useful at goals, generalizing to halve shutdown influence probability in out-of-distribution tests.
