pith:EEJNUWLL
Deep reinforcement learning from human preferences
Reinforcement learning agents can learn complex behaviors such as Atari games and robot locomotion from human preferences over pairs of trajectory segments instead of engineered rewards.
arxiv:1706.03741 v4 · 2017-06-12 · stat.ML · cs.AI · cs.HC · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{EEJNUWLLGF4XP6EY6FGCTLR4XV}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment.
The approach assumes that human preferences over short trajectory segments can be consistently modeled by a reward function that generalizes well enough to guide policy optimization on the full task, without reward hacking or inconsistency.
Reinforcement learning agents solve complex tasks without access to the reward function by training a reward predictor from human comparisons of trajectory segments, requiring feedback on less than 1% of interactions.
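The claims above refer to a reward predictor trained from pairwise comparisons but do not spell out its form. A minimal sketch, assuming a Bradley-Terry style model in which the probability of preferring one segment is a logistic function of the difference in summed predicted rewards, and a cross-entropy loss against the human label. The function names and the 0/0.5/1 label encoding are illustrative assumptions, not taken from this record.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B.

    Assumed model: logistic function of the difference between the
    summed predicted rewards of the two segments (Bradley-Terry style).
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable logistic: avoid exponentiating large positive values.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between the predicted preference and the human label.

    label is an assumed encoding: 1.0 if the human preferred A,
    0.0 if they preferred B, 0.5 if the segments were judged equal.
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


if __name__ == "__main__":
    # Hypothetical per-step reward predictions for two short segments.
    seg_a = [0.2, 0.5, 0.1]
    seg_b = [0.0, 0.1, 0.1]
    print(preference_probability(seg_a, seg_b))
    print(preference_loss(seg_a, seg_b, label=1.0))
```

In a full system the per-step rewards would come from a learned network and the loss would be minimized by gradient descent over many labeled pairs; this sketch only shows the scoring of a single comparison.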
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:48.474584Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
2112da596b317977f898f14c29ae3cbd56e9403932bbfda3094ee5b2169aad7f
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2112da596b317977f898f14c29ae3cbd56e9403932bbfda3094ee5b2169aad7f
Canonical record JSON
{
"metadata": {
"abstract_canon_sha256": "ff8e60ebfff031fd0eb18e5acedfde3176fe496e13eae854152798fc2da3d728",
"cross_cats_sorted": [
"cs.AI",
"cs.HC",
"cs.LG"
],
"license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
"primary_cat": "stat.ML",
"submitted_at": "2017-06-12T17:23:59Z",
"title_canon_sha256": "c26d0dd48abbea12aea6ba91308ea0ab806eda720b3e557933897e49bf30ecc2"
},
"schema_version": "1.0",
"source": {
"id": "1706.03741",
"kind": "arxiv",
"version": 4
}
}
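The same check can be run offline against the record shown above. A minimal sketch, assuming the canonical hash is the SHA-256 of the record serialized with sorted keys, compact separators, and UTF-8 encoding, as the shell one-liner suggests. The helper name `canonical_sha256` is hypothetical, and whether the digest matches depends on the service hashing exactly this serialization.

```python
import hashlib
import json


def canonical_sha256(record):
    """Hash a JSON-compatible object using the serialization from the
    one-liner: sorted keys, compact separators, no ASCII escaping, UTF-8."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "ff8e60ebfff031fd0eb18e5acedfde3176fe496e13eae854152798fc2da3d728",
        "cross_cats_sorted": ["cs.AI", "cs.HC", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "stat.ML",
        "submitted_at": "2017-06-12T17:23:59Z",
        "title_canon_sha256": "c26d0dd48abbea12aea6ba91308ea0ab806eda720b3e557933897e49bf30ecc2",
    },
    "schema_version": "1.0",
    "source": {"id": "1706.03741", "kind": "arxiv", "version": 4},
}

# Canonical hash as published on this page.
EXPECTED = "2112da596b317977f898f14c29ae3cbd56e9403932bbfda3094ee5b2169aad7f"

if __name__ == "__main__":
    digest = canonical_sha256(RECORD)
    print(digest)
    print("match" if digest == EXPECTED else "mismatch")
```

Because keys are sorted before hashing, the order in which the fields appear in the dictionary does not affect the digest.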