Pith Number

pith:EEJNUWLL

pith:2017:EEJNUWLLGF4XP6EY6FGCTLR4XV
not attested · not anchored · not stored · refs resolved

Deep reinforcement learning from human preferences

Dario Amodei, Jan Leike, Miljan Martic, Paul Christiano, Shane Legg, Tom B. Brown

Reinforcement learning agents can learn complex behaviors such as Atari games and robot locomotion from human preferences over pairs of trajectory segments instead of engineered rewards.

arxiv:1706.03741 v4 · 2017-06-12 · stat.ML · cs.AI · cs.HC · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{EEJNUWLLGF4XP6EY6FGCTLR4XV}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
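
The property described above (any mirror recomputes the same state from the same events) can be illustrated with a small sketch. The actual Pith bundle format and merge rules are not given on this page, so everything here is hypothetical: the event fields `timestamp`, `field`, and `value`, the last-writer-wins rule, and the function names are assumptions for illustration only, and signature checking is omitted.

```python
import hashlib
import json


def event_key(event):
    """Total order over events: timestamp first, then a content hash as tie-breaker."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return (event["timestamp"], hashlib.sha256(payload).hexdigest())


def merge_state(canonical_record, events):
    """Fold events into a copy of the record in a fixed order.

    Because events are sorted by a total order before being applied,
    the result does not depend on the order in which a mirror received them.
    """
    state = dict(canonical_record)
    for event in sorted(events, key=event_key):
        # Hypothetical rule: the latest event for a field wins.
        state[event["field"]] = event["value"]
    return state


# Hypothetical record and events, not taken from a real bundle.
record = {"schema_version": "1.0"}
events = [
    {"timestamp": "2026-01-02T00:00:00Z", "field": "note", "value": "second"},
    {"timestamp": "2026-01-01T00:00:00Z", "field": "note", "value": "first"},
]

print(merge_state(record, events))
```

Sorting by a total order before folding is one common way to make a merge deterministic; a real implementation would also need to verify each event's signature before applying it.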

Claims

C1 · strongest claim

We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment.

C2 · weakest assumption

Human preferences over short trajectory segments can be modeled consistently by a reward function that generalizes well enough to guide policy optimization on the full task without reward hacking or inconsistency.

C3 · one line summary

Reinforcement learning agents solve complex tasks without access to the reward function by training a reward predictor from human comparisons of trajectory segments, requiring feedback on less than 1% of interactions.
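
A minimal sketch of the kind of objective a reward predictor trained on pairwise comparisons can use, assuming the Bradley-Terry style model described in the paper: the probability that a human prefers one segment is a softmax over the two segments' summed predicted rewards, and the predictor is fit with cross-entropy against the human label. The function names are illustrative, and the plain lists of per-step rewards stand in for the outputs of a learned network.

```python
import math


def segment_return(rewards):
    """Sum the predicted per-step rewards over one trajectory segment."""
    return sum(rewards)


def preference_probability(rewards_1, rewards_2):
    """Probability that segment 1 is preferred over segment 2.

    Equivalent to exp(R1) / (exp(R1) + exp(R2)), written as a logistic of
    the return difference and branched on sign to avoid overflow.
    """
    diff = segment_return(rewards_1) - segment_return(rewards_2)
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_1, rewards_2, mu_1):
    """Cross-entropy between the predicted preference and the human label.

    mu_1 is the label mass on segment 1: 1.0 if it was preferred,
    0.0 if segment 2 was preferred, 0.5 if they were judged equal.
    """
    p1 = preference_probability(rewards_1, rewards_2)
    eps = 1e-12  # guards against log(0)
    return -(mu_1 * math.log(p1 + eps) + (1.0 - mu_1) * math.log(1.0 - p1 + eps))


# Illustrative numbers only: segment 1 has the higher predicted return.
seg_1 = [0.5, 1.0, 0.5]
seg_2 = [0.0, 0.2, 0.1]
print(preference_probability(seg_1, seg_2))
print(preference_loss(seg_1, seg_2, mu_1=1.0))
```

Minimizing this loss over many labeled pairs pushes the predicted returns to agree with the human comparisons; the policy is then optimized against the predicted reward rather than an engineered one.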

References

14 extracted · 14 resolved · 6 Pith anchors

[1] TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems · arXiv:1603.04467
[2] Concrete Problems in AI Safety · arXiv:1606.06565
[3] A Bayesian interactive optimization approach to procedural animation design · 2010
[4] OpenAI Gym · arXiv:1606.01540
[5] Deep Q-learning from Demonstrations · arXiv:1704.03732

Formal links

2 machine-checked theorem links

Cited by

26 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:48.474584Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

2112da596b317977f898f14c29ae3cbd56e9403932bbfda3094ee5b2169aad7f

Aliases

arxiv: 1706.03741 · arxiv_version: 1706.03741v4 · doi: 10.48550/arxiv.1706.03741 · pith_short_12: EEJNUWLLGF4X · pith_short_16: EEJNUWLLGF4XP6EY · pith_short_8: EEJNUWLL
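
The identifiers on this page appear to be related to the canonical hash: base32-encoding the hash bytes (RFC 4648 alphabet) and keeping the first 26 characters reproduces the Pith Number, and the short aliases are prefixes of it. This is an observation from the values shown here, not a documented rule, and the function name and `length` parameter below are illustrative.

```python
import base64

# Canonical hash as shown on this page.
canonical_hash = "2112da596b317977f898f14c29ae3cbd56e9403932bbfda3094ee5b2169aad7f"


def pith_number_from_hash(hex_digest, length=26):
    """Base32-encode the digest bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


pith_number = pith_number_from_hash(canonical_hash)
print(pith_number)       # EEJNUWLLGF4XP6EY6FGCTLR4XV
print(pith_number[:16])  # EEJNUWLLGF4XP6EY (pith_short_16)
print(pith_number[:12])  # EEJNUWLLGF4X     (pith_short_12)
print(pith_number[:8])   # EEJNUWLL         (pith_short_8)
```
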
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2112da596b317977f898f14c29ae3cbd56e9403932bbfda3094ee5b2169aad7f
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "ff8e60ebfff031fd0eb18e5acedfde3176fe496e13eae854152798fc2da3d728",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.HC",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "stat.ML",
    "submitted_at": "2017-06-12T17:23:59Z",
    "title_canon_sha256": "c26d0dd48abbea12aea6ba91308ea0ab806eda720b3e557933897e49bf30ecc2"
  },
  "schema_version": "1.0",
  "source": {
    "id": "1706.03741",
    "kind": "arxiv",
    "version": 4
  }
}
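
The canonicalization step of the shell pipeline above can also be run offline in plain Python. The sketch below uses the same `json.dumps` settings as the one-liner (sorted keys, compact separators, `ensure_ascii=False`) on the record as displayed on this page. It assumes the displayed JSON is the same `canonical_record` the API returns; whether the printed digest matches the canonical hash has not been checked here. The function names are illustrative.

```python
import hashlib
import json


def canonical_bytes(record):
    """Serialize with sorted keys and no extra whitespace, encoded as UTF-8."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_hash(record):
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# Record copied from the "Canonical record JSON" section of this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "ff8e60ebfff031fd0eb18e5acedfde3176fe496e13eae854152798fc2da3d728",
        "cross_cats_sorted": ["cs.AI", "cs.HC", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "stat.ML",
        "submitted_at": "2017-06-12T17:23:59Z",
        "title_canon_sha256": "c26d0dd48abbea12aea6ba91308ea0ab806eda720b3e557933897e49bf30ecc2",
    },
    "schema_version": "1.0",
    "source": {"id": "1706.03741", "kind": "arxiv", "version": 4},
}

# Compare this against the "Canonical hash" value listed on the page.
print(canonical_hash(record))
```

Because keys are sorted and whitespace is fixed, two mirrors that hold the same record produce the same bytes and therefore the same digest, regardless of how their JSON libraries order keys in memory.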