Pith Number

pith:EEJNUWLL

pith:2017:EEJNUWLLGF4XP6EY6FGCTLR4XV

not attested · not anchored · not stored · refs resolved

Deep reinforcement learning from human preferences

Dario Amodei, Jan Leike, Miljan Martic, Paul Christiano, Shane Legg, Tom B. Brown

Reinforcement learning agents can learn complex behaviors such as Atari games and robot locomotion from human preferences over pairs of trajectory segments instead of engineered rewards.

arxiv:1706.03741 v4 · 2017-06-12 · stat.ML · cs.AI · cs.HC · cs.LG


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{EEJNUWLLGF4XP6EY6FGCTLR4XV}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
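
As a rough illustration of why a deterministic merge lets independent mirrors agree, here is a minimal Python sketch. The event fields (`ts`, `field`, `value`), the ordering rule, and the last-writer-wins fold are all hypothetical; this page does not describe the actual Pith bundle format or merge algorithm, and signature checking is omitted.

```python
import hashlib
import json

def event_id(event: dict) -> str:
    """Content-address an event by hashing its canonical JSON form."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()

def merge(canonical_record: dict, events: list) -> dict:
    """Fold events into a state in an order that depends only on event content."""
    state = {"record": canonical_record, "fields": {}}
    # Deduplicate by content ID so receiving the same event twice changes nothing.
    unique = {event_id(e): e for e in events}
    # Hypothetical ordering rule: timestamp first, content ID as the tie-breaker.
    ordered = sorted(unique.items(), key=lambda item: (item[1]["ts"], item[0]))
    for _, e in ordered:
        state["fields"][e["field"]] = e["value"]  # last writer wins (assumed)
    return state

# Hypothetical events, as two mirrors might receive them in different orders.
events = [
    {"ts": "2026-05-17T23:38:48Z", "field": "refs", "value": "resolved"},
    {"ts": "2026-05-18T08:00:00Z", "field": "bundle", "value": "live"},
    {"ts": "2026-05-18T08:00:00Z", "field": "refs", "value": "resolved"},
]
record = {"source": {"id": "1706.03741", "kind": "arxiv", "version": 4}}

mirror_a = merge(record, events)
mirror_b = merge(record, list(reversed(events)) + events[:1])
print(mirror_a == mirror_b)
```

Because the fold order is a function of the events alone, arrival order and duplicate deliveries do not affect the result in this sketch.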

Claims

C1 strongest claim

We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment.

C2 weakest assumption

That human preferences over short trajectory segments can be consistently modeled by a reward function that generalizes well enough to guide policy optimization without reward hacking or inconsistency on the full task.

C3 one-line summary

Reinforcement learning agents solve complex tasks without access to the reward function by training a reward predictor from human comparisons of trajectory segments, requiring feedback on less than 1% of interactions.

References

14 extracted · 14 resolved · 6 Pith anchors

[1] TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems · arXiv:1603.04467

[2] Concrete Problems in AI Safety · arXiv:1606.06565

[3] A Bayesian Interactive Optimization Approach to Procedural Animation Design · 2010

[4] OpenAI Gym · arXiv:1606.01540

[5] Deep Q-learning from Demonstrations · arXiv:1704.03732

Formal links

2 machine-checked theorem links

Cited by

26 papers in Pith

An Overview of Catastrophic AI Risks

Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning

EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention

Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks

Failure Modes of Maximum Entropy RLHF

Receipt and verification

First computed	2026-05-17T23:38:48.474584Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

2112da596b317977f898f14c29ae3cbd56e9403932bbfda3094ee5b2169aad7f

Aliases

arxiv: 1706.03741 · arxiv_version: 1706.03741v4 · doi: 10.48550/arxiv.1706.03741 · pith_short_12: EEJNUWLLGF4X · pith_short_16: EEJNUWLLGF4XP6EY · pith_short_8: EEJNUWLL
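
The identifiers on this page appear to be related to the canonical hash: Base32-encoding the hash bytes (RFC 4648 alphabet) and keeping the first 26 characters reproduces the Pith Number shown here, and each `pith_short_*` alias is a prefix of it. The sketch below checks that relationship for this record only; treating it as the official derivation rule is an assumption, and the helper name `base32_prefix` is illustrative.

```python
import base64

canonical_hash = "2112da596b317977f898f14c29ae3cbd56e9403932bbfda3094ee5b2169aad7f"

def base32_prefix(hex_digest: str, length: int) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]

pith_number = base32_prefix(canonical_hash, 26)
print(pith_number)  # EEJNUWLLGF4XP6EY6FGCTLR4XV

# The short aliases listed above are prefixes of the full identifier.
aliases = {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}
print(aliases)
```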

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2112da596b317977f898f14c29ae3cbd56e9403932bbfda3094ee5b2169aad7f

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "ff8e60ebfff031fd0eb18e5acedfde3176fe496e13eae854152798fc2da3d728",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.HC",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "stat.ML",
    "submitted_at": "2017-06-12T17:23:59Z",
    "title_canon_sha256": "c26d0dd48abbea12aea6ba91308ea0ab806eda720b3e557933897e49bf30ecc2"
  },
  "schema_version": "1.0",
  "source": {
    "id": "1706.03741",
    "kind": "arxiv",
    "version": 4
  }
}
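
For an offline check, the same serialization used in the shell command above (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) can be applied to the record as printed here. This is a minimal sketch: the function name `canonical_sha256` is illustrative, and the digest should match the canonical hash only if the object below is exactly what the resolver returns as `canonical_record`, which this page does not guarantee.

```python
import hashlib
import json

# The canonical record exactly as displayed above.
record = {
    "metadata": {
        "abstract_canon_sha256": "ff8e60ebfff031fd0eb18e5acedfde3176fe496e13eae854152798fc2da3d728",
        "cross_cats_sorted": ["cs.AI", "cs.HC", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "stat.ML",
        "submitted_at": "2017-06-12T17:23:59Z",
        "title_canon_sha256": "c26d0dd48abbea12aea6ba91308ea0ab806eda720b3e557933897e49bf30ecc2",
    },
    "schema_version": "1.0",
    "source": {"id": "1706.03741", "kind": "arxiv", "version": 4},
}

def canonical_sha256(obj) -> str:
    """Serialize with sorted keys and no whitespace, then hash the UTF-8 bytes."""
    blob = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

digest = canonical_sha256(record)
print(digest)  # compare by eye against the "Canonical hash" section
```

Sorting keys before hashing is what makes the digest independent of the order in which a JSON producer happens to emit fields.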