{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2017:EEJNUWLLGF4XP6EY6FGCTLR4XV","short_pith_number":"pith:EEJNUWLL","schema_version":"1.0","canonical_sha256":"2112da596b317977f898f14c29ae3cbd56e9403932bbfda3094ee5b2169aad7f","source":{"kind":"arxiv","id":"1706.03741","version":4},"attestation_state":"computed","paper":{"title":"Deep reinforcement learning from human preferences","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Reinforcement learning agents can learn complex behaviors such as Atari games and robot locomotion from human preferences over pairs of trajectory segments instead of engineered rewards.","cross_cats":["cs.AI","cs.HC","cs.LG"],"primary_cat":"stat.ML","authors_text":"Dario Amodei, Jan Leike, Miljan Martic, Paul Christiano, Shane Legg, Tom B. Brown","submitted_at":"2017-06-12T17:23:59Z","abstract_excerpt":"For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it c"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"1706.03741","kind":"arxiv","version":4},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"stat.ML","submitted_at":"2017-06-12T17:23:59Z","cross_cats_sorted":["cs.AI","cs.HC","cs.LG"],"title_canon_sha256":"c26d0dd48abbea12aea6ba91308ea0ab806eda720b3e557933897e49bf30ecc2","abstract_canon_sha256":"ff8e60ebfff031fd0eb18e5acedfde3176fe496e13eae854152798fc2da3d728"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.475047Z","signature_b64":"x9nmbELb+EO1o3acyxgfkQYhfsGWhfJHSBfILD7r0Wm9VIE5nx3Vho2nKQcr0ztNQIcA9opfccbEBrB+1eLOBQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"2112da596b317977f898f14c29ae3cbd56e9403932bbfda3094ee5b2169aad7f","last_reissued_at":"2026-05-17T23:38:48.474584Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.474584Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Deep reinforcement learning from human preferences","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Reinforcement learning agents can learn complex behaviors such as Atari games and robot locomotion from human preferences over pairs of trajectory segments instead of engineered rewards.","cross_cats":["cs.AI","cs.HC","cs.LG"],"primary_cat":"stat.ML","authors_text":"Dario Amodei, Jan Leike, Miljan Martic, Paul Christiano, Shane Legg, Tom B. Brown","submitted_at":"2017-06-12T17:23:59Z","abstract_excerpt":"For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it c"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That human preferences over short trajectory segments can be consistently modeled by a reward function that generalizes well enough to guide policy optimization without reward hacking or inconsistency on the full task.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Reinforcement learning agents solve complex tasks without access to the reward function by training a reward predictor from human comparisons of trajectory segments, requiring feedback on less than 1% of interactions.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Reinforcement learning agents can learn complex behaviors such as Atari games and robot locomotion from human preferences over pairs of trajectory segments instead of engineered rewards.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1d1477574e4b56d3f73dce0501982e09c2d303b9151a8b9b668403a8f2105359"},"source":{"id":"1706.03741","kind":"arxiv","version":4},"verdict":{"id":"1ecccbcb-5465-406c-a120-2b4e75d88e59","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T08:35:19.884292Z","strongest_claim":"We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment.","one_line_summary":"Reinforcement learning agents solve complex tasks without access to the reward function by training a reward predictor from human comparisons of trajectory segments, requiring feedback on less than 1% of interactions.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That human preferences over short trajectory segments can be consistently modeled by a reward function that generalizes well enough to guide policy optimization without reward hacking or inconsistency on the full task.","pith_extraction_headline":"Reinforcement learning agents can learn complex behaviors such as Atari games and robot locomotion from human preferences over pairs of trajectory segments instead of engineered rewards."},"references":{"count":14,"sample":[{"doi":"","year":null,"title":"TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems","work_id":"91f3c09e-dae6-48ca-80c0-463dd1b1f6e1","ref_index":1,"cited_arxiv_id":"1603.04467","is_internal_anchor":true},{"doi":"","year":null,"title":"Concrete Problems in AI Safety","work_id":"c8d14fbe-6eab-464a-95b3-778aabd82fa3","ref_index":2,"cited_arxiv_id":"1606.06565","is_internal_anchor":true},{"doi":"","year":2010,"title":"A bayesian interactive optimization approach to procedural animation design","work_id":"a2ea06cf-ee83-47a5-846a-a82273829a4d","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540,","work_id":"6af98f3f-f074-41ae-a689-7dd7b4b8efde","ref_index":4,"cited_arxiv_id":"1606.01540","is_internal_anchor":true},{"doi":"","year":null,"title":"Deep Q-learning from Demonstrations","work_id":"3d67a954-e5a3-409f-a53c-3100d6063c5f","ref_index":5,"cited_arxiv_id":"1704.03732","is_internal_anchor":true}],"resolved_work":14,"snapshot_sha256":"fecc59eb2cfc6f5c54d2665625f562214f2620dcf5d38746a9a0add16ff206e5","internal_anchors":6},"formal_canon":{"evidence_count":2,"snapshot_sha256":"856c0889dbc7c0af35cd064d7e456d5bc26700aae668d11f3d0b896725ccf03e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"1706.03741","created_at":"2026-05-17T23:38:48.474670+00:00"},{"alias_kind":"arxiv_version","alias_value":"1706.03741v4","created_at":"2026-05-17T23:38:48.474670+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.1706.03741","created_at":"2026-05-17T23:38:48.474670+00:00"},{"alias_kind":"pith_short_12","alias_value":"EEJNUWLLGF4X","created_at":"2026-05-18T12:31:12.930513+00:00"},{"alias_kind":"pith_short_16","alias_value":"EEJNUWLLGF4XP6EY","created_at":"2026-05-18T12:31:12.930513+00:00"},{"alias_kind":"pith_short_8","alias_value":"EEJNUWLL","created_at":"2026-05-18T12:31:12.930513+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2306.12001","citing_title":"An Overview of Catastrophic AI Risks","ref_index":129,"is_internal_anchor":true},{"citing_arxiv_id":"2506.08125","citing_title":"Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2508.16771","citing_title":"EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2509.06701","citing_title":"Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2509.20265","citing_title":"Failure Modes of Maximum Entropy RLHF","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2510.04265","citing_title":"Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2308.03958","citing_title":"Simple synthetic data reduces sycophancy in large language models","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2505.18719","citing_title":"VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2309.16797","citing_title":"Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2212.03827","citing_title":"Discovering Latent Knowledge in Language Models Without Supervision","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2410.07283","citing_title":"Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12809","citing_title":"Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces","ref_index":216,"is_internal_anchor":true},{"citing_arxiv_id":"2406.06592","citing_title":"Improve Mathematical Reasoning in Language Models by Automated Process Supervision","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2403.19647","citing_title":"Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10528","citing_title":"Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2407.21787","citing_title":"Large Language Monkeys: Scaling Inference Compute with Repeated Sampling","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24176","citing_title":"Explanation Quality Assessment as Ranking with Listwise Rewards","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23338","citing_title":"A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04542","citing_title":"Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01311","citing_title":"The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2112.00861","citing_title":"A General Language Assistant as a Laboratory for Alignment","ref_index":215,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11554","citing_title":"Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07172","citing_title":"Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06788","citing_title":"From Perception to Autonomous Computational Modeling: A Multi-Agent Approach","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13803","citing_title":"Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation","ref_index":33,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV","json":"https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV.json","graph_json":"https://pith.science/api/pith-number/EEJNUWLLGF4XP6EY6FGCTLR4XV/graph.json","events_json":"https://pith.science/api/pith-number/EEJNUWLLGF4XP6EY6FGCTLR4XV/events.json","paper":"https://pith.science/paper/EEJNUWLL"},"agent_actions":{"view_html":"https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV","download_json":"https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV.json","view_paper":"https://pith.science/paper/EEJNUWLL","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=1706.03741&json=true","fetch_graph":"https://pith.science/api/pith-number/EEJNUWLLGF4XP6EY6FGCTLR4XV/graph.json","fetch_events":"https://pith.science/api/pith-number/EEJNUWLLGF4XP6EY6FGCTLR4XV/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV/action/timestamp_anchor","attest_storage":"https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV/action/storage_attestation","attest_author":"https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV/action/author_attestation","sign_citation":"https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV/action/citation_signature","submit_replication":"https://pith.science/pith/EEJNUWLLGF4XP6EY6FGCTLR4XV/action/replication_record"}},"created_at":"2026-05-17T23:38:48.474670+00:00","updated_at":"2026-05-17T23:38:48.474670+00:00"}