{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2019:QZPNLERNGAKNBUDI22PIJ77TFS","short_pith_number":"pith:QZPNLERN","schema_version":"1.0","canonical_sha256":"865ed5922d3014d0d068d69e84fff32caf82ec05d3984e27e9a7f6d3678b1b63","source":{"kind":"arxiv","id":"1911.08265","version":2},"attestation_state":"computed","paper":{"title":"Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MuZero achieves superhuman performance in Atari, Go, chess and shogi by learning a model that predicts only the reward, policy and value needed for planning.","cross_cats":["stat.ML"],"primary_cat":"cs.LG","authors_text":"Arthur Guez, David Silver, Demis Hassabis, Edward Lockhart, Ioannis Antonoglou, Julian Schrittwieser, Karen Simonyan, Laurent Sifre, Simon Schmitt, Thomas Hubert, Thore Graepel, Timothy Lillicrap","submitted_at":"2019-11-19T13:58:52Z","abstract_excerpt":"Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their u"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"1911.08265","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2019-11-19T13:58:52Z","cross_cats_sorted":["stat.ML"],"title_canon_sha256":"e1b6e9a101ccbe0c56a2ef0ef7c6625e447d26efc29afa6fac4b14d44e852264","abstract_canon_sha256":"c17ad5bd13adb1c24b6da393fcd423d7782deb1152195b6ec0860f51eaf91b91"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.178211Z","signature_b64":"+EMevCl0ypFTcszIMVW0TtD9iSewfXpt/Q6rCryVm17+Q7ZxdTrG9hmcEIZjbYb4Z7Yg6eNzVvpMCw6q6feTCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"865ed5922d3014d0d068d69e84fff32caf82ec05d3984e27e9a7f6d3678b1b63","last_reissued_at":"2026-05-17T23:38:46.177763Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.177763Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MuZero achieves superhuman performance in Atari, Go, chess and shogi by learning a model that predicts only the reward, policy and value needed for planning.","cross_cats":["stat.ML"],"primary_cat":"cs.LG","authors_text":"Arthur Guez, David Silver, Demis Hassabis, Edward Lockhart, Ioannis Antonoglou, Julian Schrittwieser, Karen Simonyan, Laurent Sifre, Simon Schmitt, Thomas Hubert, Thore Graepel, Timothy Lillicrap","submitted_at":"2019-11-19T13:58:52Z","abstract_excerpt":"Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their u"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MuZero achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the learned model, when applied iteratively inside tree search, produces sufficiently accurate long-horizon predictions of reward, policy, and value to support effective planning even when the true dynamics are unknown and high-dimensional.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MuZero matches or exceeds AlphaZero-level performance in Go, Chess, Shogi and sets a new state of the art on 57 Atari games by learning a model that directly supports planning rather than reconstructing full environment dynamics.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"MuZero achieves superhuman performance in Atari, Go, chess and shogi by learning a model that predicts only the reward, policy and value needed for planning.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"fc5d4d70e58143e2eb08613abb5d8fb5234c20e7280a7423dcc820c749673d35"},"source":{"id":"1911.08265","kind":"arxiv","version":2},"verdict":{"id":"d05ceb39-5633-41b4-9a1c-ba680c1e5d23","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T23:52:41.366359Z","strongest_claim":"MuZero achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics.","one_line_summary":"MuZero matches or exceeds AlphaZero-level performance in Go, Chess, Shogi and sets a new state of the art on 57 Atari games by learning a model that directly supports planning rather than reconstructing full environment dynamics.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the learned model, when applied iteratively inside tree search, produces sufficiently accurate long-horizon predictions of reward, policy, and value to support effective planning even when the true dynamics are unknown and high-dimensional.","pith_extraction_headline":"MuZero achieves superhuman performance in Atari, Go, chess and shogi by learning a model that predicts only the reward, policy and value needed for planning."},"references":{"count":53,"sample":[{"doi":"","year":2018,"title":"Lipton, and Animashree Anandkumar","work_id":"89d8a872-e25f-4e79-971e-9aad2c2d136a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2013,"title":"The arcade learning environment: An evaluation platform for general agents","work_id":"dd383516-d2cf-40d5-b95c-99ff0ca6f83d","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Superhuman ai for heads-up no-limit poker: Libratus beats top profes- sionals","work_id":"2356edfc-3c56-477c-848e-709319c2218b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Learning and Querying Fast Generative Models for Reinforcement Learning","work_id":"45700551-6f99-4914-b123-083e4ac20e0a","ref_index":4,"cited_arxiv_id":"1802.03006","is_internal_anchor":true},{"doi":"","year":2002,"title":"Joseph Hoane, Jr., and Feng-hsiung Hsu","work_id":"313124b1-b65d-4318-85e2-cb28a84a6476","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":53,"snapshot_sha256":"fd65d6b50c28d5f2436bc2689d847b03db653ada1b97a74fc8a57871844fecde","internal_anchors":6},"formal_canon":{"evidence_count":2,"snapshot_sha256":"01cc8924ba4b674bba76f2d749503c125b742dbd28c604fa0de38be27d3fb9d8"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"1911.08265","created_at":"2026-05-17T23:38:46.177838+00:00"},{"alias_kind":"arxiv_version","alias_value":"1911.08265v2","created_at":"2026-05-17T23:38:46.177838+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.1911.08265","created_at":"2026-05-17T23:38:46.177838+00:00"},{"alias_kind":"pith_short_12","alias_value":"QZPNLERNGAKN","created_at":"2026-05-18T12:33:27.125529+00:00"},{"alias_kind":"pith_short_16","alias_value":"QZPNLERNGAKNBUDI","created_at":"2026-05-18T12:33:27.125529+00:00"},{"alias_kind":"pith_short_8","alias_value":"QZPNLERN","created_at":"2026-05-18T12:33:27.125529+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":21,"internal_anchor_count":21,"sample":[{"citing_arxiv_id":"2605.19341","citing_title":"HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19503","citing_title":"ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15053","citing_title":"TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2507.12549","citing_title":"The Serial Scaling Hypothesis","ref_index":97,"is_internal_anchor":true},{"citing_arxiv_id":"2512.10226","citing_title":"Latent Chain-of-Thought World Modeling for End-to-End Driving","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2512.21648","citing_title":"Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2602.01505","citing_title":"Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2211.15657","citing_title":"Is Conditional Generative Modeling all you need for Decision-Making?","ref_index":130,"is_internal_anchor":true},{"citing_arxiv_id":"2509.24527","citing_title":"Training Agents Inside of Scalable World Models","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2010.02193","citing_title":"Mastering Atari with Discrete World Models","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12653","citing_title":"Plan Before You Trade: Inference-Time Optimization for RL Trading Agents","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09364","citing_title":"Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08982","citing_title":"PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"1912.01603","citing_title":"Dream to Control: Learning Behaviors by Latent Imagination","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01694","citing_title":"Latent State Design for World Models under Sufficiency Constraints","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2301.04104","citing_title":"Mastering Diverse Domains through World Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04518","citing_title":"Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00940","citing_title":"Interpretable experiential learning based on state history and global feedback","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03434","citing_title":"Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25859","citing_title":"Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06373","citing_title":"Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\\tau$-Mixing","ref_index":60,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/QZPNLERNGAKNBUDI22PIJ77TFS","json":"https://pith.science/pith/QZPNLERNGAKNBUDI22PIJ77TFS.json","graph_json":"https://pith.science/api/pith-number/QZPNLERNGAKNBUDI22PIJ77TFS/graph.json","events_json":"https://pith.science/api/pith-number/QZPNLERNGAKNBUDI22PIJ77TFS/events.json","paper":"https://pith.science/paper/QZPNLERN"},"agent_actions":{"view_html":"https://pith.science/pith/QZPNLERNGAKNBUDI22PIJ77TFS","download_json":"https://pith.science/pith/QZPNLERNGAKNBUDI22PIJ77TFS.json","view_paper":"https://pith.science/paper/QZPNLERN","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=1911.08265&json=true","fetch_graph":"https://pith.science/api/pith-number/QZPNLERNGAKNBUDI22PIJ77TFS/graph.json","fetch_events":"https://pith.science/api/pith-number/QZPNLERNGAKNBUDI22PIJ77TFS/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/QZPNLERNGAKNBUDI22PIJ77TFS/action/timestamp_anchor","attest_storage":"https://pith.science/pith/QZPNLERNGAKNBUDI22PIJ77TFS/action/storage_attestation","attest_author":"https://pith.science/pith/QZPNLERNGAKNBUDI22PIJ77TFS/action/author_attestation","sign_citation":"https://pith.science/pith/QZPNLERNGAKNBUDI22PIJ77TFS/action/citation_signature","submit_replication":"https://pith.science/pith/QZPNLERNGAKNBUDI22PIJ77TFS/action/replication_record"}},"created_at":"2026-05-17T23:38:46.177838+00:00","updated_at":"2026-05-17T23:38:46.177838+00:00"}