{"work":{"id":"736a8ddf-e365-4940-ad58-4699fddedb86","openalex_id":null,"doi":null,"arxiv_id":"1312.5602","raw_key":null,"title":"Playing Atari with Deep Reinforcement Learning","authors":null,"authors_text":"Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra","year":2013,"venue":"cs.LG","abstract":"We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.","external_url":"https://arxiv.org/abs/1312.5602","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T19:51:10.749521+00:00","pith_arxiv_id":"1312.5602","created_at":"2026-05-08T18:44:01.396903+00:00","updated_at":"2026-05-25T19:51:10.749521+00:00","title_quality_ok":true,"display_title":"Playing Atari with Deep Reinforcement Learning","render_title":"Playing Atari with Deep Reinforcement Learning"},"hub":{"state":{"work_id":"736a8ddf-e365-4940-ad58-4699fddedb86","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":137,"external_cited_by_count":null,"distinct_field_count":24,"first_pith_cited_at":"2015-09-09T23:01:36+00:00","last_pith_cited_at":"2026-05-22T12:31:18+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-26T22:07:05.124083+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":15},{"context_role":"dataset","n":1},{"context_role":"method","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":15},{"context_polarity":"unclear","n":2},{"context_polarity":"use_method","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Playing Atari with Deep Reinforcement Learning","claims":[{"claim_text":"We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Str ¨umke, \"Reinforcement learning in an adapt- able chess environment for detecting human-understandable concepts,\" arXiv preprint arXiv:2211.05500 , 2022. [21] M. E. Taylor, N. Carboni, A. Fachantidis, I. Vlahavas, and L. Torrey, \"Reinforcement learning agents providing advice in complex video games,\" Connection Science, vol. 26, no. 1, pp. 45-63, 2014. [22] V . Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, \"Playing atari with deep reinforcement lea","claim_type":"other","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"threads:Synergy between RL and LLMsandLLM Agents, detailed as follows: Synergy between RL and LLMs.The second line of research investigates how reinforcement learning algorithms are applied to improve or align LLMs. A primary branch, RL for training LLMs, leverages on-policy (e.g., proximal policy optimization (PPO) [1] and Group Relative Policy Optimization (GRPO) [2]) and off-policy (e.g., actor-critic, Q-learning [3]) methods to enhance capabilities such as instruction following, ethical alig","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Detailed descriptions can be found in Appendix H.4. Atari Pong [4] is a classic arcade video game where two players control paddles to hit a ball across the screen. With raw pixel observations and competitive dynamics, the game has become a canon- ical environment in the Arcade Learning Environment (ALE) [8], which requires spatio-temporal reasoning and strategic gameplay [46, 47]. Detailed descriptions can be found in Appendix H.5. 2.4 Mixed-motive games In mixed-motive games, agents' objective","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Finally, AlphaStar's value network observed full information about the game state (including observations hidden from the policy); this method improved their training and exploring its application to Dota 2 is a promising direction for future work. Deep reinforcement learning has been successfully applied to learning control policies from high dimensional input. In 2013, Mnih et al.[3] show that it is possible to combine a deep convolutional neural network with a Q-learning algorithm[40] and a n","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Lifelong Learning Benchmarks Pioneering work has adapted standard vision or language datasets for studying LL. This line of work includes image classification datasets like MNIST [18], CIFAR [34], and ImageNet [ 17]; segmentation datasets like Core50 [ 38]; and natural language understanding datasets like GLUE [67] and SuperGLUE [59]. Besides supervised learning datasets, video game benchmarks (e.g., Atari [46], XLand [64], and VisDoom [30]) in reinforcement learning (RL) have also been used for","claim_type":"dataset","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"learning faces challenges including long training times, low sample efficiency, and stability concerns, particularly when applied in complex real-world environments [21]. Agents with transfer learning and meta learning. Traditionally, training a reinforcement learning agent requires huge sample sizes and long training time, and lacks generalization capability [ 72; 73; 74; 75; 76]. Consequently, researchers have introduced transfer learning to expedite an agent's learning on new tasks [77; 78; 7","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Playing Atari with Deep Reinforcement Learning because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (14 contexts).","role_counts":[{"n":14,"context_role":"background"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"method"},{"n":1,"context_role":"other"}]},"error":null,"updated_at":"2026-05-22T18:03:50.531399+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"86d7ce36-293b-4b1d-9886-ff6dc1068818","orcid":null,"display_name":"Volodymyr Mnih"},{"id":"9d8577f6-4451-4e27-a0d2-3e66e22c0c4f","orcid":null,"display_name":"Koray Kavukcuoglu"},{"id":"c6fd3013-f440-4f0f-a73d-da5b40715448","orcid":null,"display_name":"David Silver"},{"id":"d96aa666-85ce-48ea-bd97-e08bc711dd3f","orcid":null,"display_name":"Alex Graves"},{"id":"b2737be9-0345-4f88-b1da-d8cfe552289c","orcid":null,"display_name":"Ioannis Antonoglou"},{"id":"0aee7a1b-b92d-43ff-90b6-b7d0b61a09d0","orcid":null,"display_name":"Daan Wierstra"}]},"error":null,"updated_at":"2026-05-22T18:03:51.253885+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T11:59:57.710104+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":23},{"title":"OpenAI Gym","work_id":"6af98f3f-f074-41ae-a689-7dd7b4b8efde","shared_citers":12},{"title":"Continuous control with deep reinforcement learning","work_id":"41a65444-c819-4303-a1f1-b075aa86d40c","shared_citers":10},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":8},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":4},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":4},{"title":"Dota 2 with Large Scale Deep Reinforcement Learning","work_id":"b047dc18-e9a3-4d11-8ff6-cd59d41a6357","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"High-Dimensional Continuous Control Using Generalized Advantage Estimation","work_id":"38e3ca94-96f0-4b19-a355-0754931af8be","shared_citers":4},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":4},{"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","shared_citers":4},{"title":"Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor","work_id":"6674e5db-4e1c-49c0-b598-c108a0ecadb6","shared_citers":4},{"title":"Addressing function approximation error in actor-critic methods","work_id":"129bee39-1830-4ff5-a7c3-f8ecae60f370","shared_citers":3},{"title":"Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning","work_id":"ab561983-ab59-4f04-a11e-a467ddde4848","shared_citers":3},{"title":"A Quantum Approximate Optimization Algorithm","work_id":"5a33d9f3-407a-4c7e-a119-ff581c66b173","shared_citers":3},{"title":"AWAC: Accelerating Online Reinforcement Learning with Offline Datasets","work_id":"f0a11265-1acf-4ffc-a822-08bd04b6bddf","shared_citers":3},{"title":"Deep reinforcement learning and the deadly triad","work_id":"de214ead-4cb0-4abd-be3d-ae3389f55e9b","shared_citers":3},{"title":"E., and Levine, S","work_id":"3570547f-d4c5-4808-a945-f27a73bb7d90","shared_citers":3},{"title":"Exploration by random network distillation","work_id":"5a87fef6-96e2-4d5b-91ec-1a7c9a43cab9","shared_citers":3},{"title":"Generative Agents: Interactive Simulacra of Human Behavior","work_id":"01f7ddaa-284a-441a-be87-921aad4dc54b","shared_citers":3},{"title":"Gymnasium: A Standard Interface for Reinforcement Learning Environments","work_id":"5382dc1c-a327-49b9-afda-4794d5847698","shared_citers":3},{"title":"Prioritized experience replay","work_id":"927187c1-c50e-4ca7-b0fa-55589957731f","shared_citers":3},{"title":"Progressive Neural Networks","work_id":"0700d73f-b94d-4cd3-be40-086e4c4544c4","shared_citers":3},{"title":"robosuite: A Modular Simulation Framework and Benchmark for Robot Learning","work_id":"d616d4ba-7713-4e3e-8c9e-dfebbb8f1abf","shared_citers":3}],"time_series":[{"n":1,"year":2015},{"n":1,"year":2016},{"n":1,"year":2017},{"n":2,"year":2018},{"n":2,"year":2019},{"n":1,"year":2020},{"n":4,"year":2023},{"n":1,"year":2024},{"n":41,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T11:59:59.727389+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T12:00:02.051606+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Playing Atari with Deep Reinforcement Learning","claims":[{"claim_text":"We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Str ¨umke, \"Reinforcement learning in an adapt- able chess environment for detecting human-understandable concepts,\" arXiv preprint arXiv:2211.05500 , 2022. [21] M. E. Taylor, N. Carboni, A. Fachantidis, I. Vlahavas, and L. Torrey, \"Reinforcement learning agents providing advice in complex video games,\" Connection Science, vol. 26, no. 1, pp. 45-63, 2014. [22] V . Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, \"Playing atari with deep reinforcement lea","claim_type":"other","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"threads:Synergy between RL and LLMsandLLM Agents, detailed as follows: Synergy between RL and LLMs.The second line of research investigates how reinforcement learning algorithms are applied to improve or align LLMs. A primary branch, RL for training LLMs, leverages on-policy (e.g., proximal policy optimization (PPO) [1] and Group Relative Policy Optimization (GRPO) [2]) and off-policy (e.g., actor-critic, Q-learning [3]) methods to enhance capabilities such as instruction following, ethical alig","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Detailed descriptions can be found in Appendix H.4. Atari Pong [4] is a classic arcade video game where two players control paddles to hit a ball across the screen. With raw pixel observations and competitive dynamics, the game has become a canon- ical environment in the Arcade Learning Environment (ALE) [8], which requires spatio-temporal reasoning and strategic gameplay [46, 47]. Detailed descriptions can be found in Appendix H.5. 2.4 Mixed-motive games In mixed-motive games, agents' objective","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Finally, AlphaStar's value network observed full information about the game state (including observations hidden from the policy); this method improved their training and exploring its application to Dota 2 is a promising direction for future work. Deep reinforcement learning has been successfully applied to learning control policies from high dimensional input. In 2013, Mnih et al.[3] show that it is possible to combine a deep convolutional neural network with a Q-learning algorithm[40] and a n","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Lifelong Learning Benchmarks Pioneering work has adapted standard vision or language datasets for studying LL. This line of work includes image classification datasets like MNIST [18], CIFAR [34], and ImageNet [ 17]; segmentation datasets like Core50 [ 38]; and natural language understanding datasets like GLUE [67] and SuperGLUE [59]. Besides supervised learning datasets, video game benchmarks (e.g., Atari [46], XLand [64], and VisDoom [30]) in reinforcement learning (RL) have also been used for","claim_type":"dataset","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"learning faces challenges including long training times, low sample efficiency, and stability concerns, particularly when applied in complex real-world environments [21]. Agents with transfer learning and meta learning. Traditionally, training a reinforcement learning agent requires huge sample sizes and long training time, and lacks generalization capability [ 72; 73; 74; 75; 76]. Consequently, researchers have introduced transfer learning to expedite an agent's learning on new tasks [77; 78; 7","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Playing Atari with Deep Reinforcement Learning because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (14 contexts).","role_counts":[{"n":14,"context_role":"background"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"method"},{"n":1,"context_role":"other"}]},"error":null,"updated_at":"2026-05-22T18:03:50.533814+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Playing Atari with Deep Reinforcement Learning","claims":[{"claim_text":"We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Playing Atari with Deep Reinforcement Learning because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T11:59:50.775090+00:00"}},"summary":{"title":"Playing Atari with Deep Reinforcement Learning","claims":[{"claim_text":"We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Playing Atari with Deep Reinforcement Learning because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":23},{"title":"OpenAI Gym","work_id":"6af98f3f-f074-41ae-a689-7dd7b4b8efde","shared_citers":12},{"title":"Continuous control with deep reinforcement learning","work_id":"41a65444-c819-4303-a1f1-b075aa86d40c","shared_citers":10},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":8},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":4},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":4},{"title":"Dota 2 with Large Scale Deep Reinforcement Learning","work_id":"b047dc18-e9a3-4d11-8ff6-cd59d41a6357","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"High-Dimensional Continuous Control Using Generalized Advantage Estimation","work_id":"38e3ca94-96f0-4b19-a355-0754931af8be","shared_citers":4},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":4},{"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","shared_citers":4},{"title":"Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor","work_id":"6674e5db-4e1c-49c0-b598-c108a0ecadb6","shared_citers":4},{"title":"Addressing function approximation error in actor-critic methods","work_id":"129bee39-1830-4ff5-a7c3-f8ecae60f370","shared_citers":3},{"title":"Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning","work_id":"ab561983-ab59-4f04-a11e-a467ddde4848","shared_citers":3},{"title":"A Quantum Approximate Optimization Algorithm","work_id":"5a33d9f3-407a-4c7e-a119-ff581c66b173","shared_citers":3},{"title":"AWAC: Accelerating Online Reinforcement Learning with Offline Datasets","work_id":"f0a11265-1acf-4ffc-a822-08bd04b6bddf","shared_citers":3},{"title":"Deep reinforcement learning and the deadly triad","work_id":"de214ead-4cb0-4abd-be3d-ae3389f55e9b","shared_citers":3},{"title":"E., and Levine, S","work_id":"3570547f-d4c5-4808-a945-f27a73bb7d90","shared_citers":3},{"title":"Exploration by random network distillation","work_id":"5a87fef6-96e2-4d5b-91ec-1a7c9a43cab9","shared_citers":3},{"title":"Generative Agents: Interactive Simulacra of Human Behavior","work_id":"01f7ddaa-284a-441a-be87-921aad4dc54b","shared_citers":3},{"title":"Gymnasium: A Standard Interface for Reinforcement Learning Environments","work_id":"5382dc1c-a327-49b9-afda-4794d5847698","shared_citers":3},{"title":"Prioritized experience replay","work_id":"927187c1-c50e-4ca7-b0fa-55589957731f","shared_citers":3},{"title":"Progressive Neural Networks","work_id":"0700d73f-b94d-4cd3-be40-086e4c4544c4","shared_citers":3},{"title":"robosuite: A Modular Simulation Framework and Benchmark for Robot Learning","work_id":"d616d4ba-7713-4e3e-8c9e-dfebbb8f1abf","shared_citers":3}],"time_series":[{"n":1,"year":2015},{"n":1,"year":2016},{"n":1,"year":2017},{"n":2,"year":2018},{"n":2,"year":2019},{"n":1,"year":2020},{"n":4,"year":2023},{"n":1,"year":2024},{"n":41,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"d96aa666-85ce-48ea-bd97-e08bc711dd3f","orcid":null,"display_name":"Alex Graves","source":"manual","import_confidence":0.72},{"id":"0aee7a1b-b92d-43ff-90b6-b7d0b61a09d0","orcid":null,"display_name":"Daan Wierstra","source":"manual","import_confidence":0.72},{"id":"c6fd3013-f440-4f0f-a73d-da5b40715448","orcid":null,"display_name":"David Silver","source":"manual","import_confidence":0.72},{"id":"b2737be9-0345-4f88-b1da-d8cfe552289c","orcid":null,"display_name":"Ioannis Antonoglou","source":"manual","import_confidence":0.72},{"id":"9d8577f6-4451-4e27-a0d2-3e66e22c0c4f","orcid":null,"display_name":"Koray Kavukcuoglu","source":"manual","import_confidence":0.72},{"id":"86d7ce36-293b-4b1d-9886-ff6dc1068818","orcid":null,"display_name":"Volodymyr Mnih","source":"manual","import_confidence":0.72}]}}