Playing Atari with Deep Reinforcement Learning

Canonical reference. 83% of citing Pith papers cite this work as background.

178 Pith papers citing it
Background 83% of classified citations
abstract

We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
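
The abstract names a "variant of Q-learning" whose network outputs a value estimate per action. As a rough illustration only, the standard Q-learning regression target can be sketched as below. The function name `q_learning_targets`, the discount `gamma`, and the list-based inputs are assumptions made for this sketch; they are not taken from the paper, and the sketch leaves out the convolutional network and the rest of the training procedure.

```python
def q_learning_targets(rewards, next_q_values, dones, gamma=0.99):
    """Return one Q-learning target per transition.

    Target: y = r + gamma * max_a' Q(s', a'), or y = r when the
    transition ends the episode (no future reward to bootstrap from).

    rewards:        list of floats, one per transition
    next_q_values:  list of lists, estimated Q(s', a') for every action a'
    dones:          list of bools, True if s' is terminal
    gamma:          discount factor (hypothetical default)
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets


# Two toy transitions; gamma=0.5 is chosen only to keep the arithmetic simple.
example = q_learning_targets(
    rewards=[1.0, 0.0],
    next_q_values=[[0.5, 2.0], [1.0, 3.0]],
    dones=[False, True],
    gamma=0.5,
)
print(example)  # [2.0, 0.0]
```

In the first transition the target is 1.0 + 0.5 * 2.0 = 2.0; in the second, the terminal flag suppresses the bootstrap term, so the target is just the reward.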

citation-role summary

background 15 · dataset 1 · method 1 · other 1

representative citing papers

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

What Type of Inference is Active Inference?

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.

Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

A training-free survival regression approach uses tabular foundation models to build an accelerated failure time model and iteratively impute right-censored data with a non-parametric in-context estimator, matching the performance of trained Cox and parametric AFT models on benchmarks.

TabQL: In-Context Q-Learning with Tabular Foundation Models

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

TabQL is a reinforcement learning framework that substitutes a tabular foundation model with in-context capabilities for the parametric Q-network in DQN, with a warm-up phase and theoretical analysis claiming improved sample efficiency.

On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

cs.AI · 2026-05-06 · unverdicted · novelty 7.0

Bandit algorithms can be adapted to Tree MDPs by treating policies as arms with shared-data confidence bounds, achieving polynomial memory and instance-dependent bounds on sample complexity and regret that depend on terminal-state gaps rather than all policies.

Replay-buffer engineering for noise-robust quantum circuit optimization

quant-ph · 2026-04-23 · unverdicted · novelty 7.0

Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compilation tasks.

Bounded Ratio Reinforcement Learning

cs.LG · 2026-04-20 · conditional · novelty 7.0

BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.

citing papers explorer

Showing 50 of 178 citing papers.

  • A Differentiable Atari VCS: A Complex, Fully Known Ground Truth for Explainable AI cs.AI · 2026-06-21 · conditional · none · ref 12 · internal anchor

    Differentiable reimplementations of the Atari VCS provide a complex, fully known ground-truth system for testing gradient-based explainable AI methods.

  • OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models cs.CV · 2026-04-05 · unverdicted · none · ref 25 · internal anchor

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  • LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning cs.AI · 2023-06-05 · conditional · none · ref 46 · internal anchor

    LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.

  • Consistency Models cs.LG · 2023-03-02 · conditional · none · ref 42 · internal anchor

    Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

  • Decision Transformer: Reinforcement Learning via Sequence Modeling cs.LG · 2021-06-02 · accept · none · ref 42 · internal anchor

    Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

  • DecompRL: Solving Harder Problems by Learning Modular Code Generation cs.LG · 2026-07-02 · unverdicted · none · ref 41 · internal anchor

    DecompRL is an RL method that learns modular code decomposition for LLMs, enabling exponential candidate generation via recombination to solve harder coding problems with lower GPU cost.

  • PolicyGuard: Towards Test-time and Step-level Adversary (Backdoor) Defense for Reinforcement Learning Agent cs.LG · 2026-06-11 · unverdicted · none · ref 13 · internal anchor

    PolicyGuard provides a test-time step-level defense against backdoor attacks in RL using GP posterior variance, showing high detection AUROC on seven games.

  • Expected Free Energy-based Planning as Variational Inference cs.AI · 2026-06-09 · unverdicted · none · ref 206 · internal anchor

    EFE-based planning is formulated as variational free energy minimization with epistemic priors, decomposing into expected plan costs plus a complexity term.

  • What Type of Inference is Active Inference? cs.AI · 2026-06-03 · unverdicted · none · ref 224 · internal anchor

    EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.

  • When Offline Selectors Cannot Beat the Best Single Model: A Diagnostic Study on edX Dropout Prediction cs.LG · 2026-06-02 · conditional · none · ref 9 · internal anchor

    A three-stage diagnostic on edX data shows offline selectors (BC, DQN, CQL) fail to reach oracle performance due to local representational ambiguity rather than learner mismatch or label shift.

  • Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models cs.LG · 2026-06-02 · unverdicted · none · ref 11 · internal anchor

    A training-free survival regression approach uses tabular foundation models to build an accelerated failure time model and iteratively impute right-censored data with a non-parametric in-context estimator, matching the performance of trained Cox and parametric AFT models on benchmarks.

  • Task-Induced Representational Invariances Depend on Learning Objective in Deep RL cs.LG · 2026-06-01 · unverdicted · none · ref 38 · internal anchor

    In navigation tasks, DQN learns MDP-homomorphism-invariant representations while PPO learns action-symmetric ones despite comparable performance, with effects on transfer and in LLMs.

  • OPD+: Rethinking the Advantage Design for On-Policy Distillation cs.LG · 2026-05-31 · unverdicted · none · ref 13 · internal anchor

    OPD+ removes the bias from stop-gradient in on-policy distillation by deriving correct gradients for f-divergences, outperforming standard KL-based methods on math reasoning and tool-use tasks.

  • Word Class Representations Spontaneously Emerge from Successor Representations Trained on Natural Language cs.CL · 2026-05-23 · unverdicted · none · ref 2 · internal anchor

    Successor representation training on natural language causes part-of-speech categories to emerge spontaneously in the learned embeddings, with structure varying by predictive horizon.

  • Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise math.PR · 2026-05-20 · unverdicted · none · ref 106 · internal anchor

    Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.

  • Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning cs.LG · 2026-05-20 · unverdicted · none · ref 5 · internal anchor

    Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.

  • TabQL: In-Context Q-Learning with Tabular Foundation Models cs.LG · 2026-05-18 · unverdicted · none · ref 5 · internal anchor

    TabQL is a reinforcement learning framework that substitutes a tabular foundation model with in-context capabilities for the parametric Q-network in DQN, with a warm-up phase and theoretical analysis claiming improved sample efficiency.

  • Access Timing as Scaffolding: A Reinforcement Learning Approach to GenAI in Education cs.CY · 2026-05-15 · unverdicted · none · ref 87 · 2 links · internal anchor

    A reinforcement learning agent for timing GenAI access improved post-test performance and metacognitive accuracy over unrestricted or fully restricted conditions in a lab study with 105 students.

  • Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation cs.LG · 2026-05-13 · unverdicted · none · ref 14 · internal anchor

    CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.

  • TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency quant-ph · 2026-05-12 · unverdicted · none · ref 49 · internal anchor

    TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.

  • On-line Learning in Tree MDPs by Treating Policies as Bandit Arms cs.AI · 2026-05-06 · unverdicted · none · ref 37 · internal anchor

    Bandit algorithms can be adapted to Tree MDPs by treating policies as arms with shared-data confidence bounds, achieving polynomial memory and instance-dependent bounds on sample complexity and regret that depend on terminal-state gaps rather than all policies.

  • Replay-buffer engineering for noise-robust quantum circuit optimization quant-ph · 2026-04-23 · unverdicted · none · ref 31 · internal anchor

    Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compilation tasks.

  • Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks cs.AI · 2026-04-22 · unverdicted · none · ref 18 · internal anchor

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  • Bounded Ratio Reinforcement Learning cs.LG · 2026-04-20 · conditional · none · ref 16 · internal anchor

    BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.

  • Reinforcement Learning via Value Gradient Flow cs.LG · 2026-04-15 · unverdicted · none · ref 44 · internal anchor

    VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

  • Autonomous Diffractometry Enabled by Visual Reinforcement Learning cs.LG · 2026-04-13 · unverdicted · none · ref 45 · internal anchor

    A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.

  • SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning cs.LG · 2026-04-10 · unverdicted · none · ref 28 · internal anchor

    SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.

  • Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism cs.LG · 2025-12-04 · conditional · none · ref 9 · internal anchor

    NEUBAY uses Bayesian posteriors over world models with long-horizon planning to match or exceed conservative offline RL methods without explicit conservatism.

  • Inverse Reinforcement Learning with Just Classification and a Few Regressions cs.LG · 2025-09-25 · unverdicted · none · ref 22 · internal anchor

    GenPQR recovers normalized rewards in maximum-entropy IRL by estimating the policy with classification and the soft Q-function with regression, providing modular finite-sample guarantees under general function approximation.

  • Adaptive Ensemble Aggregation for Actor-Critics cs.LG · 2025-07-31 · unverdicted · none · ref 23 · internal anchor

    AEA dynamically aggregates ensembles in off-policy actor-critics from training dynamics, with proofs of convergence to an error-minimizing equilibrium, bias shrinkage with ensemble size, and monotonic policy improvement.

  • Deep Computerized Adaptive Testing stat.ME · 2025-02-26 · unverdicted · none · ref 41 · internal anchor

    A multivariate Bayesian IRT CAT framework accelerated by direct sampling and optimized with double deep Q-learning for non-myopic item selection.

  • Acoustics-based Active Control of Unsteady Flow Dynamics using Reinforcement Learning Driven Synthetic Jets physics.flu-dyn · 2023-12-27 · unverdicted · none · ref 45 · internal anchor

    A DRL agent uses far-field acoustic measurements from a hydrophone array as its sole feedback to drive synthetic jets on a cylinder, achieving up to 9.5% noise reduction and 23.8% drag reduction at Re=100.

  • Learning Interactive Real-World Simulators cs.AI · 2023-10-09 · conditional · none · ref 164 · internal anchor

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  • Voyager: An Open-Ended Embodied Agent with Large Language Models cs.AI · 2023-05-25 · unverdicted · none · ref 33 · internal anchor

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills

  • Dota 2 with Large Scale Deep Reinforcement Learning cs.LG · 2019-12-13 · accept · none · ref 3 · internal anchor

    OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

  • Language Models as Knowledge Bases? cs.CL · 2019-09-03 · accept · none · ref 300 · internal anchor

    BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.

  • Benchmarking Model-Based Reinforcement Learning cs.LG · 2019-07-03 · accept · none · ref 34 · internal anchor

    Introduces a benchmark suite of over 18 MBRL environments, evaluates multiple algorithms under consistent settings, and identifies three core challenges: dynamics bottleneck, planning horizon dilemma, and early-termination dilemma.

  • Finding Needles in a Moving Haystack: Prioritizing Alerts with Adversarial Reinforcement Learning cs.CR · 2019-06-20 · unverdicted · none · ref 28 · internal anchor

    Adversarial RL approximates a game-theoretic equilibrium to yield a stochastic policy for prioritizing alerts against adaptive attackers in fraud and intrusion detection.

  • Exploring Model-based Planning with Policy Networks cs.LG · 2019-06-20 · unverdicted · none · ref 28 · internal anchor

    POPLIN combines policy networks with model-predictive planning by optimizing either action sequences or policy parameters, yielding 3x better sample efficiency than PETS, TD3 and SAC on MuJoCo locomotion tasks.

  • Soft Actor-Critic Algorithms and Applications cs.LG · 2018-12-13 · unverdicted · none · ref 9 · internal anchor

    SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.

  • Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor cs.LG · 2018-01-04 · accept · none · ref 17 · internal anchor

    Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.

  • Deep reinforcement learning from human preferences stat.ML · 2017-06-12 · accept · none · ref 9 · internal anchor

    Reinforcement learning agents solve complex tasks without access to the reward function by training a reward predictor from human comparisons of trajectory segments, requiring feedback on less than 1% of interactions.

  • Continuous control with deep reinforcement learning cs.LG · 2015-09-09 · accept · none · ref 7 · internal anchor

    DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competitively with full-information planning methods.

  • One Demonstration Is Enough for Real-World Robotic Reinforcement Learning cs.RO · 2026-07-02 · unverdicted · none · ref 17 · internal anchor

    AutoSERL achieves strong performance on six real-world robot manipulation tasks using RL guided by a single demonstration via sliding-window intervention, safety recovery, and automatic termination.

  • Coachable agents for interactive gameplay cs.AI · 2026-07-01 · unverdicted · none · ref 36 · internal anchor

    A framework combining universal value function approximators with targeted training scenarios and data augmentation produces RL agents that adapt to user-specified styles in real time across video games and humanoid domains while preserving core task performance.

  • Failure-Based Testing for Deep Reinforcement Learning Agents cs.SE · 2026-06-30 · unverdicted · none · ref 27 · internal anchor

    Proposes Prior Random Testing (PRT) that leverages task difficulty to prioritize failure-prone test cases for DRL agents, achieving over 50% lower testing cost than random testing while preserving diversity on four benchmarks.

  • ReLaTS: a Reinforcement Learning-based method for dynamically determining the coupling Time Step in multi-scale simulations of self-gravitating systems astro-ph.IM · 2026-06-18 · unverdicted · none · ref 14 · internal anchor

    ReLaTS uses reinforcement learning to dynamically choose coupling timesteps in multi-scale self-gravitating simulations, achieving lower energy errors than fixed-timestep methods with comparable cost.

  • Reversal Q-Learning cs.LG · 2026-06-16 · unverdicted · none · ref 15 · internal anchor

    Reversal Q-Learning (RQL) proposes reversing flows for virtual trajectories and bias-variance reduction in an expanded MDP to train flow policies, reporting best average performance on 50 simulated robotic tasks versus prior flow-based offline RL methods.

  • Dmsh: A Multi-Agent Reinforcement Learning Framework for All-Quad Mesh Generation math.NA · 2026-06-09 · unverdicted · none · ref 34 · internal anchor

    Dmsh is a new multi-agent RL framework that formulates mesh generation as an MDP and uses three coordinated agents plus curriculum learning to produce globally conforming all-quad meshes without post-processing.

  • MedGym: A Unified Continuous-Time Benchmark for Dynamic Medical Treatment Reinforcement Learning cs.LG · 2026-05-31 · unverdicted · none · ref 14 · internal anchor

    MedGym introduces a continuous-time RL benchmark for medical treatment derived from clinical data via PINNs, supporting offline/online evaluation on personalization, safety, and discrete vs continuous methods.