arxiv: 1511.05952 · v4 · pith:MZHKTPI4new · submitted 2015-11-18 · 💻 cs.LG

Prioritized Experience Replay

classification 💻 cs.LG
keywords replay, experience, prioritized, transitions, games, learning, reinforcement, were

Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. In this paper we develop a framework for prioritizing experience, so as to replay important transitions more frequently, and therefore learn more efficiently. We use prioritized experience replay in Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved human-level performance across many Atari games. DQN with prioritized experience replay achieves a new state-of-the-art, outperforming DQN with uniform replay on 41 out of 49 games.
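The contrast the abstract draws between uniform and prioritized replay can be illustrated with a minimal sketch. The abstract does not say how a transition's priority is computed, so the sketch takes priorities as given numbers. The function names and the `alpha` exponent (which interpolates between uniform sampling at 0 and fully priority-proportional sampling at 1) are illustrative assumptions, not details taken from the text above.

```python
import random


def sample_uniform(num_transitions, batch_size, rng=None):
    """Uniform replay: every stored transition is equally likely to be drawn."""
    rng = rng or random.Random()
    return [rng.randrange(num_transitions) for _ in range(batch_size)]


def sample_prioritized(priorities, batch_size, alpha=1.0, rng=None):
    """Prioritized replay (sketch): draw index i with probability
    proportional to priorities[i] ** alpha.

    `priorities` is a hypothetical list of non-negative scores, one per
    stored transition; at least one must be positive.
    """
    rng = rng or random.Random()
    weights = [p ** alpha for p in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=batch_size)
```

With priorities `[1.0, 9.0]` and `alpha=1.0`, the second transition would be drawn about nine times as often as the first, whereas `sample_uniform` would draw both equally often.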

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    cs.CL 2023-09 unverdicted novelty 8.0

    Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

  2. Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    ATD(λ) adapts TD(λ) in MARL via a density ratio estimator on past/current replay buffers to assign λ per state-action pair, yielding competitive or better results than fixed-λ QMIX and MAPPO on SMAC and Gfootball.

  3. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  4. Disagreement-Regularized Importance Sampling for Adversarial Label Corruption

    cs.LG 2026-05 unverdicted novelty 7.0

    DR-IS selects low-contamination subsets via bounded rank-disagreement in proxy ensembles under an ε-contamination model, with O(√(log(N/δ)/K)) concentration rates that certify separation when the expectation gap Δ' is...

  5. Replay-buffer engineering for noise-robust quantum circuit optimization

    quant-ph 2026-04 unverdicted novelty 7.0

    Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compila...

  6. Self-Organizing Dual-Buffer Adaptive Clustering Experience Replay (SODACER) for Safe Reinforcement Learning in Optimal Control

    eess.SY 2026-01 unverdicted novelty 7.0

    SODACER uses fast and slow buffers with adaptive clustering for experience replay in safe RL, integrated with CBFs and Sophia optimizer to achieve faster convergence and safety on nonlinear systems like HPV transmission.

  7. RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

    cs.CV 2026-01 unverdicted novelty 7.0

    RL-AWB uses reinforcement learning to optimize parameters of a statistical white-balance estimator for nighttime scenes and reports better generalization on a new multi-sensor dataset.

  8. Preemptive Solving of Future Problems: Multitask Preplay in Humans and Machines

    cs.LG 2025-07 unverdicted novelty 7.0

    Multitask Preplay replays experience from pursued tasks as starting points for counterfactual simulation of unpursued tasks to learn predictive representations that support fast generalization in humans and machines.

  9. Act in Collusion: Distributed Multi-Target Backdoor Attacks in Federated Learning

    cs.CV 2024-11 unverdicted novelty 7.0

    DMBA maintains attack success rates above 80% for all backdoors in a distributed multi-target FL setting where baselines drop below 50%.

  10. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  11. Mastering Atari with Discrete World Models

    cs.LG 2020-10 accept novelty 7.0

    DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

  12. Finding Needles in a Moving Haystack: Prioritizing Alerts with Adversarial Reinforcement Learning

    cs.CR 2019-06 unverdicted novelty 7.0

    Adversarial RL approximates a game-theoretic equilibrium to yield a stochastic policy for prioritizing alerts against adaptive attackers in fraud and intrusion detection.

  13. Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming

    cs.AI 2026-05 unverdicted novelty 6.0

    IBTS framework uses influence shaping to improve zero-shot human-machine teaming beyond partner diversity alone, with gains shown in Overcooked-AI simulations and a 30-subject human study.

  14. Error whitening: Why Gauss-Newton outperforms Newton

    cs.LG 2026-05 conditional novelty 6.0

    Gauss-Newton descent whitens errors by projecting Newton directions or gradients onto the tangent space, replacing JJ^T with the identity and removing parameterization distortions that affect Newton descent.

  15. When Does Non-Uniform Replay Matter in Reinforcement Learning?

    cs.LG 2026-05 unverdicted novelty 6.0

    Non-uniform replay helps off-policy RL mainly at low replay volumes, high-entropy sampling matters even at similar recency, and Truncated Geometric replay offers a low-overhead practical solution.

  16. Experience Constrained Hierarchical Federated Reinforcement Learning for Large-scale UAV Teams in Hazardous Environments

    cs.LG 2026-05 unverdicted novelty 6.0

    In experience-constrained federated RL for UAVs, learning performance depends primarily on experience reuse and minibatch size rather than the number of participating learners.

  17. AutoREC: A software platform for developing reinforcement learning agents for equivalent circuit model generation from electrochemical impedance spectroscopy data

    cs.LG 2026-04 unverdicted novelty 6.0

    AutoREC uses a Double Deep Q-Network agent to generate equivalent circuit models from EIS data, reporting over 99.6% success on synthetic sets and generalization to experimental battery, corrosion, and catalysis data.

  18. Preventing Latent Rehearsal Decay in Online Continual SSL with SOLAR

    cs.LG 2026-04 unverdicted novelty 6.0

    SOLAR prevents latent rehearsal decay in online continual SSL by adaptively managing replay buffers with deviation proxies and an explicit overlap loss, delivering both fast convergence and state-of-the-art final accu...

  19. Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training

    cs.LG 2026-04 conditional novelty 6.0

    Data Warmup accelerates diffusion training on ImageNet by scheduling images from low to high complexity via a foreground-based metric and temperature-controlled sampler, improving FID and IS scores faster than uniform...

  20. Continual Reinforcement Learning with Diversity Exploration and Adversarial Self-Correction

    cs.LG 2019-06 unverdicted novelty 6.0

    CDAN framework uses diversity exploration and adversarial self-correction for continual RL in continuous control, evaluated on new CAM environment with NSD metric showing 18.35% NSD improvement over baseline.

  21. DeepMind Control Suite

    cs.AI 2018-01 accept novelty 6.0

    The DeepMind Control Suite supplies a standardized collection of continuous control tasks with interpretable rewards for benchmarking reinforcement learning agents.

  22. Implicit Action Chunking for Smooth Continuous Control

    cs.RO 2026-05 unverdicted novelty 5.0

    Dual-Window Smoothing uses an execution window for deterministic smoothness and a value window to correct critic bias, plus a first-order temporal regularizer, to achieve smoother RL control than explicit chunking or ...

  23. When Does Non-Uniform Replay Matter in Reinforcement Learning?

    cs.LG 2026-05 unverdicted novelty 5.0

    Non-uniform replay helps most when replay volume is low; high-entropy sampling remains important, and a truncated geometric distribution delivers better sample efficiency with negligible overhead.

  24. When Does Non-Uniform Replay Matter in Reinforcement Learning?

    cs.LG 2026-05 unverdicted novelty 5.0

    Non-uniform replay improves RL sample efficiency mainly in low replay-volume regimes, with high-entropy sampling being key even at comparable recency.

  25. Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

    cs.LG 2026-04 unverdicted novelty 5.0

    QDHUAC is a distributional, target-free QD-RL method that enables stable high-UTD training and competitive performance on Brax locomotion tasks using far fewer environment steps than prior approaches.

  26. Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production

    cs.AI 2026-04 unverdicted novelty 5.0

    PF-CD3Q uses online particle filtering to estimate fatigue parameters and constrains a deep Q-learning agent to solve fatigue-aware human-robot task planning as a CMDP.

  27. Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents

    cs.CR 2025-05 unverdicted novelty 5.0

    A multi-layer framework combining POMDP-level strategic analysis and policy-level Q-value/PER tracking to explain RL-based cyber attacker behavior in simulated environments.

  28. Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

    cs.LG 2025-04 unverdicted novelty 5.0

    PODS applies max-variance down-sampling to GRPO rollouts in LLM RLVR, delivering at least 1.7x faster training to peak test accuracy on reasoning benchmarks.

  29. Reinforcement Learning for Testing Interdependent Requirements in Autonomous Vehicles: An Empirical Study

    cs.SE 2025-02 unverdicted novelty 5.0

    MORL generates more diverse requirement-violation scenarios while SORL produces higher-severity violations when testing interdependent requirements in an end-to-end AV controller.

  30. Intrinsic Motivation Driven Intuitive Physics Learning using Deep Reinforcement Learning with Intrinsic Reward Normalization

    cs.LG 2019-07 unverdicted novelty 5.0

    Graphical physics network integrated with DRL and intrinsic reward normalization lets an agent improve its intuitive physics model via intrinsic motivation in stationary and non-stationary 3D environments.

  31. Multi-Agent Deep Reinforcement Learning for Liquidation Strategy Analysis

    q-fin.TR 2019-06 unverdicted novelty 5.0

    The authors extend the Almgren-Chriss model to a multi-agent setting and apply deep reinforcement learning to simulate and optimize liquidation strategies under practical constraints.

  32. Rainbow Deep Q-Learning with Kinematics-Aware Design for Cooperative Delta and 3-RRS Parallel Robot Insertion

    cs.RO 2026-05 unverdicted novelty 4.0

    Rainbow DQN with kinematics-aware design optimization enables reliable cooperative insertion by Delta and 3-RRS robots in a high-fidelity simulator.

  33. Learning to Reason at the Frontier of Learnability

    cs.LG 2025-02 unverdicted novelty 4.0

    A curriculum sampling questions with high variance in success rate improves reinforcement learning performance for LLM reasoning tasks.

  34. Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    cs.CV 2025-02 unverdicted novelty 4.0

    Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

  35. Convolutional Reservoir Computing for World Models

    cs.LG 2019-07 unverdicted novelty 4.0

    RCRC uses untrained random CNNs and reservoir computing plus evolution strategies to reach claimed state-of-the-art scores in reinforcement learning tasks while avoiding data storage and heavy training.

  36. A Dual Memory Structure for Efficient Use of Replay Memory in Deep Reinforcement Learning

    cs.LG 2019-07 unverdicted novelty 4.0

    Dual memory (main plus cache) for replay memory in DRL yields higher scores than single memory across three Gym environments.

  37. In Hindsight: A Smooth Reward for Steady Exploration

    cs.LG 2019-06 unverdicted novelty 4.0

    Adding a hindsight factor that integrates historic temporal differences into the Q-learning loss reduces overestimation and yields higher average scores than DQN, DDQN and dueling networks on ATARI games after 10 mill...

  38. A Deep Reinforcement Learning Approach for Global Routing

    cs.LG 2019-06 unverdicted novelty 4.0

    Deep RL agent trained on generated global routing instances outperforms sequential A* search.

  39. XekRung Technical Report

    cs.CR 2026-04 unverdicted novelty 3.0

    XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

  40. On Multi-Agent Learning in Team Sports Games

    cs.MA 2019-06 unverdicted novelty 3.0

    Describes a hierarchical RL method for multi-agent learning in team sports games aiming for human-like agents, reporting preliminary results that show promise.

  41. Optimal Use of Experience in First Person Shooter Environments

    cs.LG 2019-06 unverdicted novelty 2.0

    Empirical tests in VizDoom show multiple DQN updates per step do not improve performance after learning rate adjustment, with a 4:1 update-to-step ratio optimal before significant degradation.