OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
super hub Canonical reference
Playing Atari with Deep Reinforcement Learning
Canonical reference. 83% of citing Pith papers cite this work as background.
abstract
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
authors
co-cited works
representative citing papers
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.
Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.
TabQL is a reinforcement learning framework that substitutes a tabular foundation model with in-context capabilities for the parametric Q-network in DQN, with a warm-up phase and theoretical analysis claiming improved sample efficiency.
CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
Bandit algorithms can be adapted to Tree MDPs by treating policies as arms with shared-data confidence bounds, achieving polynomial memory and instance-dependent bounds on sample complexity and regret that depend on terminal-state gaps rather than all policies.
Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compilation tasks.
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
NEUBAY uses Bayesian posteriors over world models with long-horizon planning to match or exceed conservative offline RL methods without explicit conservatism.
GenPQR recovers normalized rewards in maximum-entropy IRL by estimating the policy with classification and the soft Q-function with regression, providing modular finite-sample guarantees under general function approximation.
AEA dynamically aggregates ensembles in off-policy actor-critics from training dynamics, with proofs of convergence to an error-minimizing equilibrium, bias shrinkage with ensemble size, and monotonic policy improvement.
A multivariate Bayesian IRT CAT framework accelerated by direct sampling and optimized with double deep Q-learning for non-myopic item selection.
A DRL agent uses far-field acoustic measurements from a hydrophone array as its sole feedback to drive synthetic jets on a cylinder, achieving up to 9.5% noise reduction and 23.8% drag reduction at Re=100.
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing技能
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
citing papers explorer
-
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.