super hub Canonical reference

Playing Atari with Deep Reinforcement Learning

Alex Graves, Daan Wierstra, David Silver, Ioannis Antonoglou, Koray Kavukcuoglu, Volodymyr Mnih · 2013 · cs.LG · arXiv 1312.5602

Canonical reference. 83% of citing Pith papers cite this work as background.

137 Pith papers citing it

abstract

We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
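The abstract describes a network trained with a variant of Q-learning whose output estimates future rewards. As a rough illustration only, the sketch below shows the standard tabular Q-learning update that such variants build on. It is not the paper's convolutional model or training procedure. The function name `q_learning_update`, the step size `alpha`, the discount `gamma`, and the two-state toy table are all assumptions made for this example.

```python
def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error.

    q is a list of per-state lists of action values (a hypothetical toy table,
    standing in for the value function a network would approximate).
    """
    # Bootstrapped target: reward plus the discounted best next-state value.
    # Terminal transitions contribute no future value.
    best_next = 0.0 if done else max(q[next_state])
    target = reward + gamma * best_next

    # Move the current estimate a fraction alpha toward the target.
    td_error = target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Hypothetical toy problem: 2 states, 2 actions, all values start at zero.
q_table = [[0.0, 0.0], [0.0, 0.0]]
q_learning_update(q_table, state=0, action=1, reward=1.0,
                  next_state=1, done=False)
```

After the single call above, only the entry for state 0, action 1 has changed, since the update touches one state-action pair at a time. A function approximator such as a neural network would replace the table lookup with a forward pass and the in-place assignment with a gradient step on the squared TD error.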

citation-role summary

background 15 · dataset 1 · method 1 · other 1

citation-polarity summary

background 15 · unclear 2 · use method 1

representative citing papers

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

cs.AI · 2023-06-05 · conditional · novelty 8.0

LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

Decision Transformer: Reinforcement Learning via Sequence Modeling

cs.LG · 2021-06-02 · accept · novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise

math.PR · 2026-05-20 · unverdicted · novelty 7.0

Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.

Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.

TabQL: In-Context Q-Learning with Tabular Foundation Models

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

TabQL is a reinforcement learning framework that substitutes a tabular foundation model with in-context capabilities for the parametric Q-network in DQN, with a warm-up phase and theoretical analysis claiming improved sample efficiency.

ASH: Agents that Self-Hone via Embodied Learning

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.

TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency

quant-ph · 2026-05-12 · unverdicted · novelty 7.0

TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.

On-line Learning in Tree MDPs by Treating Policies as Bandit Arms

cs.AI · 2026-05-06 · unverdicted · novelty 7.0

Bandit algorithms can be adapted to Tree MDPs by treating policies as arms with shared-data confidence bounds, achieving polynomial memory and instance-dependent bounds on sample complexity and regret that depend on terminal-state gaps rather than all policies.

Replay-buffer engineering for noise-robust quantum circuit optimization

quant-ph · 2026-04-23 · unverdicted · novelty 7.0

Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compilation tasks.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

Bounded Ratio Reinforcement Learning

cs.LG · 2026-04-20 · conditional · novelty 7.0

BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.

Reinforcement Learning via Value Gradient Flow

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

Autonomous Diffractometry Enabled by Visual Reinforcement Learning

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.

SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

cs.LG · 2026-04-10 · unverdicted · novelty 7.0

SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.

Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism

cs.LG · 2025-12-04 · conditional · novelty 7.0

NEUBAY uses Bayesian posteriors over world models with long-horizon planning to match or exceed conservative offline RL methods without explicit conservatism.

Inverse Reinforcement Learning with Just Classification and a Few Regressions

cs.LG · 2025-09-25 · unverdicted · novelty 7.0

GenPQR recovers normalized rewards in maximum-entropy IRL by estimating the policy with classification and the soft Q-function with regression, providing modular finite-sample guarantees under general function approximation.

Adaptive Ensemble Aggregation for Actor-Critics

cs.LG · 2025-07-31 · unverdicted · novelty 7.0

AEA dynamically aggregates ensembles in off-policy actor-critics from training dynamics, with proofs of convergence to an error-minimizing equilibrium, bias shrinkage with ensemble size, and monotonic policy improvement.

Deep Computerized Adaptive Testing

stat.ME · 2025-02-26 · unverdicted · novelty 7.0

A multivariate Bayesian IRT CAT framework accelerated by direct sampling and optimized with double deep Q-learning for non-myopic item selection.

Acoustics-based Active Control of Unsteady Flow Dynamics using Reinforcement Learning Driven Synthetic Jets

physics.flu-dyn · 2023-12-27 · unverdicted · novelty 7.0

A DRL agent uses far-field acoustic measurements from a hydrophone array as its sole feedback to drive synthetic jets on a cylinder, achieving up to 9.5% noise reduction and 23.8% drag reduction at Re=100.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Voyager: An Open-Ended Embodied Agent with Large Language Models

cs.AI · 2023-05-25 · unverdicted · novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills

citing papers explorer

Showing 50 of 137 citing papers.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models cs.CV · 2026-04-05 · unverdicted · none · ref 25 · internal anchor
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning cs.AI · 2023-06-05 · conditional · none · ref 46 · internal anchor
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
Consistency Models cs.LG · 2023-03-02 · conditional · none · ref 42 · internal anchor
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
Decision Transformer: Reinforcement Learning via Sequence Modeling cs.LG · 2021-06-02 · accept · none · ref 42 · internal anchor
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise math.PR · 2026-05-20 · unverdicted · none · ref 106 · internal anchor
Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.
Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning cs.LG · 2026-05-20 · unverdicted · none · ref 5 · internal anchor
Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.
TabQL: In-Context Q-Learning with Tabular Foundation Models cs.LG · 2026-05-18 · unverdicted · none · ref 5 · internal anchor
TabQL is a reinforcement learning framework that substitutes a tabular foundation model with in-context capabilities for the parametric Q-network in DQN, with a warm-up phase and theoretical analysis claiming improved sample efficiency.
ASH: Agents that Self-Hone via Embodied Learning cs.AI · 2026-05-14 · unverdicted · none · ref 35 · internal anchor
ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation cs.LG · 2026-05-13 · unverdicted · none · ref 14 · internal anchor
CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency quant-ph · 2026-05-12 · unverdicted · none · ref 49 · internal anchor
TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
On-line Learning in Tree MDPs by Treating Policies as Bandit Arms cs.AI · 2026-05-06 · unverdicted · none · ref 37 · internal anchor
Bandit algorithms can be adapted to Tree MDPs by treating policies as arms with shared-data confidence bounds, achieving polynomial memory and instance-dependent bounds on sample complexity and regret that depend on terminal-state gaps rather than all policies.
Replay-buffer engineering for noise-robust quantum circuit optimization quant-ph · 2026-04-23 · unverdicted · none · ref 31 · internal anchor
Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compilation tasks.
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks cs.AI · 2026-04-22 · unverdicted · none · ref 18 · internal anchor
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
Bounded Ratio Reinforcement Learning cs.LG · 2026-04-20 · conditional · none · ref 16 · internal anchor
BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
Reinforcement Learning via Value Gradient Flow cs.LG · 2026-04-15 · unverdicted · none · ref 44 · internal anchor
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
Autonomous Diffractometry Enabled by Visual Reinforcement Learning cs.LG · 2026-04-13 · unverdicted · none · ref 45 · internal anchor
A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning cs.LG · 2026-04-10 · unverdicted · none · ref 28 · internal anchor
SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism cs.LG · 2025-12-04 · conditional · none · ref 9 · internal anchor
NEUBAY uses Bayesian posteriors over world models with long-horizon planning to match or exceed conservative offline RL methods without explicit conservatism.
Inverse Reinforcement Learning with Just Classification and a Few Regressions cs.LG · 2025-09-25 · unverdicted · none · ref 22 · internal anchor
GenPQR recovers normalized rewards in maximum-entropy IRL by estimating the policy with classification and the soft Q-function with regression, providing modular finite-sample guarantees under general function approximation.
Adaptive Ensemble Aggregation for Actor-Critics cs.LG · 2025-07-31 · unverdicted · none · ref 23 · internal anchor
AEA dynamically aggregates ensembles in off-policy actor-critics from training dynamics, with proofs of convergence to an error-minimizing equilibrium, bias shrinkage with ensemble size, and monotonic policy improvement.
Deep Computerized Adaptive Testing stat.ME · 2025-02-26 · unverdicted · none · ref 41 · internal anchor
A multivariate Bayesian IRT CAT framework accelerated by direct sampling and optimized with double deep Q-learning for non-myopic item selection.
Acoustics-based Active Control of Unsteady Flow Dynamics using Reinforcement Learning Driven Synthetic Jets physics.flu-dyn · 2023-12-27 · unverdicted · none · ref 45 · internal anchor
A DRL agent uses far-field acoustic measurements from a hydrophone array as its sole feedback to drive synthetic jets on a cylinder, achieving up to 9.5% noise reduction and 23.8% drag reduction at Re=100.
Learning Interactive Real-World Simulators cs.AI · 2023-10-09 · conditional · none · ref 164 · internal anchor
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Voyager: An Open-Ended Embodied Agent with Large Language Models cs.AI · 2023-05-25 · unverdicted · none · ref 33 · internal anchor
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills
Dota 2 with Large Scale Deep Reinforcement Learning cs.LG · 2019-12-13 · accept · none · ref 3 · internal anchor
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
Language Models as Knowledge Bases? cs.CL · 2019-09-03 · accept · none · ref 300 · internal anchor
BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.
Benchmarking Model-Based Reinforcement Learning cs.LG · 2019-07-03 · accept · none · ref 34 · internal anchor
Introduces a benchmark suite of over 18 MBRL environments, evaluates multiple algorithms under consistent settings, and identifies three core challenges: dynamics bottleneck, planning horizon dilemma, and early-termination dilemma.
Finding Needles in a Moving Haystack: Prioritizing Alerts with Adversarial Reinforcement Learning cs.CR · 2019-06-20 · unverdicted · none · ref 28 · internal anchor
Adversarial RL approximates a game-theoretic equilibrium to yield a stochastic policy for prioritizing alerts against adaptive attackers in fraud and intrusion detection.
Exploring Model-based Planning with Policy Networks cs.LG · 2019-06-20 · unverdicted · none · ref 28 · internal anchor
POPLIN combines policy networks with model-predictive planning by optimizing either action sequences or policy parameters, yielding 3x better sample efficiency than PETS, TD3 and SAC on MuJoCo locomotion tasks.
Soft Actor-Critic Algorithms and Applications cs.LG · 2018-12-13 · unverdicted · none · ref 9 · internal anchor
SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor cs.LG · 2018-01-04 · accept · none · ref 17 · internal anchor
Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.
Deep reinforcement learning from human preferences stat.ML · 2017-06-12 · accept · none · ref 9 · internal anchor
Reinforcement learning agents solve complex tasks without access to the reward function by training a reward predictor from human comparisons of trajectory segments, requiring feedback on less than 1% of interactions.
Continuous control with deep reinforcement learning cs.LG · 2015-09-09 · accept · none · ref 7 · internal anchor
DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competitively with full-information planning methods.
Understanding Goal Generalisation in Sequential Reinforcement Learning cs.LG · 2026-05-22 · unverdicted · none · ref 42 · internal anchor
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
Goal-Conditioned Agents that Learn Everything All at Once cs.LG · 2026-05-22 · unverdicted · none · ref 55 · internal anchor
LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.
Enhanced Reinforcement Learning-based Process Synthesis via Quantum Computing quant-ph · 2026-05-20 · unverdicted · none · ref 40 · internal anchor
Quantum RL variants with state encoding solve moderate-scale flowsheet synthesis problems competitively with classical RL on per-episode performance and more efficiently per parameter.
Reinforcement Learning Assisted Quantum Simulation of Many-Body Excited States and Real-Time Dynamics quant-ph · 2026-05-18 · unverdicted · none · ref 52 · internal anchor
The work generalizes RL-CQE to excited states and time evolution via adaptive operator selection and a constant-scaling ansatz, reporting chemical accuracy on chemical systems with compact representations.
CA2: Code-Aware Agent for Automated Game Testing cs.SE · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.
Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy cs.LG · 2026-05-13 · unverdicted · none · ref 8 · internal anchor
Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.
Discrete Flow Matching for Offline-to-Online Reinforcement Learning cs.LG · 2026-05-12 · unverdicted · none · ref 15 · internal anchor
DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.
DelAC: A Multi-agent Reinforcement Learning of Team-Symmetric Stochastic Games cs.MA · 2026-05-11 · unverdicted · none · ref 21 · internal anchor
Team-symmetric games always have team-symmetric Nash equilibria solvable via linear complementarity problems, and the DelAC actor-critic MARL algorithm outperforms existing methods in simulations.
Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning cs.LG · 2026-05-10 · unverdicted · none · ref 10 · internal anchor
Plan2Cleanse frames RL backdoor detection as a Monte Carlo planning problem to achieve over 61 percentage point gains in trigger detection and improved win rates in competitive environments.
Learning the Preferences of a Learning Agent cs.AI · 2026-05-09 · unverdicted · none · ref 10 · internal anchor
Formalizes preference learning from a no-regret or Boltzmann-converging learner with theoretical guarantees or impossibility results for IRL algorithms.
Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models cs.LG · 2026-05-06 · unverdicted · none · ref 20 · internal anchor
Counter-Dyna reduces RL training data for HVAC control to five weeks by using counterfactual surrogate models that ignore uncontrollable variables like weather and prices.
Quantile Geometry Regularization for Distributional Reinforcement Learning cs.LG · 2026-05-05 · unverdicted · none · ref 4 · internal anchor
RQIQN introduces a Wasserstein DRO-based correction to Bellman quantile targets that enlarges distributional spread without altering risk-neutral averages.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management cs.LG · 2026-05-04 · unverdicted · none · ref 76 · internal anchor
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL cs.LG · 2026-05-03 · unverdicted · none · ref 114 · internal anchor
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
Towards Real-time Control of a CartPole System on a Quantum Computer quant-ph · 2026-05-03 · unverdicted · none · ref 35 · internal anchor
A single-qubit quantum reinforcement learning agent solves CartPole faster than classical networks and quantifies shot-count versus control-frequency requirements for real-time closed-loop control on NISQ hardware, including direct electronics programming to reduce latency.
AutoREC: A software platform for developing reinforcement learning agents for equivalent circuit model generation from electrochemical impedance spectroscopy data cs.LG · 2026-04-29 · unverdicted · none · ref 2 · internal anchor
AutoREC uses a Double Deep Q-Network agent to generate equivalent circuit models from EIS data, reporting over 99.6% success on synthetic sets and generalization to experimental battery, corrosion, and catalysis data.
Improving Zero-Shot Offline RL via Behavioral Task Sampling cs.AI · 2026-04-28 · unverdicted · none · ref 15 · internal anchor
Extracting task vectors from the offline dataset for policy training improves zero-shot offline RL performance by an average of 20% over random sampling baselines.
