Human-level control through deep reinforcement learning

Alex Graves, Andreas K. Fidjeland, Andrei A. Rusu, Charles Beattie, David Silver, Georg Ostrovski + 2 more · 2015 · Nature · DOI 10.1038/nature14236

Mixed citation behavior. The most common role is background (43% of classified citations).

58 Pith papers citing it

22.6k external citations · Crossref

citation-role summary

background 4 · method 2 · baseline 1

citation-polarity summary

background 3 · use method 2 · baseline 1 · unclear 1

authors

Alex Graves, Andreas K. Fidjeland, Andrei A. Rusu, Charles Beattie, David Silver, Georg Ostrovski, Joel Veness, Koray Kavukcuoglu, Marc G. Bellemare, Martin Riedmiller, Stig Petersen, Volodymyr Mnih

representative citing papers

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

Heavy-Ball Q-Learning with Residual Weighting Correction

cs.LG · 2026-06-25 · unverdicted · novelty 7.0

Corrected heavy-ball Q-learning with convergence and acceleration guarantees is derived via switched linear system and joint spectral radius analysis, extended to linear function approximation.

CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

cs.RO · 2026-06-10 · unverdicted · novelty 7.0

CHORUS adapts a single VLA backbone for decentralized control of diverse robot teams, achieving 64-point gains over from-scratch decentralized baselines and outperforming centralized methods in real-world tasks using only local observations.

Expected Free Energy-based Planning as Variational Inference

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

EFE-based planning is formulated as variational free energy minimization with epistemic priors, decomposing into expected plan costs plus a complexity term.

What Type of Inference is Active Inference?

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.

From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Derives an SDE describing the infinitesimal change in state distribution at each gradient step for neural actor-critic RL in continuous environments under vanishing learning rate in the infinite width limit.

Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

cs.AI · 2026-06-01 · conditional · novelty 7.0

CG-CMARL decomposes constrained multi-agent RL into pairwise coordination graphs with shared Q-functions, using Max-Sum message passing and a Lagrangian multiplier to coordinate actions and trace Pareto fronts scalably.

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

Inline Critic Steers Image Editing

cs.CV · 2026-05-12 · conditional · novelty 7.0

Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

cs.LG · 2026-02-02 · unverdicted · novelty 7.0

Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.

Variational Sequential Optimal Experimental Design using Reinforcement Learning

stat.ML · 2023-06-17 · unverdicted · novelty 7.0

vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.

Generalization in offline RL: The structure is more important than the amount of pessimism

cs.LG · 2026-07-02 · unverdicted · novelty 6.0

In offline RL, the structure of pessimism (set by dataset coverage) matters more for generalization than its amount; a symmetric overly pessimistic value function can outperform a non-symmetric mildly pessimistic one.

Episodic-to-Semantic Consolidation Without Identity Drift

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

A deterministic episodic-to-semantic consolidation function with a structural lemma proving identity invariance, demonstrated in synthetic experiments on an embodied service agent.

Parametric Open Source Games

cs.GT · 2026-06-25 · unverdicted · novelty 6.0

Introduces parametric open-source games as continuous analogues of program equilibria, proves equilibrium existence, and derives an exact coupling threshold for cooperation in symmetric 2x2 games under gradient ascent.

SMR: Scheduler with Multi-Channel Map-Encoded Reinforcement Learning for Radio Telescopes

astro-ph.IM · 2026-06-25 · unverdicted · novelty 6.0

SMR uses multi-channel map-encoded reinforcement learning to achieve roughly 10% better time utilization than greedy baselines for single-dish radio telescope scheduling.

Identifying structural design principles shaping the computational abilities of recurrent neural networks

q-bio.NC · 2026-06-22 · unverdicted · novelty 6.0

Local 2- and 3-cycles enhance RNN computational capacity for Boolean functions, predicted by structural statistics, while adding interneurons boosts large networks.

NASDAQ: Normalized Observation Space Dynamics-Augmented Q-Learning

cs.LG · 2026-06-19 · unverdicted · novelty 6.0

NASDAQ normalizes observations in an online RL setting so that dynamics prediction losses are balanced across dimensions, yielding competitive performance with lower wall-time than prior model-based and self-predictive methods.

Formalizing Task-Space Complexity for Zero-Shot Generalization

cs.LG · 2026-06-18 · unverdicted · novelty 6.0

Introduces signed divergence to bound generalization gaps and defines task-space complexity as the minimum source contexts needed for ε-coverage under local smoothness, with set-cover reduction and empirical validation on LQR and DRL systems.

Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

RL training disrupts gradient-based adversarial attacks by inducing unstable low-magnitude gradients that limit the effectiveness of methods like PGD within practical budgets.

Dmsh: A Multi-Agent Reinforcement Learning Framework for All-Quad Mesh Generation

math.NA · 2026-06-09 · unverdicted · novelty 6.0

Dmsh is a new multi-agent RL framework that formulates mesh generation as an MDP and uses three coordinated agents plus curriculum learning to produce globally conforming all-quad meshes without post-processing.

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

cs.LG · 2026-06-03 · conditional · novelty 6.0

Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.

ReviewGuard: Aligning LLM-Assisted Peer Review with Long-Term Scientific Impact

cs.DL · 2026-05-29 · unverdicted · novelty 6.0

ReviewGuard aligns LLM peer reviews with future citations via impact-aligned RL, achieving Spearman ρ=0.776 on rejected-then-published AI/ML papers versus 0.492 for human reviewers and flagging 5.6× more high-impact cases.

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

Benchmark study finds calibrated rule-based controller outperforms six DRL algorithms on cost for adaptive resource control across workloads, with action-space mismatch explaining large differences in constraint violations.

Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

cs.LG · 2026-05-25 · unverdicted · novelty 6.0

Orthogonal bottlenecks constrain RL encoder features to low-dimensional subspaces while preserving expressivity and gradient dynamics under linear realizability when dimension exceeds the value function's intrinsic rank.

citing papers explorer

Showing 50 of 58 citing papers.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark
cs.CL · 2024-06-27 · unverdicted · none · ref 28
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
Heavy-Ball Q-Learning with Residual Weighting Correction
cs.LG · 2026-06-25 · unverdicted · none · ref 18
Corrected heavy-ball Q-learning with convergence and acceleration guarantees is derived via switched linear system and joint spectral radius analysis, extended to linear function approximation.
CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy
cs.RO · 2026-06-10 · unverdicted · none · ref 40
CHORUS adapts a single VLA backbone for decentralized control of diverse robot teams, achieving 64-point gains over from-scratch decentralized baselines and outperforming centralized methods in real-world tasks using only local observations.
Expected Free Energy-based Planning as Variational Inference
cs.AI · 2026-06-09 · unverdicted · none · ref 205
EFE-based planning is formulated as variational free energy minimization with epistemic priors, decomposing into expected plan costs plus a complexity term.
What Type of Inference is Active Inference?
cs.AI · 2026-06-03 · unverdicted · none · ref 223
EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.
From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments
cs.LG · 2026-06-02 · unverdicted · none · ref 168
Derives an SDE describing the infinitesimal change in state distribution at each gradient step for neural actor-critic RL in continuous environments under vanishing learning rate in the infinite width limit.
Coordination Graphs for Constrained Multi-Agent Reinforcement Learning
cs.AI · 2026-06-01 · conditional · none · ref 7
CG-CMARL decomposes constrained multi-agent RL into pairwise coordination graphs with shared Q-functions, using Max-Sum message passing and a Lagrangian multiplier to coordinate actions and trace Pareto fronts scalably.
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation
cs.LG · 2026-05-18 · unverdicted · none · ref 74
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
Inline Critic Steers Image Editing
cs.CV · 2026-05-12 · conditional · none · ref 33
Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum
cs.LG · 2026-02-02 · unverdicted · none · ref 40
Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.
Variational Sequential Optimal Experimental Design using Reinforcement Learning
stat.ML · 2023-06-17 · unverdicted · none · ref 61
vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.
Generalization in offline RL: The structure is more important than the amount of pessimism
cs.LG · 2026-07-02 · unverdicted · none · ref 32
In offline RL, the structure of pessimism (set by dataset coverage) matters more for generalization than its amount; a symmetric overly pessimistic value function can outperform a non-symmetric mildly pessimistic one.
Episodic-to-Semantic Consolidation Without Identity Drift
cs.AI · 2026-07-02 · unverdicted · none · ref 24
A deterministic episodic-to-semantic consolidation function with a structural lemma proving identity invariance, demonstrated in synthetic experiments on an embodied service agent.
Parametric Open Source Games
cs.GT · 2026-06-25 · unverdicted · none · ref 39
Introduces parametric open-source games as continuous analogues of program equilibria, proves equilibrium existence, and derives an exact coupling threshold for cooperation in symmetric 2x2 games under gradient ascent.
SMR: Scheduler with Multi-Channel Map-Encoded Reinforcement Learning for Radio Telescopes
astro-ph.IM · 2026-06-25 · unverdicted · none · ref 14
SMR uses multi-channel map-encoded reinforcement learning to achieve roughly 10% better time utilization than greedy baselines for single-dish radio telescope scheduling.
Identifying structural design principles shaping the computational abilities of recurrent neural networks
q-bio.NC · 2026-06-22 · unverdicted · none · ref 14
Local 2- and 3-cycles enhance RNN computational capacity for Boolean functions, predicted by structural statistics, while adding interneurons boosts large networks.
NASDAQ: Normalized Observation Space Dynamics-Augmented Q-Learning
cs.LG · 2026-06-19 · unverdicted · none · ref 2
NASDAQ normalizes observations in an online RL setting so that dynamics prediction losses are balanced across dimensions, yielding competitive performance with lower wall-time than prior model-based and self-predictive methods.
Formalizing Task-Space Complexity for Zero-Shot Generalization
cs.LG · 2026-06-18 · unverdicted · none · ref 4
Introduces signed divergence to bound generalization gaps and defines task-space complexity as the minimum source contexts needed for ε-coverage under local smoothness, with set-cover reduction and empirical validation on LQR and DRL systems.
Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization
cs.LG · 2026-06-10 · unverdicted · none · ref 44
RL training disrupts gradient-based adversarial attacks by inducing unstable low-magnitude gradients that limit the effectiveness of methods like PGD within practical budgets.
Dmsh: A Multi-Agent Reinforcement Learning Framework for All-Quad Mesh Generation
math.NA · 2026-06-09 · unverdicted · none · ref 25
Dmsh is a new multi-agent RL framework that formulates mesh generation as an MDP and uses three coordinated agents plus curriculum learning to produce globally conforming all-quad meshes without post-processing.
Rollout-Level Advantage-Prioritized Experience Replay for GRPO
cs.LG · 2026-06-03 · conditional · none · ref 48
Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.
ReviewGuard: Aligning LLM-Assisted Peer Review with Long-Term Scientific Impact
cs.DL · 2026-05-29 · unverdicted · none · ref 36
ReviewGuard aligns LLM peer reviews with future citations via impact-aligned RL, achieving Spearman ρ=0.776 on rejected-then-published AI/ML papers versus 0.492 for human reviewers and flagging 5.6× more high-impact cases.
When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control
cs.LG · 2026-05-26 · unverdicted · none · ref 6
Benchmark study finds calibrated rule-based controller outperforms six DRL algorithms on cost for adaptive resource control across workloads, with action-space mismatch explaining large differences in constraint violations.
Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning
cs.LG · 2026-05-25 · unverdicted · none · ref 3
Orthogonal bottlenecks constrain RL encoder features to low-dimensional subspaces while preserving expressivity and gradient dynamics under linear realizability when dimension exceeds the value function's intrinsic rank.
DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations
cs.AI · 2026-05-23 · unverdicted · none · ref 24
DemoEvolve bootstraps harness evolution with demonstrations to achieve more stable and effective edits than self-rollout search in sparse-feedback environments like Balatro.
Understanding Goal Generalisation in Sequential Reinforcement Learning
cs.LG · 2026-05-22 · unverdicted · none · ref 43
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
Curriculum reinforcement learning with measurable task representation learning
cs.LG · 2026-05-22 · unverdicted · none · ref 35
A VAE-based latent task representation enables automatic curriculum generation in CRL for non-Euclidean navigation tasks, outperforming interpolation and GAN-based methods in experiments.
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
cs.RO · 2026-05-19 · accept · none · ref 18 · 2 links
ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical comparisons of online and offline-to-online RL algorithms.
Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models
cs.LG · 2026-05-14 · unverdicted · none · ref 1
Critic-Driven Voronoi State Partitioning distills deep RL policies into piecewise-linear models by iteratively adding linear subpolicies in high-value-error regions identified by the critic.
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
cs.LG · 2026-05-13 · unverdicted · none · ref 16
R2R2 introduces a non-centered regularization objective for SPL that addresses conflicts with spectral properties, leading to better performance on continuous control tasks at high UTD ratios.
Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning
cs.AI · 2026-05-12 · unverdicted · none · ref 37
MAVIC corrects Bellman backups at instruction boundaries by adjusting the incoming objective and restoring continuation value, enabling consistent estimation under stochastic instruction switching in cooperative MARL.
Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State
cs.AI · 2026-05-07 · unverdicted · none · ref 10
In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic policy with a KL penalty to a learned market prior.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI · 2026-05-07 · unverdicted · none · ref 27 · 2 links
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
Vanishing L2 regularization for the softmax Multi Armed Bandit
cs.LG · 2026-05-05 · unverdicted · none · ref 19
Vanishing L2 regularization yields provable convergence for softmax MAB policies and improves empirical performance.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
cs.LG · 2026-05-04 · unverdicted · none · ref 77
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control
cs.LG · 2026-05-01 · unverdicted · none · ref 1
SAVGO unifies representation learning, value estimation, and policy optimization by embedding state-action pairs such that cosine similarity reflects action-value similarity, enabling similarity-kernel-guided policy improvement.
A Systematic Review and Taxonomy of Reinforcement Learning-Model Predictive Control Integration for Linear Systems
eess.SY · 2026-04-22 · unverdicted · none · ref 9
This review synthesizes existing RL-MPC integration methods for linear systems into a taxonomy across RL roles, algorithms, MPC formulations, costs, and domains while identifying recurring patterns and practical challenges.
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
cs.LG · 2019-10-01 · conditional · none · ref 10
AWR learns policies via advantage-weighted supervised regression on actions, achieving competitive off-policy performance on Gym tasks and strong results from static data alone.
Attentive Multi-Task Deep Reinforcement Learning
cs.LG · 2019-07-05 · unverdicted · none · ref 21
Attention mechanism dynamically groups task knowledge at state granularity in multi-task DRL to enable positive transfer and avoid negative transfer, matching or exceeding prior methods with fewer parameters.
State Representation Matters in Deep Reinforcement Learning: Application to Energy Trading
cs.LG · 2026-06-25 · unverdicted · none · ref 10
Combining absolute, relative, and forecast price features in the state for Double DQN agents improves arbitrage performance and cross-zone transfer in pumped-storage hydro trading compared to single feature families.
Deep Reinforcement Learning for Minimum Zero-Forcing Sets
cs.LG · 2026-06-16 · unverdicted · none · ref 14
SD-ZFS adapts the S2V-DQN architecture to the minimum zero-forcing set problem and shows improved performance over greedy heuristics on varied graph datasets.
Learning Empirically Admissible Neural Heuristics for Combinatorial Search
cs.LG · 2026-06-03 · unverdicted · none · ref 9
Presents a framework for training empirically admissible neural heuristics via underestimating Bellman operator, asymmetric loss, and validation calibration offset, reporting reduced node expansions with no observed admissibility violations on small puzzles.
Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning
cs.LG · 2026-06-03 · unverdicted · none · ref 32
Eligibility traces in deep RL create a peak bias by amplifying distal TD errors into gradient shocks that fixed-step SGD cannot normalize, leading to overestimation of peak-reward trajectories and a mechanistic account of the peak-end rule.
MileStone: A Multi-Objective Compiler Phase Ordering Framework for Graph-based IR-Level Optimization
cs.PL · 2026-05-22 · unverdicted · none · ref 33
MileStone models compiler phase ordering as a multi-objective optimization problem using graph representations, GNN predictions, and RL agents to find Pareto-optimal pass sequences under user constraints.
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
cs.AI · 2026-05-11 · unverdicted · none · ref 29 · 2 links
RankQ augments temporal-difference Q-learning with a multi-term self-supervised ranking loss to enforce structured action ordering, yielding competitive or better results than prior methods on D4RL and large gains in vision-based robot fine-tuning.
When Does Non-Uniform Replay Matter in Reinforcement Learning?
cs.LG · 2026-05-11 · unverdicted · none · ref 21 · 3 links
Non-uniform replay helps most when replay volume is low; high-entropy sampling remains important, and a truncated geometric distribution delivers better sample efficiency with negligible overhead.
GIFT: Global stabilisation via Intrinsic Fine Tuning
cs.LG · 2026-04-25 · unverdicted · none · ref 9
GIFT fine-tunes deep RL policies with a stability-focused reward to improve global stability while preserving task performance.
Artifacts as Memory Beyond the Agent Boundary
cs.AI · 2026-04-09 · unverdicted · none · ref 45
Artifacts in the environment can reduce the memory an RL agent needs to represent its history, as shown by a mathematical proof and experiments with spatial paths.
AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling
cs.DC · 2026-03-12 · unverdicted · none · ref 59
AGMARL-DKS uses per-node multi-agent RL with GNN state representations and stress-aware lexicographical ordering to outperform the default Kubernetes scheduler on fault tolerance, utilization, and cost for batch and mission-critical workloads.
Morphology-Aware Graph Reinforcement Learning for Tensegrity Robot Locomotion
cs.RO · 2025-10-30 · unverdicted · none · ref 5
A GNN-augmented SAC policy that encodes tensegrity topology as a graph improves sample efficiency and enables zero-shot sim-to-real locomotion on a 3-bar tensegrity robot.
