super hub Mixed citations

Rusu, Joel Veness, Marc G

Alex Graves, Andreas K. Fidjeland, Andrei A. Rusu, Charles Beattie, David Silver, Georg Ostrovski + 2 more · 2015 · Nature · DOI 10.1038/nature14236

Mixed citation behavior. Most common role is background (43%).

29 Pith papers citing it

22.6k external citations · Crossref

Background 43% of classified citations

open at publisher browse 29 citing papers more from Alex Graves

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 4 method 2 baseline 1

citation-polarity summary

background 3 use method 2 baseline 1 unclear 1

authors

Alex Graves Andreas K. Fidjeland Andrei A. Rusu Charles Beattie David Silver Georg Ostrovski Joel Veness Koray Kavukcuoglu Marc G. Bellemare Martin Riedmiller Stig Petersen Volodymyr Mnih

co-cited works

representative citing papers

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

Inline Critic Steers Image Editing

cs.CV · 2026-05-12 · conditional · novelty 7.0

Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

cs.LG · 2026-02-02 · unverdicted · novelty 7.0

Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.

Variational Sequential Optimal Experimental Design using Reinforcement Learning

stat.ML · 2023-06-17 · unverdicted · novelty 7.0

vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

Curriculum reinforcement learning with measurable task representation learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

A VAE-based latent task representation enables automatic curriculum generation in CRL for non-Euclidean navigation tasks, outperforming interpolation and GAN-based methods in experiments.

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

cs.RO · 2026-05-19 · accept · novelty 6.0 · 2 refs

ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical comparisons of online and offline-to-online RL algorithms.

R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

R2R2 introduces a non-centered regularization objective for SPL that addresses conflicts with spectral properties, leading to better performance on continuous control tasks at high UTD ratios.

Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic policy with a KL penalty to a learned market prior.

Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

Vanishing L2 regularization for the softmax Multi Armed Bandit

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

Vanishing L2 regularization yields provable convergence for softmax MAB policies and improves empirical performance.

A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

SAVGO unifies representation learning, value estimation, and policy optimization by embedding state-action pairs such that cosine similarity reflects action-value similarity, enabling similarity-kernel-guided policy improvement.

A Systematic Review and Taxonomy of Reinforcement Learning-Model Predictive Control Integration for Linear Systems

eess.SY · 2026-04-22 · unverdicted · novelty 6.0

This review synthesizes existing RL-MPC integration methods for linear systems into a taxonomy across RL roles, algorithms, MPC formulations, costs, and domains while identifying recurring patterns and practical challenges.

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

cs.LG · 2019-10-01 · conditional · novelty 6.0

AWR learns policies via advantage-weighted supervised regression on actions, achieving competitive off-policy performance on Gym tasks and strong results from static data alone.

Attentive Multi-Task Deep Reinforcement Learning

cs.LG · 2019-07-05 · unverdicted · novelty 6.0

Attention mechanism dynamically groups task knowledge at state granularity in multi-task DRL to enable positive transfer and avoid negative transfer, matching or exceeding prior methods with fewer parameters.

MileStone: A Multi-Objective Compiler Phase Ordering Framework for Graph-based IR-Level Optimization

cs.PL · 2026-05-22 · unverdicted · novelty 5.0

MileStone models compiler phase ordering as a multi-objective optimization problem using graph representations, GNN predictions, and RL agents to find Pareto-optimal pass sequences under user constraints.

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

cs.AI · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

RankQ augments temporal-difference Q-learning with a multi-term self-supervised ranking loss to enforce structured action ordering, yielding competitive or better results than prior methods on D4RL and large gains in vision-based robot fine-tuning.

When Does Non-Uniform Replay Matter in Reinforcement Learning?

cs.LG · 2026-05-11 · unverdicted · novelty 5.0 · 3 refs

Non-uniform replay helps most when replay volume is low; high-entropy sampling remains important, and a truncated geometric distribution delivers better sample efficiency with negligible overhead.

GIFT: Global stabilisation via Intrinsic Fine Tuning

cs.LG · 2026-04-25 · unverdicted · novelty 5.0

GIFT fine-tunes deep RL policies with a stability-focused reward to improve global stability while preserving task performance.

Artifacts as Memory Beyond the Agent Boundary

cs.AI · 2026-04-09 · unverdicted · novelty 5.0

Artifacts in the environment can reduce the memory an RL agent needs to represent its history, as shown by a mathematical proof and experiments with spatial paths.

AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling

cs.DC · 2026-03-12 · unverdicted · novelty 5.0

AGMARL-DKS uses per-node multi-agent RL with GNN state representations and stress-aware lexicographical ordering to outperform the default Kubernetes scheduler on fault tolerance, utilization, and cost for batch and mission-critical workloads.

Morphology-Aware Graph Reinforcement Learning for Tensegrity Robot Locomotion

cs.RO · 2025-10-30 · unverdicted · novelty 5.0

A GNN-augmented SAC policy that encodes tensegrity topology as a graph improves sample efficiency and enables zero-shot sim-to-real locomotion on a 3-bar tensegrity robot.

citing papers explorer

Showing 29 of 29 citing papers.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark cs.CL · 2024-06-27 · unverdicted · none · ref 28
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation cs.LG · 2026-05-18 · unverdicted · none · ref 74
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
Inline Critic Steers Image Editing cs.CV · 2026-05-12 · conditional · none · ref 33
Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum cs.LG · 2026-02-02 · unverdicted · none · ref 40
Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.
Variational Sequential Optimal Experimental Design using Reinforcement Learning stat.ML · 2023-06-17 · unverdicted · none · ref 61
vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.
Understanding Goal Generalisation in Sequential Reinforcement Learning cs.LG · 2026-05-22 · unverdicted · none · ref 43
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
Curriculum reinforcement learning with measurable task representation learning cs.LG · 2026-05-22 · unverdicted · none · ref 35
A VAE-based latent task representation enables automatic curriculum generation in CRL for non-Euclidean navigation tasks, outperforming interpolation and GAN-based methods in experiments.
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders cs.RO · 2026-05-19 · accept · none · ref 18 · 2 links
ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical comparisons of online and offline-to-online RL algorithms.
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning cs.LG · 2026-05-13 · unverdicted · none · ref 16
R2R2 introduces a non-centered regularization objective for SPL that addresses conflicts with spectral properties, leading to better performance on continuous control tasks at high UTD ratios.
Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State cs.AI · 2026-05-07 · unverdicted · none · ref 10
In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic policy with a KL penalty to a learned market prior.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities cs.AI · 2026-05-07 · unverdicted · none · ref 27 · 2 links
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
Vanishing L2 regularization for the softmax Multi Armed Bandit cs.LG · 2026-05-05 · unverdicted · none · ref 19
Vanishing L2 regularization yields provable convergence for softmax MAB policies and improves empirical performance.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management cs.LG · 2026-05-04 · unverdicted · none · ref 77
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control cs.LG · 2026-05-01 · unverdicted · none · ref 1
SAVGO unifies representation learning, value estimation, and policy optimization by embedding state-action pairs such that cosine similarity reflects action-value similarity, enabling similarity-kernel-guided policy improvement.
A Systematic Review and Taxonomy of Reinforcement Learning-Model Predictive Control Integration for Linear Systems eess.SY · 2026-04-22 · unverdicted · none · ref 9
This review synthesizes existing RL-MPC integration methods for linear systems into a taxonomy across RL roles, algorithms, MPC formulations, costs, and domains while identifying recurring patterns and practical challenges.
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning cs.LG · 2019-10-01 · conditional · none · ref 10
AWR learns policies via advantage-weighted supervised regression on actions, achieving competitive off-policy performance on Gym tasks and strong results from static data alone.
Attentive Multi-Task Deep Reinforcement Learning cs.LG · 2019-07-05 · unverdicted · none · ref 21
Attention mechanism dynamically groups task knowledge at state granularity in multi-task DRL to enable positive transfer and avoid negative transfer, matching or exceeding prior methods with fewer parameters.
MileStone: A Multi-Objective Compiler Phase Ordering Framework for Graph-based IR-Level Optimization cs.PL · 2026-05-22 · unverdicted · none · ref 33
MileStone models compiler phase ordering as a multi-objective optimization problem using graph representations, GNN predictions, and RL agents to find Pareto-optimal pass sequences under user constraints.
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking cs.AI · 2026-05-11 · unverdicted · none · ref 29 · 2 links
RankQ augments temporal-difference Q-learning with a multi-term self-supervised ranking loss to enforce structured action ordering, yielding competitive or better results than prior methods on D4RL and large gains in vision-based robot fine-tuning.
When Does Non-Uniform Replay Matter in Reinforcement Learning? cs.LG · 2026-05-11 · unverdicted · none · ref 21 · 3 links
Non-uniform replay helps most when replay volume is low; high-entropy sampling remains important, and a truncated geometric distribution delivers better sample efficiency with negligible overhead.
GIFT: Global stabilisation via Intrinsic Fine Tuning cs.LG · 2026-04-25 · unverdicted · none · ref 9
GIFT fine-tunes deep RL policies with a stability-focused reward to improve global stability while preserving task performance.
Artifacts as Memory Beyond the Agent Boundary cs.AI · 2026-04-09 · unverdicted · none · ref 45
Artifacts in the environment can reduce the memory an RL agent needs to represent its history, as shown by a mathematical proof and experiments with spatial paths.
AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling cs.DC · 2026-03-12 · unverdicted · none · ref 59
AGMARL-DKS uses per-node multi-agent RL with GNN state representations and stress-aware lexicographical ordering to outperform the default Kubernetes scheduler on fault tolerance, utilization, and cost for batch and mission-critical workloads.
Morphology-Aware Graph Reinforcement Learning for Tensegrity Robot Locomotion cs.RO · 2025-10-30 · unverdicted · none · ref 5
A GNN-augmented SAC policy that encodes tensegrity topology as a graph improves sample efficiency and enables zero-shot sim-to-real locomotion on a 3-bar tensegrity robot.
Gymnasium: A Standard Interface for Reinforcement Learning Environments cs.LG · 2024-07-24 · accept · none · ref 24
Gymnasium establishes a standardized API for RL environments to improve interoperability, reproducibility, and ease of development in reinforcement learning.
Spectral Alignment in Forward-Backward Representations via Temporal Abstraction cs.LG · 2026-03-20 · unverdicted · none · ref 9
Temporal abstraction functions as a low-pass filter on transition dynamics to lower the effective rank of successor representations while bounding value function error in forward-backward learning.
AI-Powered Surrogate Modelling for Multiscale Combustion: A Critical Review and Opportunities physics.chem-ph · 2026-04-28 · unverdicted · none · ref 41
A critical review of AI surrogate models for multiscale combustion that compares supervised, unsupervised, and physics-guided methods, identifies transferability and consistency challenges, and outlines future opportunities.
Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers math.OC · 2026-04-13 · unverdicted · none · ref 94
A tutorial framing deep learning as a complement to optimization for sequential decision-making under uncertainty, with applications in supply chains, healthcare, and energy.
Optimal Use of Experience in First Person Shooter Environments cs.LG · 2019-06-24 · unverdicted · none · ref 1
Empirical tests in VizDoom show multiple DQN updates per step do not improve performance after learning rate adjustment, with a 4:1 update-to-step ratio optimal before significant degradation.

Rusu, Joel Veness, Marc G

hub tools

citation-role summary

citation-polarity summary

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer