Human-level control through deep reinforcement learning
Mixed citation behavior. Most common role is background (43%).
citing papers explorer
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
Heavy-Ball Q-Learning with Residual Weighting Correction
Corrected heavy-ball Q-learning with convergence and acceleration guarantees is derived via switched linear system and joint spectral radius analysis, and extended to linear function approximation.
-
CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy
CHORUS adapts a single VLA backbone for decentralized control of diverse robot teams, achieving 64-point gains over from-scratch decentralized baselines and outperforming centralized methods in real-world tasks using only local observations.
-
Expected Free Energy-based Planning as Variational Inference
EFE-based planning is formulated as variational free energy minimization with epistemic priors, decomposing into expected plan costs plus a complexity term.
-
What Type of Inference is Active Inference?
EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.
-
From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments
Derives an SDE describing the infinitesimal change in state distribution at each gradient step for neural actor-critic RL in continuous environments under vanishing learning rate in the infinite width limit.
-
Coordination Graphs for Constrained Multi-Agent Reinforcement Learning
CG-CMARL decomposes constrained multi-agent RL into pairwise coordination graphs with shared Q-functions, using Max-Sum message passing and a Lagrangian multiplier to coordinate actions and trace Pareto fronts scalably.
-
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
-
Inline Critic Steers Image Editing
Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
-
Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum
Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.
-
Variational Sequential Optimal Experimental Design using Reinforcement Learning
vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.
-
Generalization in offline RL: The structure is more important than the amount of pessimism
In offline RL, the structure of pessimism (set by dataset coverage) matters more for generalization than its amount; a symmetric overly pessimistic value function can outperform a non-symmetric mildly pessimistic one.
-
Episodic-to-Semantic Consolidation Without Identity Drift
Presents a deterministic episodic-to-semantic consolidation function, with a structural lemma proving identity invariance, demonstrated in synthetic experiments on an embodied service agent.
-
Parametric Open Source Games
Introduces parametric open-source games as continuous analogues of program equilibria, proves equilibrium existence, and derives an exact coupling threshold for cooperation in symmetric 2x2 games under gradient ascent.
-
SMR: Scheduler with Multi-Channel Map-Encoded Reinforcement Learning for Radio Telescopes
SMR uses multi-channel map-encoded reinforcement learning to achieve roughly 10% better time utilization than greedy baselines for single-dish radio telescope scheduling.
-
Identifying structural design principles shaping the computational abilities of recurrent neural networks
Local 2- and 3-cycles enhance RNN computational capacity for Boolean functions, predicted by structural statistics, while adding interneurons boosts large networks.
-
NASDAQ: Normalized Observation Space Dynamics-Augmented Q-Learning
NASDAQ normalizes observations in an online RL setting so that dynamics prediction losses are balanced across dimensions, yielding competitive performance with lower wall-time than prior model-based and self-predictive methods.
-
Formalizing Task-Space Complexity for Zero-Shot Generalization
Introduces signed divergence to bound generalization gaps and defines task-space complexity as the minimum source contexts needed for ε-coverage under local smoothness, with set-cover reduction and empirical validation on LQR and DRL systems.
-
Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization
RL training disrupts gradient-based adversarial attacks by inducing unstable low-magnitude gradients that limit the effectiveness of methods like PGD within practical budgets.
-
Dmsh: A Multi-Agent Reinforcement Learning Framework for All-Quad Mesh Generation
Dmsh is a new multi-agent RL framework that formulates mesh generation as an MDP and uses three coordinated agents plus curriculum learning to produce globally conforming all-quad meshes without post-processing.
-
Rollout-Level Advantage-Prioritized Experience Replay for GRPO
Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.
-
ReviewGuard: Aligning LLM-Assisted Peer Review with Long-Term Scientific Impact
ReviewGuard aligns LLM peer reviews with future citations via impact-aligned RL, achieving Spearman ρ=0.776 on rejected-then-published AI/ML papers versus 0.492 for human reviewers and flagging 5.6× more high-impact cases.
-
When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control
A benchmark study finds that a calibrated rule-based controller outperforms six DRL algorithms on cost for adaptive resource control across workloads, with action-space mismatch explaining large differences in constraint violations.
-
Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning
Orthogonal bottlenecks constrain RL encoder features to low-dimensional subspaces while preserving expressivity and gradient dynamics under linear realizability when dimension exceeds the value function's intrinsic rank.
-
DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations
DemoEvolve bootstraps harness evolution with demonstrations to achieve more stable and effective edits than self-rollout search in sparse-feedback environments like Balatro.
-
Understanding Goal Generalisation in Sequential Reinforcement Learning
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
-
Curriculum reinforcement learning with measurable task representation learning
A VAE-based latent task representation enables automatic curriculum generation in CRL for non-Euclidean navigation tasks, outperforming interpolation and GAN-based methods in experiments.
-
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical comparisons of online and offline-to-online RL algorithms.
-
Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models
Critic-Driven Voronoi State Partitioning distills deep RL policies into piecewise-linear models by iteratively adding linear subpolicies in high-value-error regions identified by the critic.
-
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
R2R2 introduces a non-centered regularization objective for SPL that addresses conflicts with spectral properties, leading to better performance on continuous control tasks at high UTD ratios.
-
Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning
MAVIC corrects Bellman backups at instruction boundaries by adjusting the incoming objective and restoring continuation value, enabling consistent estimation under stochastic instruction switching in cooperative MARL.
-
Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State
In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic policy with a KL penalty to a learned market prior.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
Vanishing L2 regularization for the softmax Multi Armed Bandit
Vanishing L2 regularization yields provable convergence for softmax MAB policies and improves empirical performance.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on goals-based wealth management (GBWM) problems delivers near-optimal dynamic strategies in 0.01 s, achieving 97.8% of the DP-optimal utility, and handles larger problems where DP fails.
-
SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control
SAVGO unifies representation learning, value estimation, and policy optimization by embedding state-action pairs such that cosine similarity reflects action-value similarity, enabling similarity-kernel-guided policy improvement.
-
A Systematic Review and Taxonomy of Reinforcement Learning-Model Predictive Control Integration for Linear Systems
This review synthesizes existing RL-MPC integration methods for linear systems into a taxonomy across RL roles, algorithms, MPC formulations, costs, and domains while identifying recurring patterns and practical challenges.
-
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
AWR learns policies via advantage-weighted supervised regression on actions, achieving competitive off-policy performance on Gym tasks and strong results from static data alone.
-
Attentive Multi-Task Deep Reinforcement Learning
An attention mechanism dynamically groups task knowledge at state granularity in multi-task DRL, enabling positive transfer while avoiding negative transfer, and matches or exceeds prior methods with fewer parameters.
-
State Representation Matters in Deep Reinforcement Learning: Application to Energy Trading
Combining absolute, relative, and forecast price features in the state for Double DQN agents improves arbitrage performance and cross-zone transfer in pumped-storage hydro trading compared to single feature families.
-
Deep Reinforcement Learning for Minimum Zero-Forcing Sets
SD-ZFS adapts the S2V-DQN architecture to the minimum zero-forcing set problem and shows improved performance over greedy heuristics on varied graph datasets.
-
Learning Empirically Admissible Neural Heuristics for Combinatorial Search
Presents a framework for training empirically admissible neural heuristics via an underestimating Bellman operator, an asymmetric loss, and a validation calibration offset, reporting reduced node expansions with no observed admissibility violations on small puzzles.
-
Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning
Eligibility traces in deep RL create a peak bias by amplifying distal TD errors into gradient shocks that fixed-step SGD cannot normalize, leading to overestimation of peak-reward trajectories and a mechanistic account of the peak-end rule.
-
MileStone: A Multi-Objective Compiler Phase Ordering Framework for Graph-based IR-Level Optimization
MileStone models compiler phase ordering as a multi-objective optimization problem using graph representations, GNN predictions, and RL agents to find Pareto-optimal pass sequences under user constraints.
-
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
RankQ augments temporal-difference Q-learning with a multi-term self-supervised ranking loss to enforce structured action ordering, yielding competitive or better results than prior methods on D4RL and large gains in vision-based robot fine-tuning.
-
When Does Non-Uniform Replay Matter in Reinforcement Learning?
Non-uniform replay helps most when replay volume is low; high-entropy sampling remains important, and a truncated geometric distribution delivers better sample efficiency with negligible overhead.
-
GIFT: Global stabilisation via Intrinsic Fine Tuning
GIFT fine-tunes deep RL policies with a stability-focused reward to improve global stability while preserving task performance.
-
Artifacts as Memory Beyond the Agent Boundary
Artifacts in the environment can reduce the memory an RL agent needs to represent its history, as shown by a mathematical proof and experiments with spatial paths.
-
AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling
AGMARL-DKS uses per-node multi-agent RL with GNN state representations and stress-aware lexicographical ordering to outperform the default Kubernetes scheduler on fault tolerance, utilization, and cost for batch and mission-critical workloads.
-
Morphology-Aware Graph Reinforcement Learning for Tensegrity Robot Locomotion
A GNN-augmented SAC policy that encodes tensegrity topology as a graph improves sample efficiency and enables zero-shot sim-to-real locomotion on a 3-bar tensegrity robot.