LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
super hub Mixed citations
Human-level control through deep reinforcement learning
Mixed citation behavior. Most common role is background (43%).
hub tools
citation-role summary
citation-polarity summary
authors
co-cited works
representative citing papers
Corrected heavy-ball Q-learning with convergence and acceleration guarantees is derived via switched linear system and joint spectral radius analysis, extended to linear function approximation.
CHORUS adapts a single VLA backbone for decentralized control of diverse robot teams, achieving 64-point gains over from-scratch decentralized baselines and outperforming centralized methods in real-world tasks using only local observations.
EFE-based planning is formulated as variational free energy minimization with epistemic priors, decomposing into expected plan costs plus a complexity term.
EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.
Derives an SDE describing the infinitesimal change in state distribution at each gradient step for neural actor-critic RL in continuous environments under vanishing learning rate in the infinite width limit.
CG-CMARL decomposes constrained multi-agent RL into pairwise coordination graphs with shared Q-functions, using Max-Sum message passing and a Lagrangian multiplier to coordinate actions and trace Pareto fronts scalably.
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.
vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.
Introduces parametric open-source games as continuous analogues of program equilibria, proves equilibrium existence, and derives an exact coupling threshold for cooperation in symmetric 2x2 games under gradient ascent.
SMR uses multi-channel map-encoded reinforcement learning to achieve roughly 10% better time utilization than greedy baselines for single-dish radio telescope scheduling.
Local 2- and 3-cycles enhance RNN computational capacity for Boolean functions, predicted by structural statistics, while adding interneurons boosts large networks.
NASDAQ normalizes observations in an online RL setting so that dynamics prediction losses are balanced across dimensions, yielding competitive performance with lower wall-time than prior model-based and self-predictive methods.
Introduces signed divergence to bound generalization gaps and defines task-space complexity as the minimum source contexts needed for ε-coverage under local smoothness, with set-cover reduction and empirical validation on LQR and DRL systems.
RL training disrupts gradient-based adversarial attacks by inducing unstable low-magnitude gradients that limit the effectiveness of methods like PGD within practical budgets.
Dmsh is a new multi-agent RL framework that formulates mesh generation as an MDP and uses three coordinated agents plus curriculum learning to produce globally conforming all-quad meshes without post-processing.
Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.
ReviewGuard aligns LLM peer reviews with future citations via impact-aligned RL, achieving Spearman ρ=0.776 on rejected-then-published AI/ML papers versus 0.492 for human reviewers and flagging 5.6× more high-impact cases.
Benchmark study finds calibrated rule-based controller outperforms six DRL algorithms on cost for adaptive resource control across workloads, with action-space mismatch explaining large differences in constraint violations.
Orthogonal bottlenecks constrain RL encoder features to low-dimensional subspaces while preserving expressivity and gradient dynamics under linear realizability when dimension exceeds the value function's intrinsic rank.
DemoEvolve bootstraps harness evolution with demonstrations to achieve more stable and effective edits than self-rollout search in sparse-feedback environments like Balatro.
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
citing papers explorer
-
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
AWR learns policies via advantage-weighted supervised regression on actions, achieving competitive off-policy performance on Gym tasks and strong results from static data alone.
-
Attentive Multi-Task Deep Reinforcement Learning
Attention mechanism dynamically groups task knowledge at state granularity in multi-task DRL to enable positive transfer and avoid negative transfer, matching or exceeding prior methods with fewer parameters.
-
Optimal Use of Experience in First Person Shooter Environments
Empirical tests in VizDoom show multiple DQN updates per step do not improve performance after learning rate adjustment, with a 4:1 update-to-step ratio optimal before significant degradation.