{"total":11,"items":[{"citing_arxiv_id":"2605.17678","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"On Gaussian approximation for entropy-regularized Q-learning with function approximation","primary_cat":"stat.ML","submitted_at":"2026-05-17T22:23:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Establishes n^{-1/4} Gaussian approximation in convex distance for averaged entropy-regularized Q-learning with linear function approximation and polynomial stepsizes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12206","ref_index":30,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"On the Importance of Multistability for Horizon Generalization in Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-12T14:45:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multistability is necessary for temporal horizon generalization in POMDPs, sufficient in simple tasks along with transient dynamics in complex ones, while monostable parallelizable RNNs like SSMs and gated linear RNNs fail by construction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"23-30. IEEE, 2017. [28] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99-134, 1998. [29] Matthew J Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. In AAAI fall symposia, volume 45, page 141, 2015. [30] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 
Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015. [31] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap,"},{"citing_arxiv_id":"2605.11042","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Towards Model-Free Learning in Dynamic Population Games: An Application to Karma Economies","primary_cat":"cs.GT","submitted_at":"2026-05-11T08:39:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Model-free DQN learning achieves suboptimality bounds of O(1/sqrt(Ns)) + O(1/N) in Karma DPGs at equilibrium, and deep RL combined with fictitious play empirically reaches near-Stationary Nash Equilibrium from scratch.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"fully coupled DPG setting has received little attention; this paper takes a step towards addressing this gap with the following contributions: • Motivated by the plug-and-play nature of Karma economies, where new participants can join an established system, we analyze the single-agent setting in which a novel agent joins a Karma DPG already at a SNE configuration and wishes to learn a good policy via Deep Q-Networks (DQN) [26] with ε-greedy exploration, without knowledge of the game model. 
Leveraging recent results on the convergence of DQN [45], we establish a suboptimality bound for the learned policy, composed of a DQN approximation error of order O(1/√Ns) and a mean field perturbation error of order O(1/N), where Ns is the replay buffer size and N is the population size."},{"citing_arxiv_id":"2605.08019","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners","primary_cat":"cs.AI","submitted_at":"2026-05-08T17:07:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"subsequent levels reveal new rules incrementally. The Interactive Catalogue A lets readers try each game in the browser and browse all participant and LRM agent gameplay replays. Project page: https://botcs.github.io/reason-to-play/ Video games have been equally central to artificial intelligence. The Deep RL era began with DQN achieving human-level performance on Atari [5], and games have remained a guiding benchmark as the field moved toward model-based agents that learn internal world models to support planning [6-8]. More recently, Large Language Models (LLMs) have become the dominant paradigm [9], despite persistent criticism of their limited multi-step reasoning and planning capabilities [10]. 
Large Reasoning Models (LRMs) respond by generating explicit chains of thought [11-13]."},{"citing_arxiv_id":"2605.07057","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Integrating Causal DAGs in Deep RL: Activating Minimal Markovian States with Multi-Order Exposure","primary_cat":"cs.LG","submitted_at":"2026-05-08T00:12:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A procedure builds provably minimal Markovian states from a longitudinal causal graph, but deep RL requires multi-order historical state exposure (MOSE) to realize gains over minimal or fixed-window baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Empirical results in Section 5 show that MOSE combined with the minimal Markovian representation performed as well or better than MOSE alone. Remark 4.3 (Choice of update-value-function). In this paper, we focus on state space construction from raw observations. The resulting state space can be used for any deep RL algorithm, including but not limited to DQN [11] and SAC [59]. 
5 Experiments We tested the benefits of MOSE (Algorithm 1) and Causal-MOSE compared with two common ways of constructing state with time dependence: 1) Reward Parent: Defining only reward parents as state, 2) Window Policy: Defining all observations within a time window w as state [11, 12] and 3) DAG-State: minimal valid state representation constructed by Algorithm F."},{"citing_arxiv_id":"2605.06228","ref_index":14,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Soft Deterministic Policy Gradient with Gaussian Smoothing","primary_cat":"cs.LG","submitted_at":"2026-05-07T13:21:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discretized-reward variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05511","ref_index":54,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Non-Myopic Active Feature Acquisition via Pathwise Policy Gradients","primary_cat":"cs.LG","submitted_at":"2026-05-06T23:24:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NM-PPG optimizes non-myopic acquisition policies for costly features by enabling pathwise gradients via continuous relaxation and straight-through rollouts in POMDPs, outperforming SOTA baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08182","ref_index":21,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Quantile Geometry Regularization for Distributional Reinforcement 
Learning","primary_cat":"cs.LG","submitted_at":"2026-05-05T09:38:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RQIQN introduces a Wasserstein DRO-based correction to Bellman quantile targets that enlarges distributional spread without altering risk-neutral averages.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We evaluate RQIQN in a representative safety-critical marine navigation environment [9], where bounded workspaces, static obstacles, and vortex-induced flow disturbances are simulated according to physically motivated environmental settings [20]. Following the original learning-based evaluation protocol, we deploy the learned policy for unmanned surface vehicle control and compare RQIQN against IQN and DQN [21]. Figure 2: Qualitative trajectory results of RL agents. The yellow circle denotes the start position, and the yellow star denotes the goal. Magenta circles indicate static obstacles, while the background vector field represents vortex-induced flow disturbances. 
Trajectories for IQN and RQIQN are shown under the adaptive setting, where both achieve stronger performance."},{"citing_arxiv_id":"2605.02320","ref_index":18,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ANO: A Principled Approach to Robust Policy Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-04T08:15:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.15103","ref_index":68,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning","primary_cat":"cs.MA","submitted_at":"2025-09-18T16:03:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes HAD-MFC framework that decouples upper-level vulnerable agent selection from lower-level adversarial policy learning in large-scale MARL using Fenchel-Rockafellar transform and MDP reformulation with provable optimality preservation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.12622","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty","primary_cat":"cs.LG","submitted_at":"2025-06-14T20:36:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DR-SAC is the first actor-critic distributionally robust RL algorithm for 
offline continuous control that derives a convergent robust soft policy iteration and reports up to 9.8x higher rewards than SAC under perturbations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}