{"total":23,"items":[{"citing_arxiv_id":"2606.13076","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"$\\alpha$-fair heterogeneous agent reinforcement learning","primary_cat":"cs.MA","submitted_at":"2026-06-11T08:59:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Introduces α-fair HATRPO and HAPPO algorithms that integrate α-fairness into HATRL via a weighted advantage function while claiming to preserve convergence to Nash equilibria.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20658","ref_index":268,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Expected Free Energy-based Planning as Variational Inference","primary_cat":"cs.AI","submitted_at":"2026-06-09T08:09:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EFE-based planning is formulated as variational free energy minimization with epistemic priors, decomposing into expected plan costs plus a complexity term.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04935","ref_index":292,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What Type of Inference is Active Inference?","primary_cat":"cs.AI","submitted_at":"2026-06-03T14:24:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world 
validation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19425","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-19T06:23:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14366","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax","primary_cat":"cs.CL","submitted_at":"2026-05-14T04:47:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09214","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability","primary_cat":"cs.LG","submitted_at":"2026-05-09T23:17:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for 
forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08378","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforcement Learning for Scalable and Trustworthy Intelligent Systems","primary_cat":"cs.LG","submitted_at":"2026-05-08T18:36:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"∇_θ log π_θ(a_t|s_t) ) A^{π_θ}(s_t, a_t) ], (2.5) where τ = (s_0, a_0, s_1, a_1, ···) is a trajectory induced by policy π_θ. We denote the policy gradient by g for short. In practice, we can sample (s, a) ∼ ν^{π_θk} and obtain the unbiased estimate Â^{π_θk}(s, a) using Algorithm 3 in [44]. Natural Policy Gradient (NPG): At the k-th iteration, natural policy methods with a trust region [18] update policy parameters as follows: θ_{k+1} = arg max_θ E_{s,a}[ (π_θ(a|s) / π_θk(a|s)) A^{π_θk}(s, a) ] s.t. D(θ∥θ_k) ≤ δ, (2.6) where D(θ∥θ_k) = E_s[ D( π_θ(·|s) ∥ π_θk(·|s) ) ], (2.7) D(·) is the KL-divergence operation, and δ > 0 is the radius of the trust region. 
Practically, using the first-order Taylor expansion for the target value and the second-order Taylor expansion for the divergence constraint, ( 2."},{"citing_arxiv_id":"2604.23716","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Information-Theoretic Measures in AI: A Practical Decision Guide","primary_cat":"cs.AI","submitted_at":"2026-04-26T14:00:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"model parameters q is equivalent to minimizing D_KL(p_data ∥ q_θ) when the true label distribution p is fixed. KL divergence also appears as the regularization term in variational autoencoders (VAEs), constraining the learned posterior toward a prior [21], and as a policy-update constraint in trust-region reinforcement learning: TRPO imposes a hard KL bound between consecutive policies [41]; PPO replaces this with probability-ratio clipping as its primary mechanism but optionally adds a KL penalty, and KL regularization more generally improves the optimization landscape of RL objectives [25, 42]. Applications. 
Classic AI/ML: Knowledge distillation [16] minimizes D_KL(p_teacher ∥ p_student) to compress a large model's soft predictions into a smaller student; this is among the most"},{"citing_arxiv_id":"2604.19695","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Planning in entropy-regularized Markov decision processes and games","primary_cat":"cs.LG","submitted_at":"2026-04-21T17:17:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17706","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL","primary_cat":"cs.RO","submitted_at":"2026-04-20T01:36:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09035","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Advantage-Guided Diffusion for Model-Based Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-10T06:53:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample 
efficiency on MuJoCo tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"H > 0. The objective of the RL agent is to learn an optimal policy π⋆ maximizing the policy value V^π(s) = E_π[∑_{i=0}^∞ γ^i r_{t+i+1} | s_t = s]. Alternatively, we can define the optimal policy as the one maximizing the Q-value function Q^π(s, a) = E_π[∑_{i=0}^∞ γ^i r_{t+i+1} | s_t = s, a_t = a]. Several Policy Gradient methods combine the two value functions into a new objective [39], [40], the advantage function, defined as A^π(s, a) = Q^π(s, a) − V^π(s). The advantage function can be considered as a state-action value function using V^π as a baseline to reduce the variance. The value of a policy π can be defined as J(π) = E_{s∼ρ}[V^π(s)]. In this paper, we consider an MBRL setting, where we use a diffusion model to approximate the distribution of trajectories"},{"citing_arxiv_id":"2604.05394","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Neural Assistive Impulses: Synthesizing Exaggerated Motions for Physics-based Characters","primary_cat":"cs.AI","submitted_at":"2026-04-07T03:47:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A hybrid neural policy operating in impulse space enables physics-based characters to track exaggerated, dynamically infeasible motions that standard DRL methods cannot stabilize.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.19837","ref_index":141,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent","primary_cat":"cs.AI","submitted_at":"2026-02-23T13:39:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey provides a task-based formalization of meta-learning 
and meta-RL while chronicling algorithms that lead to DeepMind's Adaptive Agent.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"present MAML modifications for detecting faults in bearings [90], improving short-term load forecasting in scenarios where different clients require federated learning [41], and forecasting stock prices [25]. However, Transformers additionally have a demonstrable long-term memory of up to 1500 steps into the past [110], which particularly motivates to use transformers in memory-based meta-learning [103], [56], [179], [142], [141]. TrMRL [103] is such a Transformer architecture tailored for meta-RL. It extends RL2 by a Transformer architecture i.e., by self-attention, and fulfills all necessary properties of a meta-learner [103], i.e., 1. The fast adaptation of the multi-head attention serves as a task representation mechanism, since each self-attention head contextualizes the embeddings of the context c_i"},{"citing_arxiv_id":"2509.25424","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Polychromic Objectives for Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2025-09-29T19:32:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22963","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action 
Spaces","primary_cat":"cs.LG","submitted_at":"2025-09-26T21:53:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A method trains discrete diffusion policies for combinatorial RL by matching to a PMD-regularized target distribution, reporting SOTA performance and sample efficiency on DNA generation, macro-action, and multi-agent benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.09838","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives","primary_cat":"cs.LG","submitted_at":"2025-09-11T20:34:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.16474","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforcement Learning-based Control via Y-wise Affine Neural Networks (YANNs)","primary_cat":"eess.SY","submitted_at":"2025-08-22T15:42:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"YANN-RL initializes RL actor and critic networks with explicit multi-parametric linear MPC solutions via YANNs to start from linear optimal control performance and then learn nonlinear policies through online 
interaction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.06355","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Information-Geometric Approach to Artificial Curiosity","primary_cat":"cs.LG","submitted_at":"2025-04-08T18:04:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy, with geodesic interpolation on the occupancy manifold yielding a scalar-parameter family that includes count-based and max-entropy exploration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.06347","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Goal-Conditioned Decision Transformer for Multi-Goal Offline Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2024-10-08T20:35:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A Goal-Conditioned Decision Transformer is adapted for offline multi-goal RL and shown to outperform online baselines on a new Franka Emika Panda dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.09468","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Safe and Robust Autonomous Vehicle Platooning: A Self-Organizing Cooperative Control Framework","primary_cat":"cs.RO","submitted_at":"2024-08-18T13:27:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"TriCoD is a cooperative decision-making framework using twin-world deduction and adaptive switching between DRL and model-driven methods to 
enable safe, dynamic AV platooning in hybrid traffic.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.09436","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Temporal Transfer Learning for Traffic Optimization with Coarse-grained Advisory Autonomy","primary_cat":"cs.RO","submitted_at":"2023-11-27T21:18:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Temporal Transfer Learning selects source tasks for zero-shot transfer of RL policies to solve a range of coarse-grained advisory autonomy hold durations in traffic optimization more reliably than baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1509.02971","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Continuous control with deep reinforcement learning","primary_cat":"cs.LG","submitted_at":"2015-09-09T23:01:36+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competitively with full-information planning methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1506.02438","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"High-Dimensional Continuous Control Using Generalized Advantage Estimation","primary_cat":"cs.LG","submitted_at":"2015-06-08T11:12:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Generalized advantage estimation combined with trust region 
optimization enables stable neural network policy learning for complex continuous control from raw kinematics.","context_count":1,"top_context_role":"extension","top_context_polarity":"extend","context_text":"analogous to the one used to define TD(λ) (Sutton & Barto, 1998), however TD(λ) is an estimator of the value function, whereas here we are estimating the advantage function. There are two notable special cases of this formula, obtained by setting λ = 0 and λ = 1. GAE(γ, 0): Â_t := δ_t = r_t + γV(s_{t+1}) − V(s_t) (17) GAE(γ, 1): Â_t := ∑_{l=0}^∞ γ^l δ_{t+l} = ∑_{l=0}^∞ γ^l r_{t+l} − V(s_t) (18) GAE(γ, 1) is γ-just regardless of the accuracy of V, but it has high variance due to the sum of terms. GAE(γ, 0) is γ-just for V = V^{π,γ} and otherwise induces bias, but it typically has much lower variance. The generalized advantage estimator for 0 < λ < 1 makes a compromise between bias and variance, controlled by parameter λ. We've described an advantage estimator with two separate parameters γ and λ, both of which con-"}],"limit":50,"offset":0}