{"total":17,"items":[{"citing_arxiv_id":"2606.29201","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Behavior Uncloning: Distilling Mode Redirection into Policy Weights without Inference-Time Steering","primary_cat":"cs.RO","submitted_at":"2026-06-28T05:01:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoRE improves robot policy success rates by 44 percentage points by distilling mode redirection into weights, matching filtered retraining performance without inference overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23551","ref_index":112,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Goal-Conditioned Agents that Learn Everything All at Once","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:17:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21822","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Implicit Safety Alignment from Crowd Preferences","primary_cat":"cs.AI","submitted_at":"2026-05-20T23:44:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A hierarchical framework extracts implicit safety criteria from crowd preferences and composes them via high-level policy to reduce safety violations in downstream RL tasks without explicit safety 
rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16725","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models","primary_cat":"cs.AI","submitted_at":"2026-05-16T00:18:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Alice uses preservation conflicts from failed candidate updates to create class-stratified hypotheses and guide exploration, improving executable world-model learning under prior misalignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16054","ref_index":192,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making","primary_cat":"cs.LG","submitted_at":"2026-05-15T15:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12655","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-12T19:01:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAVIC corrects Bellman backups at instruction boundaries by adjusting the incoming objective and restoring continuation value, enabling consistent estimation under stochastic 
instruction switching in cooperative MARL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12261","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Delay-Empowered Causal Hierarchical Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-12T15:28:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DECHRL models causal structures and stochastic delay distributions within hierarchical RL and incorporates them into a delay-aware empowerment objective to improve performance under temporal uncertainty.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Empowerment is a task-agnostic intrinsic reward that measures an agent's influence over its future, guiding it toward states with greater control, without external rewards. Empowerment-based methods can be grouped into the following categories: (1) Controllability-based. VIMIM [26] and VIC [27] maximize empowerment directly as an intrinsic reward to drive exploration. (2) Diversity-driven. DIAYN [28] and DADS [29] discover diverse and predictable skills using mutual information between latent skills and future states. (3) Latent planning and representation. IPE [30] and VGCRL [31] shape latent spaces to focus on controllable features. (4) Causal-aware. ECL [32] incorporates causal modeling into empowerment to improve interpretability. 
Most empowerment-based methods rely on explicit modeling of environment"},{"citing_arxiv_id":"2605.01862","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL","primary_cat":"cs.LG","submitted_at":"2026-05-03T13:11:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"c: Applying distribution mismatch (Assumption B.5), we have: $\\mathbb{E}_{s \\sim d_h^{\\pi^\\star_\\beta}}[D_{\\mathrm{KL}}] = \\sum_s d_h^{\\pi^\\star_\\beta}(s) \\cdot D_{\\mathrm{KL}}(P_\\beta(\\cdot|s) \\| \\hat{\\pi}(\\cdot|s))$ (43) $= \\sum_s \\frac{d_h^{\\pi^\\star_\\beta}(s)}{d_h^\\beta(s)} \\cdot d_h^\\beta(s) \\cdot D_{\\mathrm{KL}}$ (44) $\\le c^\\star_\\beta \\cdot \\mathbb{E}_{s \\sim d_h^\\beta}[D_{\\mathrm{KL}}]$. (45) Combining a-c, we have: $\\mathbb{E}_{s \\sim d_h^{\\pi^\\star_\\beta}}[\\mathrm{TV}(P_\\beta \\| \\hat{\\pi})] \\le \\sqrt{\\frac{c^\\star_\\beta}{2} \\mathbb{E}_{s \\sim d_h^\\beta}[D_{\\mathrm{KL}}(P_\\beta \\| \\hat{\\pi})]} = \\sqrt{\\frac{c^\\star_\\beta}{2} L(\\hat{\\pi})}$. (46) By MLE analysis (Liu et al., 2025), with probability $\\ge 1 - \\delta$: $L(\\hat{\\pi}) \\le O\\left(\\sqrt{\\frac{c \\cdot \\log(|\\Pi|/\\delta)}{N}}\\right) + \\delta_{\\mathrm{approx}}$. (47) Summing over $H$ stages: $\\sum_{h=1}^{H} \\mathbb{E}[\\text{Term (I)}] \\le H \\sqrt{\\frac{c^\\star_\\beta}{2}} \\left( O\\left(\\left(\\frac{\\log(|\\Pi|/\\delta)}{N}\\right)^{1/4}\\right) + \\sqrt{\\delta_{\\mathrm{approx}}} \\right)$. (48) 
Fourth, bounding Term (II) of Equation (40). By Assumption B.8, we have: $\\mathrm{TV}(\\hat{\\pi}(\\cdot|Q^\\star) \\| \\hat{\\pi}(\\cdot|\\hat{Q}_\\tau)) \\le L_\\pi |Q^\\star - \\hat{Q}_\\tau|$. (49) From Theorem 3.1, we have: $|Q^\\star - \\hat{Q}_\\tau| \\le \\epsilon_\\tau + L_Q$"},{"citing_arxiv_id":"2605.01242","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs","primary_cat":"cs.LG","submitted_at":"2026-05-02T04:46:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using solely a policy evaluation oracle.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24558","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hierarchical Behaviour Spaces","primary_cat":"cs.AI","submitted_at":"2026-04-27T14:47:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hierarchical Behaviour Spaces uses linear combinations of reward functions to induce expressive behavior spaces in hierarchical RL, yielding strong performance on NetHack primarily through better exploration rather than long-term planning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20381","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Distributional Value Estimation Without Target Networks for Robust Quality-Diversity","primary_cat":"cs.LG","submitted_at":"2026-04-22T09:31:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"QDHUAC is a distributional, target-free QD-RL method that enables stable high-UTD training and competitive performance on Brax locomotion tasks using far 
fewer environment steps than prior approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15414","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-16T17:06:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"$H[b_t] \\leftarrow \\mathrm{RefreshArchive}(H[b_t], \\theta_t, e_t, f_\\phi, M)$; end. /* 4. Online Latent Space Maintenance */ $M, H \\leftarrow \\mathrm{BoundaryMaintenance}(f_\\phi, M, A, R, H)$; $\\theta_{\\mathrm{prev}} \\leftarrow \\theta_t$; end. The first five coordinates are normalized geometric and control features: $x^{01}_t = \\mathrm{clip}\\left(\\frac{x_t}{\\max(W-1,1)}, 0, 1\\right)$, (5) $y^{01}_t = \\mathrm{clip}\\left(\\frac{y_t}{\\max(H-1,1)}, 0, 1\\right)$, (6) $d^{01}_t = \\mathrm{clip}\\left(\\frac{d_t}{3}, 0, 1\\right)$, (7) $\\tau^{01}_t = \\mathrm{clip}\\left(\\frac{\\text{step\\_count}_t}{\\text{max\\_steps}}, 0, 1\\right)$, (8) $a^{01}_t = \\begin{cases} \\mathrm{clip}\\left(\\frac{a_t}{n_{\\mathrm{act}}-1}, 0, 1\\right), & n_{\\mathrm{act}} > 1, \\\\ a_t, & \\text{otherwise}, \\end{cases}$ (9) where $(x_t, y_t)$ is the agent position, $d_t \\in \\{0,1,2,3\\}$ is the agent direction, and $a_t$ is the executed action. The remaining six coordinates are sticky binary task-event indicators provided by the upstream event wrapper. 
Thus, the encoder does not consume raw observations directly; it consumes a sequence of compact per-step behavioral features."},{"citing_arxiv_id":"2502.02834","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks","primary_cat":"cs.LG","submitted_at":"2025-02-05T02:31:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TAVT improves OOD task generalization in meta-RL by preserving task characteristics in virtual tasks via metric learning and using state regularization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.08097","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Intrinsically motivated collective motion","primary_cat":"physics.bio-ph","submitted_at":"2019-07-18T14:49:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Future State Maximisation (FSM) leads to emergent collective motion features like cohesion and co-alignment in agent simulations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.06143","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Neural Embedding for Physical Manipulations","primary_cat":"cs.LG","submitted_at":"2019-07-13T22:57:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Generative model with normalized pairwise distance constraint discovers output space topologies from sparse data and outperforms GANs and VAEs by avoiding mode 
collapse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.10667","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives","primary_cat":"cs.LG","submitted_at":"2019-06-25T17:04:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RL policies decompose into information-regularized primitives that compete by requesting state information amounts, with the greediest one acting, yielding better generalization than flat or hierarchical baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.09205","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Continual Reinforcement Learning with Diversity Exploration and Adversarial Self-Correction","primary_cat":"cs.LG","submitted_at":"2019-06-21T15:44:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CDAN framework uses diversity exploration and adversarial self-correction for continual RL in continuous control, evaluated on new CAM environment with NSD metric showing 18.35% NSD improvement over baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}