{"total":12,"items":[{"citing_arxiv_id":"2606.31691","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FastDSAC: Enhancing Policy Plasticity via Constrained Exploration for Scalable Humanoid Locomotion","primary_cat":"cs.RO","submitted_at":"2026-06-30T14:04:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"FastDSAC adds a truncated Gaussian policy constraint to distributional actor-critic methods to preserve network plasticity and accelerate training for scalable humanoid locomotion in parallel sampling setups.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30072","ref_index":100,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ACPO: Agent-Chained Policy Optimization for Multi-Agent Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-06-29T10:04:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ACPO decomposes the joint policy gradient into per-agent terms allowing independent actor training that collectively forms a joint gradient step in CTDE-based MARL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29209","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AnyBody: Free-Form Whole-Body Humanoid Control from Arbitrary Keypoint Guidance","primary_cat":"cs.RO","submitted_at":"2026-06-28T05:20:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AnyBody distills a privileged teacher tracker into a latent unit-sphere representation and uses a masked transformer to drive humanoid control from arbitrary keypoint 
subsets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30313","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms","primary_cat":"cs.RO","submitted_at":"2026-05-28T17:53:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniLab is a CPU/GPU heterogeneous system for robot RL training using MuJoCoUni and MotrixSim backends that reports 3-10x end-to-end efficiency improvements and cross-platform compatibility beyond CUDA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10236","ref_index":32,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Does Non-Uniform Replay Matter in Reinforcement Learning?","primary_cat":"cs.LG","submitted_at":"2026-05-11T09:11:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Non-uniform replay helps most when replay volume is low; high-entropy sampling remains important, and a truncated geometric distribution delivers better sample efficiency with negligible overhead.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Truncated Geometric; we discuss entropy interpretation in Appendix E.3. practically relevant hyperparameters and both reflect compute-data tradeoffs in modern off-policy RL systems [27, 8]. As such, contemporary algorithms operate across a wide range of UTD regimes and batch size settings.* We compare uniform replay to recency-biased Truncated Geometric sampling using SimbaV2 [18] on 13 HumanoidBench tasks [32]. We sweep UTD values {2, 1, 1/2, 1/4, 1/8, 1/16} and batch sizes {256, 128, 64, 32, 16}. 
As UTD or batch size decreases, replay volume decreases, entering regimes where each transition is replayed relatively few times. We detail the setup in Appendix C. Results. As shown in Figure 3, uniform and recency-biased replay perform similarly at high UTD values and large batch sizes."},{"citing_arxiv_id":"2604.25508","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dyna-Style Safety Augmented Reinforcement Learning: Staying Safe in the Face of Uncertainty","primary_cat":"cs.LG","submitted_at":"2026-04-28T11:14:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Dyna-SAuR learns scalable safety filters and policies from an uncertainty-aware model, cutting failures by two orders of magnitude on CartPole and MuJoCo Walker tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06497","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hyperfastrl: Hypernetwork-based reinforcement learning for unified control of parametric chaotic PDEs","primary_cat":"cs.CE","submitted_at":"2026-04-07T21:58:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hypernetworks map a forcing parameter directly to policy weights in an RL framework, enabling unified stabilization of the Kuramoto-Sivashinsky equation across regimes with KAN architectures showing strongest extrapolation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"DQN demonstrated that a single agent can learn directly from pixels and reach human-competitive Atari performance [18]. AlphaGo showed that deep RL combined with search can solve long-horizon strategic planning at superhuman level in Go [38]. 
In robotics and humanoid control, recent high-throughput actor-critic pipelines have produced agile and robust locomotion behaviors [39]. In autonomous-driving decision stacks, deep RL has been used for tactical control tasks such as lane-change and merge decision-making under dynamic multi-agent traffic interactions [40]. Closely related learning-based advances include deep-network methods for high-dimensional PDE computation [41, 42] and reinforcement-learning-based controller design for hybrid UAV flight [43]."},{"citing_arxiv_id":"2604.04539","ref_index":75,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control","primary_cat":"cs.LG","submitted_at":"2026-04-06T09:03:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Simultaneously, because critic targets depend on the critic's own predictions, approximation and extrapolation errors at poorly supported state-action pairs compound across updates [80, 87]. Prior work has primarily addressed each challenge in isolation. To improve speed, one line of work scales data throughput via parallel simulation and large replay buffers [75, 74, 62]. For example, FastTD3 [75] achieves strong wall-clock efficiency in humanoid locomotion but relies on small networks (∼0.2M parameters), which limits its asymptotic performance. Scaling to larger networks is difficult in this setting, as increased model capacity exacerbates instability under bootstrapped training. 
To improve stability, a second line of work constrains value-function sensitivity by bounding feature, weight,"},{"citing_arxiv_id":"2603.15956","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors","primary_cat":"cs.RO","submitted_at":"2026-03-16T22:12:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.12612","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control","primary_cat":"cs.LG","submitted_at":"2026-03-13T03:27:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FastDSAC enables state-of-the-art maximum entropy RL for high-dimensional humanoid control via entropy redistribution per dimension and improved continuous value estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.11019","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Relative Entropy Pathwise Policy Optimization","primary_cat":"cs.LG","submitted_at":"2025-07-15T06:24:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"REPPO is an on-policy RL method that combines pathwise policy gradients with relative entropy constraints to achieve stable training and high sample efficiency without replay 
buffers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.15953","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation","primary_cat":"cs.RO","submitted_at":"2025-06-19T01:38:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViTacFormer learns a cross-modal visuo-tactile latent space with autoregressive tactile prediction and an easy-to-hard curriculum, then uses the representation for imitation learning that yields ~50% higher success and the first reported 11-stage, 2.5-minute autonomous dexterous tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}