{"total":47,"items":[{"citing_arxiv_id":"2606.00880","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Task diversity produces systematic transfer but inhibits continual reinforcement learning","primary_cat":"cs.LG","submitted_at":"2026-05-30T20:31:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Task diversity along map, object, and hierarchy axes produces local transfer across shifts in a new continual RL benchmark but fails to sustain learning as the number of shifts grows.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30313","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms","primary_cat":"cs.RO","submitted_at":"2026-05-28T17:53:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniLab is a CPU/GPU heterogeneous system for robot RL training using MuJoCoUni and MotrixSim backends that reports 3-10x end-to-end efficiency improvements and cross-platform compatibility beyond CUDA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28812","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation","primary_cat":"cs.RO","submitted_at":"2026-05-27T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoP tactile representation with differentiable calibration enables zero-shot sim-to-real transfer and outperforms binary and raw-taxel baselines on peg-in-hole insertion and ball balancing with a multi-fingered 
hand.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23372","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Curriculum reinforcement learning with measurable task representation learning","primary_cat":"cs.LG","submitted_at":"2026-05-22T08:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A VAE-based latent task representation enables automatic curriculum generation in CRL for non-Euclidean navigation tasks, outperforming interpolation and GAN-based methods in experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21688","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Closed-Loop Sim-to-Real Reinforcement Learning for Deformable Microfiber Shape Control","primary_cat":"cs.RO","submitted_at":"2026-05-20T19:45:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A closed-loop sim-to-real RL policy trained in a simplified frictionless simulator achieves sub-millimeter microfiber shape control on physical hardware via visual feedback without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21458","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mind the Sim-to-Real Gap & Think Like a Scientist","primary_cat":"cs.AI","submitted_at":"2026-05-20T17:48:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper decomposes simulator value errors into identifiable shifts and irreducible residuals, shows passive learning fails on reachability, and introduces Fisher-SEP to minimize posterior value 
variance via targeted experiments.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Table A7: Paired-difference statistics for every headline ordering. Values in percentage points of oracle.n= 30paired trials.✓marks pairs for which the one-sided Wilcoxonp-value<0.05. Comparison Mean±half-width Paired-t95% CI Wilcoxonp 1-sided Sig. A-SOP−SOP (vending,T=100)+0.08 ±1.01 [−0.98,+1.13] 0.516- Fisher-SEP−A-SOP (vending,T=100)−12.62 ±1.42 [−14.10,−11.13] 1.000- KG-SEP−A-SOP (vending,T=100)−8.64 ±0.92 [−9.59,−7.68] 1.000- A-SOP−SOP (vending,T=200)+3.41 ±1.46 [+1.89,+4.93] 0.000✓ Fisher-SEP−A-SOP (vending,T=200)−8.42 ±1.86 [−10.36,−6.48] 1.000- KG-SEP−A-SOP (vending,T=200)−5.31 ±1.59 [−6.96,−3.65] 1.000- A-SOP−SOP (vending,T=400)+8.80 ±2.53 [+6.16,+11.44] 0.000✓ Fisher-SEP−A-SOP (vending,T=400)−7.38 ±2."},{"citing_arxiv_id":"2605.21330","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Robust Dexterous In-Hand Manipulation from Joint Sensors with Proprioceptive Transformer","primary_cat":"cs.RO","submitted_at":"2026-05-20T15:57:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A transformer policy distilled from a privileged RL teacher enables 3.1x faster real-world cube rotation on the ORCA hand using solely joint sensor data by extracting implicit object state from temporal joint patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16520","ref_index":164,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing","primary_cat":"cs.LG","submitted_at":"2026-05-15T18:14:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Recasts sampling-based 
nonconvex optimization as smoothed gradient descent to obtain non-asymptotic convergence guarantees and introduces the DIDA annealed algorithm that converges to the global optimum.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14350","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling","primary_cat":"cs.LG","submitted_at":"2026-05-14T04:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"⟨qt, g(θ)⟩+ C√ T .(38) Since qt ∈∆ k, ⟨qt, g(θ)⟩ is a convex combination of {gi(θ)}k i=1, we have ⟨qt, g(θ)⟩ ≤ maxi∈[k] gi(θ). Averaging overt= 1, . . . , Tand takingmin θ∈Θ on both sides yields: min θ∈Θ 1 T TX t=1 ⟨qt, g(θ)⟩ ≤min θ∈Θ max i∈[k] gi(θ).(39) Substituting equation 39 into equation 38: 1 T TX t=1 ⟨qt, g(θt)⟩ ≤min θ∈Θ max i∈[k] gi(θ) + C√ T .(40) Step 3: Combining the bounds.Substituting equation 40 into equation 37: max i∈[k] 1 T TX t=1 gi(θt)≤min θ∈Θ max i∈[k] gi(θ) + logk η + logk αT + αG2 2 + C√ T .(41) To balance the two α-dependent terms, we choose α∗ = √2 logk G √ T so that logk αT = αG2 2 . 
Substituting into equation 41: max i∈[k] 1 T TX t=1 gi(θt)≤min θ∈Θ max i∈[k] gi(θ) + logk η + G√2 logk+C√"},{"citing_arxiv_id":"2605.09789","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Zero-Shot Sim-to-Real Robot Learning: A Dexterous Manipulation Study on Reactive Catching","primary_cat":"cs.RO","submitted_at":"2026-05-10T22:20:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DRIS improves zero-shot sim-to-real transfer for reactive catching by maintaining and acting on sets of randomized dynamics instances instead of single instances per episode.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"strategies have evolved from visual and geometric proper- ties [49, 50] to kinematic [16] and dynamic parameters [33]. While active DR optimizes sampling distributions to im- prove transfer [39, 10, 28, 29, 47], it typically requires real- world rollouts. Although some established works strategically employ DR-either through curriculum-based scaling [4] or entropy maximization [48]-to enable zero-shot ransfer, our approach differs by training over the joint evolution of multi- ple randomized instances simultaneously to achieve zero-shot transfer for challenging tasks such as catching. Representing Uncertainty in States and Dynamics.Robotics has extensively studied uncertainty representation. 
Classical"},{"citing_arxiv_id":"2605.09183","ref_index":35,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift","primary_cat":"cs.LG","submitted_at":"2026-05-09T21:48:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SeqRejectron constructs a stopping rule with a small set of validator policies to achieve horizon-free sample complexity for selective imitation learning under arbitrary dynamics shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04373","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Worst-Case Discovery and Runtime Protection for RL-Based Network Controllers","primary_cat":"cs.NI","submitted_at":"2026-05-06T00:42:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReGuard discovers network scenarios where RL controllers perform 43-64% worse than achievable and reduces those gaps by 79-85% with lightweight rule-based protection that preserves normal performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03125","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Taming the Curses of Multiagency in Robust Markov Games with Large State Space through Linear Function Approximation","primary_cat":"cs.LG","submitted_at":"2026-05-04T20:04:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"The work gives the first algorithms for general robust Markov games with linear function approximation whose sample complexity breaks the curse of multiagency for large state spaces in both generative and online 
settings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"∀s∈ S:f i,s πi(s), π−i(s);V i,h+1 \u0001 =E a∼π(s) \u0002 ri,h(s,a) \u0003 +E ai∼πi(s) \u0002 inf U σi \u0010 P π−i h,s,ai \u0011 P Vi,h+1 \u0003 .(23) Based on these payoffs, we can introduce the best-response correspondence mapping ϕ as follows: for any π:S → Q i∈[n] ∆(Ai), ϕ(π) := \b u:S 7→ Q i∈[n] ∆(Ai)|u i(s)∈argmax π′ i(s)∈∆(Ai) fi,s(π′ i(s), π−i(s);V i,h+1),∀(i, s)∈[n]× S . (24) For this one-step game, the (joint) product policy space is X := n π:S → Q i∈[n] ∆(Ai) o . Since fixed points of ϕ correspond to Nash equilibria (NE), Theorem 4 yields the existence of NE once its conditions are verified. To show the three conditions, we begin by showing that X is a compact and convex subset of a convex Hausdorff linear topological space."},{"citing_arxiv_id":"2604.25459","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning","primary_cat":"cs.RO","submitted_at":"2026-04-28T10:05:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for consistent environments.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"by Vision-Language-Action (VLA) models [25, 62, 4] and Vision-Language-Navigation (VLN) models [9, 5, 57, 58]. However, tasks involving complex dynamics and intermittent contacts rely heavily on simulation-based reinforcement learn- ing (RL) to acquire skills in an unsupervised manner. 
Early attempts to incorporate visual inputs into RL were often constrained by conventional, small-scale simulations [2, 37, 63], where low simulation throughput hindered the stable acquisition of complex skills. While recent advancements in massive parallel simulation have enabled sophisticated policy optimization for locomotion and dexterous manipulation, these frameworks primarily rely on proprioceptive states or point clouds due to the prohibitive computational overhead and lim-"},{"citing_arxiv_id":"2604.25126","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HANDFUL: Sequential Grasp-Conditioned Dexterous Manipulation with Resource Awareness","primary_cat":"cs.RO","submitted_at":"2026-04-28T02:04:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HANDFUL learns resource-aware grasps using finger contact rewards and curriculum learning to improve success on sequential dexterous tasks in simulation and on a real LEAP hand.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24018","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Betting for Sim-to-Real Performance Evaluation","primary_cat":"cs.RO","submitted_at":"2026-04-27T03:58:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Betting mechanisms can yield provably more accurate and efficient estimates of real-world robot behavior than Monte Carlo sampling under specified conditions, with practical approximations demonstrated on synthetic data and a robotic manipulator task.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16513","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SynthPID: P&ID digitization from 
Topology-Preserving Synthetic Data","primary_cat":"cs.CV","submitted_at":"2026-04-15T09:14:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Topology-preserving synthetic P&IDs generated by seeding from real drawings enable models trained solely on synthetics to achieve 63.8% edge mAP on real P&ID benchmarks, closing most of the gap to real-data training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11138","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ViserDex: Visual Sim-to-Real for Robust Dexterous In-hand Reorientation","primary_cat":"cs.RO","submitted_at":"2026-04-13T07:50:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A framework using 3D Gaussian Splatting for visual domain randomization enables robust monocular RGB-based dexterous in-hand reorientation on real hardware for multiple objects under varied lighting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10351","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Trajectory-based actuator identification via differentiable simulation","primary_cat":"cs.RO","submitted_at":"2026-04-11T21:36:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Differentiable simulation enables torque-sensor-free actuator model identification from trajectory data, achieving 1.88x better position tracking than a stand-trained baseline and 46% longer travel in downstream locomotion 
policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05954","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"You're Pushing My Buttons: Instrumented Learning of Gentle Button Presses","primary_cat":"cs.RO","submitted_at":"2026-04-07T14:46:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Training-time instrumentation with audio and privileged button-state signals produces contact policies that match success rates but apply lower forces using only vision and audio at inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04138","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Dexterous Grasping from Sparse Taxonomy Guidance","primary_cat":"cs.RO","submitted_at":"2026-04-05T14:53:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRIT learns dexterous grasping from sparse taxonomy guidance, achieving 87.9% success and better generalization to novel objects via a two-stage prediction-plus-policy approach.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"maps to produce grasp candidates and constrain policy learning. Our method follows this decoupled structure but additionally incorporates user intention into grasp planning. B. Level of Autonomy in Dexterous Manipulation Dexterous manipulation systems can also be categorized by their level of autonomy. Fully autonomous approaches, such as end-to-end reinforcement learning methods [10, 11, 12, 13], learn a direct mapping from observations to actions to optimize task performance. 
While effective across many tasks, these systems provide limited controllability after training, making it difficult to adjust behaviors during deployment. At the other extreme, teleoperation-based sys- tems provide fine-grained control by directly specifying hand"},{"citing_arxiv_id":"2603.22126","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling","primary_cat":"cs.RO","submitted_at":"2026-03-23T15:52:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROBOGATE applies adaptive boundary-focused sampling in simulation to discover robot policy failure boundaries, revealing a 97.65 percentage point performance gap for a VLA model between LIBERO and industrial scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.12243","ref_index":29,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies","primary_cat":"cs.RO","submitted_at":"2026-03-12T17:56:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HandelBot refines simulation policies via physical rollouts and residual RL to achieve precise bimanual piano playing, outperforming direct sim transfer by 1.8x with only 30 minutes of real data across five songs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.04531","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PTLD: Sim-to-real Privileged Tactile Latent Distillation for Dexterous 
Manipulation","primary_cat":"cs.RO","submitted_at":"2026-03-04T19:17:42+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01505","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum","primary_cat":"cs.LG","submitted_at":"2026-02-02T00:35:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.14617","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniCon: A Unified System for Efficient Robot Learning Transfers","primary_cat":"cs.RO","submitted_at":"2026-01-21T03:19:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniCon standardizes states and control logic into modular execution graphs for efficient transfer of learning controllers across heterogeneous robots, with lower latency than ROS.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.04831","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning","primary_cat":"cs.RO","submitted_at":"2025-11-06T21:43:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Isaac Lab is a unified GPU-native 
platform combining high-fidelity physics, photorealistic rendering, multi-frequency sensors, domain randomization, and learning pipelines for scalable multi-modal robot policy training.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"abs/1710.06537. 26 49 Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning [72] Aleksei Petrenko, Arthur Allshire, Gavriel State, Ankur Handa, and Viktor Makoviychuk. DexPBT: Scaling up dexterous manipulation for hand-arm systems with population based training. InRSS, 2023. URLhttps://arxiv.org/abs/2305.12127. 19, 26, 41 [73] Pixar Animation Studios. Universal scene description (openusd), 2016. URLhttps://openusd.org. Accessed: 2025-09-16. 4 [74] Tifanny Portela, Andrei Cramariuc, Mayank Mittal, and Marco Hutter. Whole-body end-effector pose tracking.Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2025. 31 [75] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann."},{"citing_arxiv_id":"2510.17640","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2025-10-20T15:21:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RESample uses exploratory sampling guided by a lightweight Coverage Function to expand VLA training data coverage, yielding 12% performance gains on LIBERO and real-world tasks with 10-20% added samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.03599","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Act Through Contact: A Unified View of Multi-Task Robot 
Learning","primary_cat":"cs.RO","submitted_at":"2025-10-04T01:23:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A single goal-conditioned RL policy trained on contact plans performs multiple gaits and bimanual manipulation tasks on quadruped and humanoid robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.18455","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Geometry-Aware Nonprehensile Pushing and Pulling with Dexterous Hands","primary_cat":"cs.RO","submitted_at":"2025-09-22T22:25:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GD2P generates and learns dexterous hand poses for nonprehensile pushing and pulling by combining contact-guided sampling, physics-based filtering, and a geometry-conditioned diffusion model, demonstrated on Allegro and LEAP hands in real-world tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.18719","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2025-05-24T14:42:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.06182","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Apple: Toward General Active Perception via 
Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2025-05-09T16:49:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"APPLE is an RL framework that jointly optimizes a transformer perception module and policy via a unified objective for general active perception, with evaluations on tactile MNIST regression and classification tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.15481","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Play Piano in the Real World","primary_cat":"cs.RO","submitted_at":"2025-03-19T17:56:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A Sim2Real2Sim learning pipeline enables a real-world dexterous robot to play piano pieces including Happy Birthday and Ode to Joy with an average F1-score of 0.881.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.12173","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SkillTree: Explainable Skill-Based Deep Reinforcement Learning for Long-Horizon Control Tasks","primary_cat":"cs.LG","submitted_at":"2024-11-19T02:35:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkillTree reduces continuous action spaces to discrete skills via a differentiable decision tree in a hierarchical policy, achieving comparable performance to neural skill methods with added skill-level explainability in robotic arm tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.04832","ref_index":88,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Plasticity Loss in Deep 
Reinforcement Learning: A Survey","primary_cat":"cs.AI","submitted_at":"2024-11-07T16:13:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Survey unifies the definition of plasticity loss in DRL, taxonomizes over 50 mitigations, identifies evaluation gaps, and finds general regularization often outperforms domain-specific methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.15134","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Proximal Policy Distillation","primary_cat":"cs.LG","submitted_at":"2024-07-21T12:08:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PPD integrates PPO into policy distillation so the student collects and uses its own rewards, yielding better sample efficiency and robustness than standard student-distill or teacher-distill on ATARI, Mujoco, and Procgen tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.12193","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Continual Domain Randomization","primary_cat":"cs.RO","submitted_at":"2024-03-18T19:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Continual Domain Randomization trains RL policies sequentially on randomization parameter subsets with continual learning to achieve robust sim-to-real transfer in robotic reaching and grasping.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.05284","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Analyzing Adversarial Inputs in Deep Reinforcement 
Learning","primary_cat":"cs.LG","submitted_at":"2024-02-07T21:58:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces the Adversarial Rate metric and associated tools to systematically evaluate and visualize the impact of adversarial inputs on DRL policies using formal verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.16797","ref_index":246,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution","primary_cat":"cs.CL","submitted_at":"2023-09-28T19:01:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2302.11550","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Robot Learning with Semantically Imagined Experience","primary_cat":"cs.RO","submitted_at":"2023-02-22T18:47:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Augmenting robot datasets via diffusion-based semantic inpainting enables manipulation policies to solve unseen tasks with new objects and improves robustness to novel distractors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2207.05221","ref_index":92,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Language Models (Mostly) Know What They 
Know","primary_cat":"cs.CL","submitted_at":"2022-07-11T22:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2206.06282","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Analysis of Randomization Effects on Sim2Real Transfer in Reinforcement Learning for Robotic Manipulation Tasks","primary_cat":"cs.RO","submitted_at":"2022-06-13T16:12:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A benchmark study finds that increased randomization improves Sim2Real transfer in robotic RL despite trade-offs in simulation learning, with full randomization and fine-tuning outperforming other approaches on the real robot.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2112.00861","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A General Language Assistant as a Laboratory for Alignment","primary_cat":"cs.CL","submitted_at":"2021-12-01T22:24:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2108.10470","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Isaac Gym: High Performance GPU-Based Physics Simulation For Robot 
Learning","primary_cat":"cs.RO","submitted_at":"2021-08-24T01:38:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Isaac Gym achieves 2-3 orders of magnitude faster robot policy training by keeping physics simulation and PyTorch-based RL entirely on GPU with direct buffer sharing.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tional requirements and 2) limited simulation speed. These problems are especially challenging when learning long-horizon behaviours for robots with high degrees of freedom. Popular physics engines like MuJoCo[6], PyBullet[7], DART[8], Drake[9], V-Rep[10] etc. need large CPU clusters to solve challenging RL tasks naturally face these bottlenecks. For instance, in [11], almost 30,000 CPU cores (920 worker machines with 32 cores each) were used to train a robot to solve the Rubik's Cube task using RL. In a similar task, [5] used a cluster of 384 systems with 6144 CPU cores, plus 8 NVIDIA V100 GPUs, and required 30 hours of training for RL to converge. 
One way to speed-up simulation and training is to make use of hardware accelerators."},{"citing_arxiv_id":"2102.01293","ref_index":182,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Laws for Transfer","primary_cat":"cs.LG","submitted_at":"2021-02-02T04:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2009.03393","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generative Language Modeling for Automated Theorem Proving","primary_cat":"cs.LG","submitted_at":"2020-09-07T19:50:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1912.06680","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Dota 2 with Large Scale Deep Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2019-12-13T19:56:40+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"to an 80% winrate, and it is easier to learn how to consistently 
defeat bad agents. lost. In Dota 2, the key measure of human dexterity is reaction time. OpenAI Five can react to a game event in 217ms on average. This quantity does not vary depending on game state. It is difficult to find reliable data on Dota 2 professionals' reaction times, but typical human visual reaction time is approximately 250ms[26]. See Appendix L for more details. While human evaluation is the ultimate goal, we also need to evaluate our agents continually during training in an automated way. We achieve this by comparing them to a pool of fixed reference agents with known skill using the TrueSkill rating system[27]. In our TrueSkill environment, a rating of 0 corresponds to a random agent, and a difference of approximately 8."}],"limit":50,"offset":0}