{"total":112,"items":[{"citing_arxiv_id":"2606.27872","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"S$^2$-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-26T09:13:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"S²-VLA uses a state-space model to maintain a belief state that produces dynamic gating weights for fusing visual, language, and action features, claiming better long-horizon manipulation than 7B models with only 2B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11324","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models","primary_cat":"cs.RO","submitted_at":"2026-06-09T18:07:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Embodied-R1.5 is an 8B EFM achieving SOTA on 16 of 24 embodied VLM benchmarks, fine-tunable to outperform leading VLAs, with claimed zero-shot real-robot generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09499","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Targeting World Models to Compromise Robot Learning Pipelines","primary_cat":"cs.RO","submitted_at":"2026-06-08T13:50:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream 
policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00537","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking","primary_cat":"cs.RO","submitted_at":"2026-05-30T05:11:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PACE dynamically selects execution horizons for action chunks in robot policies by detecting low-speed transition points in predicted speed profiles, raising success rates from 57.8% to 64.2% on 50 simulation tasks and from 50.7% to 70.4% in real-robot tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00439","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Physical Object Understanding with a Physically Controllable World Model","primary_cat":"cs.CV","submitted_at":"2026-05-30T00:10:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Autoregressive probabilistic world models trained on raw videos yield emergent object segmentation, 3D controllability, and physical relationship inference via multi-future motion correlation analysis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30957","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RDGen: Demonstration Generation for High-Quality Robot Learning via Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-05-29T07:53:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RDGen uses sim-to-real RL policies to generate smoother robot demonstrations that improve 
downstream VLA performance over human-collected data on pick-and-place tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30660","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies","primary_cat":"cs.LG","submitted_at":"2026-05-28T23:39:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30011","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies","primary_cat":"cs.CV","submitted_at":"2026-05-28T14:36:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISUALTHINK-VLA uses visual evidence tokens and selective routing to reach top success rates on VLA benchmarks while cutting reasoning latency from multi-second to sub-second levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23847","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Instrumentation for Imitation Learning: Enhancing Training Datasets for Clothes Hanger Insertion","primary_cat":"cs.RO","submitted_at":"2026-05-22T16:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Instrumented objects boost diffusion policy success in robotic 
hanger insertion by 14-25 percentage points over vision-only baselines, and augmenting datasets with instrumented expert rollouts lets a vision-only student match the instrumented expert.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23341","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sparse Compositional Flow Matching by geometric assembly from motion primitives","primary_cat":"cs.RO","submitted_at":"2026-05-22T07:55:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A compositional flow-matching model learns a dictionary of motion primitives with length masks and assembles them via sparse binary placement with geometric continuity losses, reporting SOTA results on two embodied trajectory datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22183","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Action with Visual Primitives","primary_cat":"cs.RO","submitted_at":"2026-05-21T08:52:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19924","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination 
Variations","primary_cat":"cs.RO","submitted_at":"2026-05-19T14:47:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoHIL adapts human-in-the-loop RL policies to new illumination conditions offline by combining world-model image relighting, illumination-retention replay, and anchored Bellman regularisation, improving shifted-light performance while preserving source performance on four real-robot tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19319","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution","primary_cat":"cs.CV","submitted_at":"2026-05-19T03:54:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SWEET is a one-shot sparse visual planning framework that progressively generates manipulation keyframes via image editing conditioned on language and spatial guidance, then converts them to actions with a diffusion predictor, showing better fidelity and lower cost than video models on DROID and Rob","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19029","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Distributionally Robust Control via Stein Variational Inference for Contact-Rich Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-18T18:54:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a Stein variational inference-based deterministic formulation for distributionally robust control in contact-rich robotic manipulation, reporting up to 3x improved robustness under parametric 
uncertainty.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17486","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization","primary_cat":"cs.RO","submitted_at":"2026-05-17T14:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17077","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning","primary_cat":"cs.RO","submitted_at":"2026-05-16T16:52:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16797","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices","primary_cat":"cs.CV","submitted_at":"2026-05-16T03:59:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer 
devices using a consistent interface and log format.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16054","ref_index":169,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making","primary_cat":"cs.LG","submitted_at":"2026-05-15T15:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15536","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SkiP: When to Skip and When to Refine for Efficient Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-15T02:16:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkiP introduces action relabeling and Motion Spectrum Keying to skip redundant steps in robot trajectories, cutting executed steps by 15-40% while maintaining success rates across 72 simulated and 3 real tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20223","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Why Latent Actions Fail, and How to Prevent It","primary_cat":"cs.CV","submitted_at":"2026-05-13T09:54:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Extending linear LAMs to model exogenous state shows standard reconstruction encodes future exogenous info in latent actions, while endogenous-focused spaces and 
auxiliary objectives like action-supervision enforce consistency across noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13119","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T07:40:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12416","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Aligning Flow Map Policies with Optimal Q-Guidance","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:12:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Figure 16:antmaze-giant-task5(OGBench). Navigate an ant to a nearby goal in a giant maze. C Algorithms Algorithm 1FLOWMAPQ-GUIDANCE(FMQ) Require:Offline policyu off r,1, online policyu θ r,1, criticsQ ϕ1 , Qϕ2, bufferD 1:foreach environment stepdo 2:a 1 ←a 0 +u θ 0,1(a0|s),a 0 ∼ N(0, I) 3:D ← D ∪ {(s, a 1, r, s′)} 4:Sample batch fromD; update critics via Eq. 
8 5:r∼ U[0,1);a 0 ∼ N(0, I);a r ←(1−r)a 0 +r a data 6:a 1 ←a r + (1−r)u off r,1(ar|s) 7:g← ∇ aQϕ1(s, a1)/(∥∇aQϕ1(s, a1)∥2 +κ 1) 8:η eff ←η/(1 +β ˜δcritic)▷Eq. 13 9:θ←θ−α∇ θ∥uθ r,1(ar|s)−sg(u off r,1(ar|s) +η eff g)∥2 10:end for Algorithm 2Q-GUIDEDBEAMSEARCH(QGBS) Require:Flow mapX θ r,1, criticQ ϕ, states, beamM, stepsK, branchesB, SNRρ, step sizeη 1:t ′ ←ρ/(1+ρ) 2:Sample{a m 0 }M m=1 ∼ N(0, I);a m 1 ←a m 0 +u θ 0,1(am 0 |s)for allm 3:fork= 1, . . . , Kdo 4:form= 1, . . . , Mandb= 1, . . . , Bdo 5:ε mb ∼ N(0, I) 6:ˆa mb 1 ←X θ t′,1 t′ am 1 + (1−t′)ε mb |s \u0001 ▷Re-noise & complete 7:end for 8:{a m 1 }M m=1 ←Top-M {ˆamb 1 }m,b;Q ϕ(s,ˆamb 1 ) \u0001 ▷Select bestMofM·B 9:a m 1 ←a m 1 +η∇ aQϕ(s, am 1 )/∥∇aQϕ(s, am 1 )∥2 for allm ▷Thm. 3.2 10:end for 11:returna arg maxm Qϕ(s,am 1 ) 1 [Figure: Success Rate vs. Steps (×10^6); panels: Can, Square, Cube-Double-T3, Cube-Double-T4, Cube-Triple-T3, Cube-Triple-T4, Scene-T4, Scene-T5, HMaze-Med-T3, HMaze-Med-T4, AMaze-Giant-T4]"},{"citing_arxiv_id":"2605.12334","ref_index":7,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforcing VLAs in Task-Agnostic World 
Models","primary_cat":"cs.AI","submitted_at":"2026-05-12T16:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"control (6-DoF end-effector poses + 1-DoF continuous gripper within [0, 0.1]). Our evaluation strictly adheres to the zero-target-data paradigm: the WM is never exposed to any downstream task data, and the VLA policy is improved from extremely-few-shot SFT base models entirely through imagination. Task-agnostic WM from play data. We first pre-train our WM on Open X-Embodiment datasets [7], which is then fine-tuned on approximately 4 hours of uncurated, teleoperated play data, collected via a master-slave control system. The tabletop workspace contains a rich variety of everyday 8 Figure 2: (a) Sample scenes from our collected play data spanning diverse object arrangements and tabletop layouts. 
(b) The four downstream evaluation tasks for VLA fine-tuning and RL."},{"citing_arxiv_id":"2605.12090","ref_index":132,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Robot-centric Teleoperation: QT-Opt [112], MIME [113], RoboNet [114], RoboTurk-Real [115], BridgeData [116], MT-Opt [117], BC-Z [118], RT-1 [119], Language-Table [120], BridgeData v2 [121], Jaco Play [122], Cable Routing Dataset [123], RH20T [124], OXE [125], DROID [126], RH20T-P [127], RoboMIND [128], ARIO [129], RoboData [130], DexCap [131], FuSe [132], AgiBot World [133], REASSEMBLE [134], OmniAction [135], UnifoLM-WBT [136]. UMI-style Human Demonstration: UMI [137], FastUMI [138], FastUMI-100K [139], RealOmin [140], Hoi! 
[141], RDT2 [142], ActiveUMI [143], exUMI [144], Tactile-Conditioned Diffusion Policy [145], DexUMI [146], UMI on Legs [147], HoMMI [148], MV-UMI [149]. Simulation Data: MimicGen [150], ManiSkill2 [151], RoboCasa [152], RoboTwin [153], DexMimicGen [154]"},{"citing_arxiv_id":"2605.11567","ref_index":21,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dynamic Execution Commitment of Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T05:52:58+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"• We demonstrate across multiple VLA benchmarks that A3 removes the need for task-specific horizon tuning and exposes a controllable trade-off between execution reliability and inference efficiency. 2 Related Work Vision-language-action model. VLA models [17, 18] aim to learn policies that map visual observations and language instructions [19, 20] directly to continuous control actions [21]. Early approaches [2] primarily relied on autoregressive or behavior-cloning policies that predict actions step by step. More recently, dual-system VLA architectures [1, 9] have emerged, decoupling high-level multimodal reasoning [19, 22] from low-level action generation. 
In these systems, a perception-language backbone [19] produces contextual representations, while a dedicated action expert is often implemented"},{"citing_arxiv_id":"2605.09613","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation","primary_cat":"cs.RO","submitted_at":"2026-05-10T15:51:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The field has converged on two dominant design patterns: single end-to-end models (RT-2, OpenVLA, π0) that process visual and language inputs in a single forward pass, and dual-system architectures (GR00T N1, Helix) that decouple a slower VLM reasoning component (System 2) from a faster visuomotor flow-matching policy (System 1). Open X-Embodiment [21] aggregated 1M+ trajectories across 22 embodiments to study cross-embodiment generalization at scale. 
LIBERO-PRO [14] quantified distribution-shift fragility: models exceeding 90% on standard LIBERO suites collapse to near zero under modest perturbations to object positions or scene context, a result that motivates domain-specific post-training rather than continued scaling of general corpora."},{"citing_arxiv_id":"2605.08774","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-09T08:00:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"pretraining and subtask-structured reasoning in improving one-shot transferability. 4.3 Reward Fine-tuning Setup. We evaluate ProcVLM as a progress-based reward model for downstream policy learning. We build on SJTU Evo-RL, an open-source offline RL framework that supports value inference and advantage-conditioned policy training [13]. Using π0.5 as the base policy [48], both SFT and RFT start from the same policy initialization and use the same training data. The SFT baseline follows the standard supervised fine-tuning pipeline. For RFT, ProcVLM assigns progress scores as dense rewards to training trajectories, which are used by Evo-RL to estimate advantages within a 50-step horizon. 
Within each task, the top 30% advantage samples are labeled as positive and the remaining"},{"citing_arxiv_id":"2605.06481","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-07T16:06:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"5: An improved open foundation model for generalist humanoid robots. NVIDIA Research Blog, June 2025. https://research.nvidia.com/labs/gear/gr00t-n1_5/. [57] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. In Robotics: Science and Systems (RSS), 2024. arXiv:2405.12213. [58] Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 2023. [59] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. 
Vo, Marc Szafraniec, Vasil Khalidov, Pierre"},{"citing_arxiv_id":"2605.06747","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HumanNet: Scaling Human-centric Video Learning to One Million Hours","primary_cat":"cs.CV","submitted_at":"2026-05-07T15:21:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9044-9053, 2021. [5] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling, 2024. 
[6] Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan, Antonin Raffin, Archit"},{"citing_arxiv_id":"2605.06311","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation","primary_cat":"cs.RO","submitted_at":"2026-05-07T14:13:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation, 2025. URL https://arxiv.org/abs/2506.18088. [6] Open X-Embodiment Collaboration et al. Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864, 2023. [7] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. 
Objaverse: A universe of annotated 3d objects, 2022."},{"citing_arxiv_id":"2605.04647","ref_index":55,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving","primary_cat":"cs.RO","submitted_at":"2026-05-06T08:52:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01544","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"An Efficient Metric for Data Quality Measurement in Imitation Learning","primary_cat":"cs.RO","submitted_at":"2026-05-02T17:16:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01477","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion","primary_cat":"cs.RO","submitted_at":"2026-05-02T14:52:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 
humanoid.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00397","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MiniVLA-Nav v1: A Multi-Scene Simulation Dataset for Language-Conditioned Robot Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-01T04:36:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MiniVLA-Nav v1 provides 1,174 episodes of language-instructed robot navigation in photorealistic simulations with RGB, depth, segmentation, and expert action data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00244","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-30T21:25:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lucid-XR uses XR-headset physics simulation and physics-guided video generation to create synthetic data that trains robot policies transferring zero-shot to unseen real-world manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"text-to-image generation to hallucinate new scenarios directly on real data, maintaining semantic consistency with robot objectives. Large-scale imitation learning and data aggregation. Multi-institution efforts aggregate demonstrations but remain embodiment-specific. Open X-Embodiment collects 50k+ demos for RT-X policies [39]; RH20T [40] and DROID [41] add 100k+ crowdsourced trajectories. 
Despite scale, diversity across robots, environments, and tasks remains limited, falling short of what is needed for truly generalist policies analogous to vision or NLP foundation models. 6 Conclusion In this work, we present Lucid-XR, a generative-AI-powered learning pipeline for producing generalizable visual policies for manipulation."},{"citing_arxiv_id":"2604.28197","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction","primary_cat":"cs.RO","submitted_at":"2026-04-30T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"whole-body interaction in furnished rooms [25], and joint human-object trajectories [15,32]. All operate offline and none produces real-time 3D scene state for closed-loop robotic control. 2.2 Robotic Manipulation Platforms Platforms for robot learning have scaled in task diversity and data volume. ALOHA [11, 60] demonstrates bimanual skills via teleoperation, SayCan [1] grounds language in affordances, and large datasets [9,23,36] aggregate trajectories across embodiments. Multi-robot coordination has progressed from industrial assembly [54] to collision-free multi-arm planning [20,28] and multi-user evaluation [56]. 
These platforms treat human state as external to the control loop; none integrates room-scale real-time human and object sensing with coordinated multi-robot actuation."},{"citing_arxiv_id":"2604.27472","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations","primary_cat":"cs.AI","submitted_at":"2026-04-30T06:14:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"motivates a temporal weighting scheme that emulates geometric sampling by assigning weights to multiple positive samples according to their temporal distance to the goal. Formally, for each positive sample j ∈ S(i) at trajectory timestep t_j (with length T_j) sharing the same language goal l_i, we define the temporal weight: q_{ij} = γ^{T_j − t_j} / Σ_{j′∈S(i)} γ^{T_{j′} − t_{j′}}. (10) This weighting scheme assigns exponentially larger weights to states closer to task completion and mirrors the decay of the geometric distribution. 
As we prove below, optimizing with these soft targets yields representations whose inner product ψ(l)⊤ϕ(s, a) is proportional to the log-discounted occupancy, recovering the same temporal structure as standard CRL."},{"citing_arxiv_id":"2604.26689","ref_index":5,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Atomic-Probe Governance for Skill Updates in Compositional Robot Policies","primary_cat":"cs.RO","submitted_at":"2026-04-29T13:56:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing full compositions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"None of this literature, to our knowledge, formalizes the question of what happens to existing compositions when one of the constituent skills is later updated. Generalist policies, evaluation, and continual learning. Vision-language-action models such as OpenVLA [1], Octo [2], π0 [3], and RT-2 [6], together with the Open X-Embodiment / RT-X collaboration's large heterogeneous datasets [5] and the DROID in-the-wild dataset [4], are explicitly designed for downstream fine-tuning, making post-deployment skill updates a routine event. 
Recent benchmarks evaluate such generalist policies in simulation [22] and in distributed real-world setups [23], but these evaluate policies as monolithic units rather than the post-update composition stability we target."},{"citing_arxiv_id":"2604.24182","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills","primary_cat":"cs.RO","submitted_at":"2026-04-27T08:44:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"have demonstrated the effectiveness of Transformer-based architectures in large-scale robot learning. Notable works such as Gato [18] and RT-2 [3] unify multi-modal inputs (i.e., images, text, and proprioception) into a single sequence, enabling a single policy to perform diverse tasks. Furthermore, initiatives like Octo [19] and Open X-Embodiment [20] have highlighted the potential of training across heterogeneous robot embodiments. In contrast, our work adheres to the Vision-Language-Action (VLA) paradigm, employing a Large Language Model (LLM) backbone to uniformly encode multi-modal inputs for action generation. 
This approach allows us to directly leverage the robust generalization capa-"},{"citing_arxiv_id":"2604.23001","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines","primary_cat":"cs.RO","submitted_at":"2026-04-24T20:41:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22551","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"QDTraj: Exploration of Diverse Trajectory Primitives for Articulated Objects Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-24T13:45:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QDTraj uses Quality-Diversity algorithms with sparse rewards to produce at least five times more diverse high-performing trajectories for articulated object manipulation than compared methods, validated across 30 objects with hundreds of trajectories per task.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"articulated object with a hinge joint connecting the frame to the door and a slider joint allowing the oven rack to move in and out. One of the main challenges of contact-rich articulated object manipulation relies on the expert data acquisition challenge. Current approaches intend to acquire reliable ex- pert manipulation demonstrations through a teleoperated data acquisition system [2], [3], or simulated data collection [4]. 
However, real-world data collection is time-consuming and Fig. 1.Plug-and-play QDTraj exploration algorithm.Given an articu- lated object URDF and an activation task, QDtraj generates sets of diverse trajectory primitives to achieve the task. Trajectory primitives generated in Genesis parallelized simulation are deployable real world set-up."},{"citing_arxiv_id":"2604.22227","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies","primary_cat":"cs.CY","submitted_at":"2026-04-24T05:02:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"so Picard-Lindelöf yields local existence and uniqueness. It remains to exclude finite-time blow-up. 18 LetV(x) = 1 2∥x∥2. Then ˙V=x ⊤b−x⊤Ax− d∑ i=1 νix4 i.(23) By Cauchy-Schwarz and Young's inequality, x⊤b≤∥x∥∥b∥≤1 2∥x∥2 + 1 2∥b∥2.(24) Also,−x⊤Ax≤∥A∥∥x∥2. Writingνmin = miniνi and using∑ ix4 i≥d−1∥x∥4, we obtain ˙V≤1 2∥b∥2 + (1 2 +∥A∥ ) ∥x∥2−νmin d ∥x∥4.(25) The quartic term dominates for large∥x∥, so ˙V < 0outside a sufficiently large ball. Hence trajectories cannot escape to infinity in finite time, and the local solution extends globally. 
Theorem 5(Monotonic ascent of the coexistence functional).Along every solution of(16), d dtJ(x(t)) =∥∇J(x(t))∥2≥0.(26) ThereforeJis nondecreasing along trajectories."},{"citing_arxiv_id":"2604.20100","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy","primary_cat":"cs.RO","submitted_at":"2026-04-22T01:51:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"Real-world robot demonstrations remain essential for grounding learning in executable behavior under sensing noise, contact uncertainty, and hardware constraints, as reflected in datasets such as AgiBot-World [8], RoboMIND [19] and RoboCOIN [37]. RT-1 [6] shows the value of scaling real-robot interaction data for policy learning, while Open X-Embodiment [28] extends this idea by aggregating robot demonstrations across many embodiments and collection setups. Human videos have also become an important source of supervision, owing to their scalability of collection and alignment with real-world interaction patterns [12, 17, 18, 32]. 
EgoDex [18] further demonstrates that large-scale egocentric human videos can provide transferable priors for manipulation, especially when large-scale"},{"citing_arxiv_id":"2604.19728","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLA Foundry: A Unified Framework for Training Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-21T17:51:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"demonstrations from stationary bimanual manipulation stations described in our previous work LBM [65]. The data mix features 42 tasks in simulation and 361 tasks in the real world; 39 tasks are replicated in both real and simulation with copies of the stations and manipulands. Unlike our previous work we do not train on open-sourced data such as OXE [16] or data collected with a universal manipulation device (UMI) [13]. Further details regarding the dataset, including number of episodes per benchmark task and differences from the dataset of LBM, can be found in Section C.6. 
Unless otherwise noted,Foundry-VLA-1.7Band Foundry-Qwen3VLA-2.1B-MTare trained on a multi-task mixture of both real and simulation data4."},{"citing_arxiv_id":"2604.17800","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning","primary_cat":"cs.RO","submitted_at":"2026-04-20T04:46:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15483","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"${\\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities","primary_cat":"cs.LG","submitted_at":"2026-04-16T19:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[77] OX-Embodiment Collaboration, A Padalkar, A Pooley, A Jain, A Bewley, A Herzog, A Irpan, A Khazatsky, A Rai, A Singh, et al. Open X-Embodiment: Robotic learning datasets and RT-X models.arXiv preprint arXiv:2310.08864, 1(2), 2023. 3 [78] Jonathan Yang, Chelsea Finn, and Dorsa Sadigh. Data analogies enable efficient cross-embodiment transfer. arXiv preprint arXiv:2603.06450, 2026. 3 [79] Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. 
Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. InConference on Robot Learning, 2024. [80] Jonathan Yang, Catherine Glossop, Arjun Bhorkar, Dhruv Shah, Quan Vuong, Chelsea Finn, Dorsa Sadigh, and Sergey Levine. Pushing the limits of cross-"},{"citing_arxiv_id":"2604.13733","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents","primary_cat":"cs.LG","submitted_at":"2026-04-15T11:17:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13001","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios","primary_cat":"cs.RO","submitted_at":"2026-04-14T17:34:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"XRZero-G0 enables 2000-hour robot-free datasets that, when mixed 10:1 with real-robot data, match full real-robot performance at 1/20th the cost and support zero-shot transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11174","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems","primary_cat":"cs.RO","submitted_at":"2026-04-13T08:34:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery 
success, and upgrade safety.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"*Code:https://github.com/s20sc/embodied-gov-bench. 1 arXiv:2604.11174v1 [cs.RO] 13 Apr 2026 1 Introduction Embodied1 AI systems are increasingly evaluated by what they can do: whether they complete tasks, reach goals, manipulate objects, or follow instructions. Across robot learning, vision-language-action models [1, 2], embodied foundation models [3], and modular runtime systems [4, 5], benchmark practice remains dominated by metrics such as success rate, path efficiency, grasp accuracy, or end-task comple- tion [6, 7, 8, 9]. These metrics are important, but they capture only one side of operational reality. They tell us whether an embodied system can succeed, but they say much less about whether it remains"},{"citing_arxiv_id":"2604.10809","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations","primary_cat":"cs.RO","submitted_at":"2026-04-12T20:40:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match teleoperation success rates on five tabletop tasks with 5-8x less collection effort.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tasks range from simple pick and place, pushing, and insertion [41, 108, 93], to more complex long-horizon tasks such as folding laundry, tool use, and washing dishes [75, 12, 16]. However, the performance of these policies is highly de- pendent on the availability and quality of the demonstra- tion data. 
Methods often utilize existing large-scale datasets from teleoperated robot demonstrations [19, 28] or internet videos [20, 32], which are expensive and difficult to collect when scaling to new tasks and environments. This challenge is more evident for domain-specific manipulation tasks, such as agriculture [47], where demonstration data is often limited. Alternatively, methods can rely on collecting new teleoperated robot data [40, 85, 22, 77], which is slow, time-consuming, and"}],"limit":50,"offset":0}