{"total":21,"items":[{"citing_arxiv_id":"2606.29940","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WARP: Whole-Body Retargeting for Learning from Offline Human Demonstrations","primary_cat":"cs.RO","submitted_at":"2026-06-29T08:17:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WARP is an offline retargeting method using a SEW geometric solver to produce consistent whole-body robot trajectories from human demonstrations for zero-shot mobile manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30989","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A study on a Real-Time VR-Based Teleoperation Framework for Manipulator in Dynamic Environment","primary_cat":"cs.RO","submitted_at":"2026-05-29T08:24:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A VR teleoperation framework integrates GPU-accelerated inverse kinematics and trajectory optimization to generate collision-aware joint commands for a 7-DoF manipulator in real time across obstacle-free, static, and moving-obstacle scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29298","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MonoDuo: Using One Robot Arm to Learn Bimanual Policies","primary_cat":"cs.RO","submitted_at":"2026-05-28T03:27:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MonoDuo generates synthetic bimanual demonstrations from single-arm teleoperation plus human collaboration to train policies achieving up to 70% zero-shot success on five manipulation tasks, with 65-70% gains from 25-shot 
finetuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00110","ref_index":169,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling","primary_cat":"cs.CV","submitted_at":"2026-05-27T03:38:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GAM framework uses arc-length parameterization for temporal invariance and schema-affine factorization for geometric invariance to build a covariant action manifold integrated into VLA models for improved generalization from sparse data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12182","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DexTwist: Dexterous Hand Retargeting for Twist Motion via Mixed Reality-based Teleoperation","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:25:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DexTwist detects tripod pinches, estimates the intended screw axis and twist magnitude, then applies real-time joint refinement to track turning progress while stabilizing the robot's tripod geometry.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05925","ref_index":8,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions","primary_cat":"cs.RO","submitted_at":"2026-05-07T09:31:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DexSynRefine couples HOI motion manifold 
flow primitives with task-space residual RL and proprioceptive adaptation to convert human-object interaction data into executable dexterous robot motions, reporting 50-70 point real-world success rate gains over kinematic retargeting on five tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00244","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-30T21:25:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lucid-XR uses XR-headset physics simulation and physics-guided video generation to create synthetic data that trains robot policies transferring zero-shot to unseen real-world manipulation tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"The third component of Lucid-XR is a generative engine that converts the virtual human demonstrations collected in the sim into diverse and realistic-looking multiview image datasets for the robot. 4 Figure 7: (a) Spot robot from Menagerie (b - c) RobotCasa Scenes 3.1 Generating Realistic Images from Virtual Demonstrations. Figure 8 shows our setup for generating realistic-looking images. We follow the LucidSim [11] recipe that starts with a collection of diverse text prompts collected from ChatGPT, and use the semantic mask labels from the physics simulation to control the image generation process. Prompts for generation are sourced en masse from ChatGPT via a meta-prompt (see appendix). Prompts for the background tend to be more complex. 
In alignment with observations made by prior works [11],"},{"citing_arxiv_id":"2604.21351","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learn Weightlessness: Imitate Non-Self-Stabilizing Motions on Humanoid Robot","primary_cat":"cs.RO","submitted_at":"2026-04-23T07:10:05+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14834","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Switch: Learning Agile Skills Switching for Humanoid Robots","primary_cat":"cs.RO","submitted_at":"2026-04-16T10:11:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Switch enables humanoid robots to perform agile, seamless transitions between locomotion skills via a kinematic skill graph, DRL tracking policy, and real-time graph-search scheduler.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13015","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Versatile Humanoid Manipulation with Touch Dreaming","primary_cat":"cs.RO","submitted_at":"2026-04-14T17:54:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HTD, a multimodal transformer policy trained with behavioral cloning and touch dreaming to predict future tactile latents, achieves a 90.9% relative success rate improvement over baselines on five real-world contact-rich humanoid loco-manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"root tracking, joint-space tracking, and body-keypoint or pose tracking, depending on the task and 
operator interface [1], [5], [25], [26]. One line of work improves robustness through decomposition, separating functions such as lower-body stabilization, upper-body tracking, force adaptation, or compliance modulation, as in dual-agent force-adaptive control [27], heterogeneous meta-control over multiple control modes [28], adaptive compliance control [29], and hybrid optimization-and-learning frameworks for dexterous whole-body behaviors [20], [21]. Related systems also combine learned whole-body control with specialized teleoperation hardware or tracking modules for more precise loco-manipulation [30], [31]."},{"citing_arxiv_id":"2604.08534","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration","primary_cat":"cs.RO","submitted_at":"2026-04-09T17:59:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"compromise is not only physically exhausting (severely hindering scalability), but it also yields constrained, suboptimal data that lacks true human-like smoothness. From a perception perspective, existing setups suffer from a similar misalignment with human nature. Current paradigms heavily rely on fixed third-person cameras, which are prone to occlusion, or wrist-mounted cameras [11], [12], which serve as a pragmatic hack to provide local visual feedback. However, wrist cameras are inherently passive; their viewpoint is entirely slaved to the end effectors' trajectory. 
When humans perform complex tasks, we do not rely on passive wrist movements to adjust our perspective. Instead, we possess the ability to actively perceive. We instinctively move our heads to peer"},{"citing_arxiv_id":"2604.07607","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World","primary_cat":"cs.RO","submitted_at":"2026-04-08T21:27:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot objectives.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"t+i ·p H t+i ik i=1 Aligning Human and Robot Data. Recent cross-embodiment work [29, 40, 48, 59] show that co-training benefits from individually normalizing proprioception and actions. To make the normalization robust to outliers, we employ quantile normalization. We map the 1st and 99th percentiles of the feature distribution to the range [−1, 1], following [10, 27]. For a feature tensor $x$, the normalized output $\\hat{x}$ is calculated as $\\hat{x} = 2 \\cdot \\frac{x - q_{0.01}}{q_{0.99} - q_{0.01}} - 1$. To account for varying camera sensors, we perform random image crop and color jittering during training. C. 
Learning Architecture and Algorithm Policy Architecture. To enable joint training across diverse embodiments, we adopt an encoder-decoder architecture with"},{"citing_arxiv_id":"2604.03730","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Multi-View 3D Telepresence System for XR Robot Teleoperation","primary_cat":"cs.RO","submitted_at":"2026-04-04T13:22:12+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A multi-view point cloud VR system with wrist RGB detail outperforms RGB streams and stereo views in robot teleoperation tasks per a 31-participant user study.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tion, especially for tasks that require dexterity and appropriate situational awareness [1]-[3]. Conventional camera-based or monitor-based interfaces limit depth perception and spatial awareness [4]-[6]. Many teleoperation systems that use virtual reality rely on multiple RGB camera streams, which provide high-resolution visual detail but lack the depth cues necessary for accurate spatial reasoning [7], [8]. Conversely, point-cloud views have the potential to support better spatial understanding [9]-[11]. Yet, the potential advantages of point-cloud visualization interfaces have never been investigated by systematic user studies. Moreover, they often lack the fine visual detail required for grasping and contact-rich manipulation. 
As a result, operators frequently face a trade-off"},{"citing_arxiv_id":"2603.07672","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Low-Cost Teleoperation Extension for Mobile Manipulators","primary_cat":"cs.RO","submitted_at":"2026-03-08T15:09:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An open-source teleoperation framework enables intuitive whole-body control of mobile manipulators using commodity smartphone, leader arms, and foot pedals instead of costly VR equipment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.16870","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Multimodal Data Collection Framework for Dialogue-Driven Assistive Robotics to Clarify Ambiguities: A Wizard-of-Oz Pilot Study","primary_cat":"cs.RO","submitted_at":"2026-01-23T16:22:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A two-room Wizard-of-Oz pilot collected 53 multimodal trials from five users to capture dialogue ambiguities for training ambiguity-aware assistive robot controllers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01773","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IGen: Scalable Data Generation for Robot Learning from Open-World Images","primary_cat":"cs.RO","submitted_at":"2025-12-01T15:15:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM 
reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.04831","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning","primary_cat":"cs.RO","submitted_at":"2025-11-06T21:43:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Isaac Lab is a unified GPU-native platform combining high-fidelity physics, photorealistic rendering, multi-frequency sensors, domain randomization, and learning pipelines for scalable multi-modal robot policy training.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[16] Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. Learning by cheating. In Conference on Robot Learning (CoRL), pages 66-75. PMLR, 2020. 25 [17] An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation. In Robotics: Science and Systems (RSS), 2025. 32 [18] Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. arXiv preprint arXiv:2407.01512, 2024. 16 [19] Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016-2023. 
2 [20] Ioannis Dadiotis, Mayank Mittal, Nikos Tsagarakis, and Marco Hutter."},{"citing_arxiv_id":"2507.12440","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos","primary_cat":"cs.RO","submitted_at":"2025-07-16T17:27:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoVLA pretrains VLA models on egocentric human videos, retargets predicted actions to robots via IK, and fine-tunes on few robot demos to improve bimanual manipulation performance on a new simulation benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.18780","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DreamPolicy: A Unified World-model Policy for Scalable Humanoid Locomotion","primary_cat":"cs.RO","submitted_at":"2025-05-24T16:33:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamPolicy integrates an autoregressive diffusion world model with policy learning to produce a single scalable policy that generalizes to unseen composite terrains for humanoid locomotion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.07813","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies","primary_cat":"cs.RO","submitted_at":"2025-05-12T17:59:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DexWild co-trains dexterous robot policies on in-the-wild human hand interactions recorded with a low-cost system and limited robot data, achieving 68.5% success in unseen environments and 5.8x 
better cross-embodiment generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.09747","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-01-16T18:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Daniel Tompkins, Zhuo Chen, and Furu Wei. Beats: Audio pre-training with acoustic tokenizers. arXiv preprint arXiv:2212.09058, 2022. [13] An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged Robot Vision-Language-Action Model for Navigation. arXiv preprint arXiv:2412.04453, 2024. [14] Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. arXiv preprint arXiv:2407.01512, 2024. [15] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and"}],"limit":50,"offset":0}