{"total":15,"items":[{"citing_arxiv_id":"2606.28133","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots","primary_cat":"cs.RO","submitted_at":"2026-06-26T14:34:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18772","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HALOMI: Learning Humanoid Loco-Manipulation with Active Perception from Human Demonstrations","primary_cat":"cs.RO","submitted_at":"2026-06-17T07:33:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HALOMI extends UMI with egocentric sensing and a manifold-constrained controller plus alignment adaptations to learn loco-manipulation on humanoids from human demos, reporting 85% average success on three real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.16776","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"JoyAI-Sim: A Simulation-Enabled Interconversion Toolchain for the Embodied Data Pyramid","primary_cat":"cs.RO","submitted_at":"2026-06-15T14:21:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"JoyAI-Sim provides bidirectional Robot-Simulation-Human pathways for aligned model evaluation and data generation in robotics using the JoySim simulator as an evaluation 
layer and physical consistency filter.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12995","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training","primary_cat":"cs.RO","submitted_at":"2026-06-11T07:31:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GenHOI reconstructs robot-object scenes, generates task videos from language and first-frame images, extracts contact constraints, optimizes reference trajectories, and executes them via closed-loop control for zero-shot humanoid-object interaction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09215","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-08T08:50:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08548","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OASIS: From Simulation Data Collection to Real-World Humanoid 
Loco-Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-07T10:01:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OASIS generates scalable simulation data for humanoid loco-manipulation via 3D generative asset reconstruction and domain randomization, yielding a policy with higher zero-shot real-world success than real-robot teleoperation data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06194","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ActiveMimic: Egocentric Video Pretraining with Active Perception","primary_cat":"cs.RO","submitted_at":"2026-06-04T14:01:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ActiveMimic pretrains on egocentric human video by recovering and modeling active camera motion as viewpoint actions, matching robot-data pretraining performance on real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01458","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World","primary_cat":"cs.RO","submitted_at":"2026-05-31T21:36:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LEGS shows synthetic data from a 3DGS-mesh hybrid simulator trains VLA policies for humanoid pick-and-place that match or exceed human teleoperation performance across multiple backbones and tasks while enabling low-cost robustness to appearance 
shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16797","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices","primary_cat":"cs.CV","submitted_at":"2026-05-16T03:59:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoKit is a new toolkit and accessory set that unifies egocentric video collection with wrist views across heterogeneous consumer devices using a consistent interface and log format.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03452","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-05T07:35:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"collection, it reduces hardware risk and operator burden. By synchronizing whole-body motion, wrist-view visual observations, and gripper commands, it produces structured demonstrations that can be directly used for visuomotor policy learning. B. High Level: Diffusion Policy We instantiate the high-level policy as a whole-body extension of Diffusion Policy [25] that operates on a sparse task-space rather than the full joint configuration. 
At each decision step t, the policy predicts a receding-horizon action chunk a_{t+1:t+H} of length H=48 over the same five keypoints used during data collection - the pelvis, the left and right TCPs, and the left and right feet - accompanied by two gripper-width commands. The pelvis encodes the global root motion,"},{"citing_arxiv_id":"2604.27711","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control","primary_cat":"cs.RO","submitted_at":"2026-04-30T10:57:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23570","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks","primary_cat":"cs.RO","submitted_at":"2026-04-26T07:21:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstrained environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07993","ref_index":39,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body 
Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-09T09:01:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HEX introduces a state-centric framework with humanoid-aligned representations and mixture-of-experts proprioceptive prediction for coordinated whole-body control on bipedal humanoids.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[37] Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J Yoon, Ryan Hoque, Lars Paulsen, et al. Humanoid policy human policy. In Conference on Robot Learning, pages 2888-2906. PMLR, 2025. [38] Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell, Jitendra Malik, and Koushil Sreenath. Real-world humanoid locomotion with reinforcement learning. Science Robotics, 9(89):eadi9579, 2024. [39] Modi Shi, Shijia Peng, Jin Chen, Haoran Jiang, Yinghui Li, Di Huang, Ping Luo, Hongyang Li, and Li Chen. Egohumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration. arXiv preprint arXiv:2602.10106, 2026. 
[40] Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin"},{"citing_arxiv_id":"2604.07335","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TAMEn: Tactile-Aware Manipulation Engine for Closed-Loop Data Collection in Contact-Rich Tasks","primary_cat":"cs.RO","submitted_at":"2026-04-08T17:49:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TAMEn supplies a cross-morphology wearable interface and pyramid-structured visuo-tactile data regime that raises bimanual manipulation success rates from 34% to 75% via closed-loop collection.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In these scenarios, success often depends on subtle contact events, such as contact onset, excessive loading, and incipient slip [4], which are difficult to infer from vision alone [5]-[7]. Tactile sensing therefore plays a critical role in enabling robust manipulation [8], [9]. However, unlike visual data, which can be collected at scale from internet videos or human recordings [10]-[12], tactile data must be generated through direct physical interaction. These challenges motivate the development of hardware interfaces for efficient, high-quality, and scalable visuo-tactile data collection. Existing data collection pipelines remain limited in both interaction fidelity and hardware adaptability. 
Most teleoperation systems rely primarily on visual feedback and thus"},{"citing_arxiv_id":"2603.11755","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints","primary_cat":"cs.CV","submitted_at":"2026-03-12T10:02:23+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}