Point policy: Unifying observations and actions with key points for robot manipulation

2025 · arXiv 2502.20391

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Human Universal Grasping

cs.RO · 2026-06-15 · unverdicted · novelty 7.0

HUG trains a flow-matching model on a new 1M-frame egocentric human grasp dataset to generate retargetable grasps from single RGB-D images, beating baselines by 23-34% on a new 90-object benchmark.

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

MaskWAM unifies mask prompting and prediction in world-action models via Mixture of Transformers to improve robotic policy generalization on language-ambiguous tasks.

Hand-centric Human-to-Robot Trajectory Transfer from Video Demonstrations via Open-World Contact Localization

cs.RO · 2026-06-09 · unverdicted · novelty 6.0

HOWTransfer recovers 3D hand motion from video, localizes contact intervals via hand-object cues, generates multi-modal grasp hypotheses, and edits trajectories to produce diverse robot-executable motions achieving 86% success.

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

GHOST improves generalization in robot manipulation via hierarchical factorization into 3D sub-goal prediction from RGB-D views and a goal-conditioned low-level controller, enabling human video integration without action retargeting.

KPGrasp: Scalable Keypoint Flow Matching for Dexterous Grasp Generation

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

KPGrasp is a scalable Transformer flow-matching model using 3D hand keypoints that achieves 76.3% success on Dexonomy (47.4% improvement) and best average on DexGrasp Anything without contact losses or test-time refinement.

LACE: Latent Visual Representation for Cross-Embodiment Learning

cs.RO · 2026-05-16 · unverdicted · novelty 6.0

LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

cs.RO · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.

WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

cs.RO · 2026-04-12 · unverdicted · novelty 6.0

WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match teleoperation success rates on five tabletop tasks with 5-8x less collection effort.

X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

cs.RO · 2025-11-06 · unverdicted · novelty 6.0

X-Diffusion adapts Ambient Diffusion to selectively train on noised human actions for cross-embodiment robot policies, yielding 16% higher average success rates than naive co-training or manual filtering across five real-world manipulation tasks.

LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

cs.RO · 2026-06-10 · unverdicted · novelty 5.0

LUCID learns embodiment-agnostic intent models from unstructured human videos to train dexterous robot policies in simulation, enabling zero-shot transfer on real-world tasks like stirring and wiping.

On the Generalization Capabilities, Design Choices and Limitations of Keypoint Imitation Learning

cs.RO · 2026-05-26 · conditional · novelty 5.0

KIL using foundation model keypoints reaches 75% success on five manipulation tasks, beating RGB (47%) but matching S2-diffusion (73%), with generalization tests on unseen objects via over 2000 real-world rollouts.

citing papers explorer

Showing 10 of 10 citing papers after filters.

Human Universal Grasping cs.RO · 2026-06-15 · unverdicted · none · ref 46
MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models cs.CV · 2026-06-11 · unverdicted · none · ref 16
Hand-centric Human-to-Robot Trajectory Transfer from Video Demonstrations via Open-World Contact Localization cs.RO · 2026-06-09 · unverdicted · none · ref 25
GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation cs.RO · 2026-06-08 · unverdicted · none · ref 11
KPGrasp: Scalable Keypoint Flow Matching for Dexterous Grasp Generation cs.RO · 2026-06-08 · unverdicted · none · ref 34
LACE: Latent Visual Representation for Cross-Embodiment Learning cs.RO · 2026-05-16 · unverdicted · none · ref 11
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization cs.RO · 2026-05-12 · unverdicted · none · ref 32 · 2 links
WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations cs.RO · 2026-04-12 · unverdicted · none · ref 34
X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations cs.RO · 2025-11-06 · unverdicted · none · ref 4
LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition cs.RO · 2026-06-10 · unverdicted · none · ref 28
