Learning Semantic Atomic Skills for Multi-Task Robotic Manipulation

· 2025 · cs.RO · arXiv 2512.18368

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Scaling imitation learning to diverse multi-task robot manipulation remains challenging due to suboptimal demonstrations, behavioral multi-modality, and destructive interference across tasks. While skill-based methods offer a promising direction by decomposing behaviors into reusable abstractions, existing approaches often learn skills that are either biased toward linguistic structure or lack semantic alignment across tasks, limiting generalization. In this work, we propose AtomSkill, a novel framework that learns a semantically aligned Atomic Skill Space from demonstrations and enables robust long-horizon execution through keypose imagination. Our method introduces: (1) semantic contrastive skill alignment, which partitions demonstrations into variable-length atomic skills and employs a contrastive objective to jointly enforce semantic consistency and temporal coherence, yielding a compact and reusable skill library; and (2) action decoding with keypose imagining, where the policy predicts both a skill's terminal keypose and immediate actions, thereby supporting progress-aware skill transitions. During inference, an atomic skill diffusion sampler generates plausible skill sequences, while predicted keyposes autonomously trigger smooth skill chaining. Extensive experiments in simulation and real-world settings show that AtomSkill consistently outperforms state-of-the-art imitation learning and skill-based baselines. Project page: https://atom-skill.github.io.

representative citing papers

Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision

cs.RO · 2026-06-25 · unverdicted · novelty 6.0

StaKe adds lightweight auxiliary heads for manipulation stage identification and next-gripper-transition keyframe prediction to VLA fine-tuning, reporting relative success rate gains of 14% in bimanual simulation and 56% on single-arm real-robot tasks.

ARP: Enhancing Quantized Skill Abstractions via Visual Alignment and Iterative Refinement for Robotic Manipulation

cs.RO · 2026-06-21 · unverdicted · novelty 6.0

ARP enhances quantized skill abstractions in imitation learning by coupling visual grounding via contrastive alignment with execution refinement via IRH, reporting SOTA results on LIBERO, Meta-World, and real-robot tasks.

citing papers explorer

Showing 2 of 2 citing papers.

Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision cs.RO · 2026-06-25 · unverdicted · none · ref 38 · internal anchor
StaKe adds lightweight auxiliary heads for manipulation stage identification and next-gripper-transition keyframe prediction to VLA fine-tuning, reporting relative success rate gains of 14% in bimanual simulation and 56% on single-arm real-robot tasks.
ARP: Enhancing Quantized Skill Abstractions via Visual Alignment and Iterative Refinement for Robotic Manipulation cs.RO · 2026-06-21 · unverdicted · none · ref 32 · internal anchor
ARP enhances quantized skill abstractions in imitation learning by coupling visual grounding via contrastive alignment with execution refinement via IRH, reporting SOTA results on LIBERO, Meta-World, and real-robot tasks.

Learning Semantic Atomic Skills for Multi-Task Robotic Manipulation

fields

years

verdicts

representative citing papers

citing papers explorer