Learning Semantic Atomic Skills for Multi-Task Robotic Manipulation

Jingya Wang; Shijie Wu; Weiqing Wang; Ye Shi; Yihang Zhu

arxiv: 2512.18368 · v2 · pith:BQNIZHTJnew · submitted 2025-12-20 · 💻 cs.RO


classification 💻 cs.RO

keywords: skill, atomic, semantic, demonstrations, keypose, learning, skills, across


Scaling imitation learning to diverse multi-task robot manipulation remains challenging due to suboptimal demonstrations, behavioral multi-modality, and destructive interference across tasks. While skill-based methods offer a promising direction by decomposing behaviors into reusable abstractions, existing approaches often learn skills that are either biased toward linguistic structure or lack semantic alignment across tasks, limiting generalization. In this work, we propose AtomSkill, a novel framework that learns a semantically aligned Atomic Skill Space from demonstrations and enables robust long-horizon execution through keypose imagination. Our method introduces: (1) semantic contrastive skill alignment, which partitions demonstrations into variable-length atomic skills and employs a contrastive objective to jointly enforce semantic consistency and temporal coherence, yielding a compact and reusable skill library; and (2) action decoding with keypose imagining, where the policy predicts both a skill's terminal keypose and immediate actions, thereby supporting progress-aware skill transitions. During inference, an atomic skill diffusion sampler generates plausible skill sequences, while predicted keyposes autonomously trigger smooth skill chaining. Extensive experiments in simulation and real-world settings show that AtomSkill consistently outperforms state-of-the-art imitation learning and skill-based baselines. Project page: https://atom-skill.github.io.
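
The keypose-triggered skill chaining described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that the policy predicts a skill's terminal keypose alongside immediate actions and that predicted keyposes trigger skill transitions. The distance threshold `tol`, the fixed skill list (standing in for the output of the diffusion sampler), and the names `run_skill_chain`, `toy_policy`, `toy_step`, and `TARGETS` are all hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Pose = Tuple[float, ...]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two poses."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def run_skill_chain(
    skills: List[str],
    policy: Callable[[str, Pose], Tuple[Pose, Pose]],
    step: Callable[[Pose, Pose], Pose],
    pose: Pose,
    tol: float = 0.05,      # assumed transition threshold, not from the paper
    max_steps: int = 100,   # safety cap per skill
) -> List[Tuple[str, Pose]]:
    """Execute skills in order, switching when the predicted keypose is reached."""
    log = []
    for skill in skills:
        for _ in range(max_steps):
            # The policy returns both the skill's terminal keypose and the next action.
            keypose, action = policy(skill, pose)
            if distance(pose, keypose) <= tol:
                break  # keypose reached: hand over to the next skill
            pose = step(pose, action)
        log.append((skill, pose))
    return log


# Hypothetical toy setup: each skill has a fixed 2D target keypose.
TARGETS = {"reach": (1.0, 0.0), "lift": (1.0, 1.0)}


def toy_policy(skill: str, pose: Pose) -> Tuple[Pose, Pose]:
    """Predict the keypose and a bounded step (at most 0.1) toward it."""
    keypose = TARGETS[skill]
    d = distance(pose, keypose)
    scale = 0.0 if d == 0 else min(0.1, d) / d
    action = tuple((k - p) * scale for k, p in zip(keypose, pose))
    return keypose, action


def toy_step(pose: Pose, action: Pose) -> Pose:
    """Apply an action as a simple position delta."""
    return tuple(p + a for p, a in zip(pose, action))


log = run_skill_chain(["reach", "lift"], toy_policy, toy_step, (0.0, 0.0))
for skill, final_pose in log:
    print(skill, final_pose)
```

In this toy version the transition test is a plain distance check against the predicted keypose; the paper's actual progress-aware criterion may differ.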


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision
cs.RO 2026-06 unverdicted novelty 6.0

StaKe adds lightweight auxiliary heads for manipulation stage identification and next-gripper-transition keyframe prediction to VLA fine-tuning, reporting relative success rate gains of 14% in bimanual simulation and ...

ARP: Enhancing Quantized Skill Abstractions via Visual Alignment and Iterative Refinement for Robotic Manipulation
cs.RO 2026-06 unverdicted novelty 6.0

ARP enhances quantized skill abstractions in imitation learning by coupling visual grounding via contrastive alignment with execution refinement via IRH, reporting SOTA results on LIBERO, Meta-World, and real-robot tasks.