STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation
Visual foundation models provide strong perceptual features for robotics, but their dense representations lack explicit object-level structure, limiting robustness and controllability in manipulation tasks. We propose STORM (Slot-based Task-aware Object-centric Representation for robotic Manipulation), a lightweight object-centric adaptation module that augments frozen visual foundation models with a small set of task-aware slots for robotic manipulation. Rather than fully tuning large backbones on the task, STORM employs an efficient two-stage training strategy: a few layers of object-centric representation are first trained on top of the frozen backbone through visual-semantic pretraining using language embeddings, then jointly adapted with a downstream manipulation policy for task alignment. This staged learning prevents degenerate slot formation and preserves semantic consistency while aligning perception with task objectives. Experiments on object discovery benchmarks and robotic manipulation tasks show that STORM improves control performance and generalization to visual shifts (distractors, textures, lighting) compared to directly using frozen or fine-tuned foundation model features, or existing object-centric representations. STORM serves not only as an efficient mechanism for refining generic foundation model features, but also as a novel way of injecting beneficial structural and semantic bias into policy learning.
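To make the idea of "a small set of slots on top of frozen features" concrete, here is a minimal sketch of a generic slot-attention-style update in plain Python. It is an illustration under stated assumptions, not STORM's architecture: the abstract does not specify the slot module, so the learned projections, recurrent update, language-embedding supervision, and policy head are all omitted, and the names `slot_attention_step` and `run_slots` are hypothetical. The toy `features` list stands in for patch features from a frozen backbone.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_attention_step(slots, features):
    """One simplified slot-attention update (assumption: no learned weights).

    slots:    list of K vectors of dimension D
    features: list of N frozen patch features of dimension D
    Returns (new_slots, attn), where attn[n][k] is feature n's weight on slot k.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # Softmax over slots for each feature: slots compete for every patch.
    attn = [softmax([scale * dot(f, s) for s in slots]) for f in features]
    new_slots = []
    for k in range(len(slots)):
        weights = [attn[n][k] for n in range(len(features))]
        total = sum(weights) + 1e-8
        # Each slot becomes the weighted mean of the features it attends to.
        new_slots.append([
            sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)
        ])
    return new_slots, attn


def run_slots(features, num_slots=2, iters=3, seed=0):
    """Initialise slots randomly and refine them for a few iterations."""
    rng = random.Random(seed)
    dim = len(features[0])
    slots = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]
    attn = []
    for _ in range(iters):
        slots, attn = slot_attention_step(slots, features)
    return slots, attn


# Toy stand-in for frozen backbone features: four 2-D patch vectors.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
slots, attn = run_slots(features)
```

In this sketch only the slots change while `features` stay fixed, which mirrors the frozen-backbone setting described above. The two training stages themselves (visual-semantic pretraining, then joint adaptation with the policy) would act on learned parameters that this sketch deliberately leaves out.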
Forward citations
Cited by 2 Pith papers
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
  OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
- GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
  GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.