STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation

Alexandre Chapin (LIRIS); Emmanuel Dellandréa (LIRIS); Liming Chen (LIRIS)

arxiv: 2601.20381 · v3 · pith:O6BBJRI3new · submitted 2026-01-28 · 💻 cs.RO

classification 💻 cs.RO

keywords: manipulation, object-centric, storm, foundation, robotic, features, frozen, representation

Visual foundation models provide strong perceptual features for robotics, but their dense representations lack explicit object-level structure, limiting robustness and controllability in manipulation tasks. We propose STORM (Slot-based Task-aware Object-centric Representation for robotic Manipulation), a lightweight object-centric adaptation module that augments frozen visual foundation models with a small set of task-aware slots for robotic manipulation. Rather than fully tuning large backbones on the task, STORM employs an efficient two-stage training strategy: a few layers of object-centric representation are first trained on top of the frozen backbone through visual-semantic pretraining using language embeddings, then jointly adapted with a downstream manipulation policy for task alignment. This staged learning prevents degenerate slot formation and preserves semantic consistency while aligning perception with task objectives. Experiments on object discovery benchmarks and robotic manipulation tasks show that STORM improves control performance and generalization to visual shifts (distractors, textures, lighting) compared to directly using frozen or fine-tuned foundation model features, or existing object-centric representations. STORM serves not only as an efficient mechanism for refining generic foundation model features, but also as a novel way of injecting beneficial structural and semantic bias into policy learning.
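
The two-stage strategy described in the abstract can be summarized as a schedule of which components receive gradient updates at each stage. The following is only a minimal sketch, not the authors' implementation: the module names (`backbone`, `slot_layers`, `policy`) and stage names are hypothetical, and it assumes that the policy is not trained during the first stage and that the backbone stays frozen throughout, which is how the abstract reads but is not spelled out there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    objective: str
    trainable: frozenset  # names of modules updated during this stage


# Hypothetical module names, chosen for illustration only.
MODULES = ("backbone", "slot_layers", "policy")

# Assumption: the backbone is never in a trainable set (it stays frozen),
# and the policy only enters in the second stage.
STAGES = (
    Stage(
        name="pretrain",
        objective="visual-semantic alignment using language embeddings",
        trainable=frozenset({"slot_layers"}),
    ),
    Stage(
        name="adapt",
        objective="downstream manipulation objective",
        trainable=frozenset({"slot_layers", "policy"}),
    ),
)


def trainable_flags(stage):
    """Map each module name to whether its parameters are updated in `stage`."""
    return {module: module in stage.trainable for module in MODULES}


# Per-stage table of trainable flags, keyed by stage name.
schedule = {stage.name: trainable_flags(stage) for stage in STAGES}
```

In a real training loop, flags like these would typically drive which parameter groups are passed to the optimizer at each stage; the point of the sketch is only that the slot layers are the one component trained in both stages.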

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
cs.RO 2026-05 unverdicted novelty 6.0

GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.