arxiv: 2509.19658 · v2 · pith:B6IDQXZ3new · submitted 2025-09-24 · 💻 cs.RO · cs.AI

RoboSSM: Scalable In-context Imitation Learning via State-Space Models

keywords icil · robossm · imitation · in-context · learning · prompts · scalable · tasks

In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling prompts longer than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSMs). Specifically, RoboSSM replaces Transformers with Longhorn -- a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. Through diverse experiments on the LIBERO benchmark, we demonstrate the effectiveness of applying SSMs to ICIL, achieving better generalization to both unseen and long-horizon tasks than Transformer-based ICIL methods by handling longer contexts at test time. These results show for the first time that SSMs are an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.

This paper has not been read by Pith yet.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DSSP: Diffusion State Space Policy with Full-History Encoding

    cs.RO 2026-05 conditional novelty 7.0

    DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size...

  2. From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

    cs.CV 2026-05 unverdicted novelty 5.0

    BehaviorVLA introduces a symmetric encoder-decoder architecture with causal Mamba and phase conditioning to learn unified long-horizon behavioral representations for improved generalization in VLA models.

  3. From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

    cs.CV 2026-05 unverdicted novelty 4.0

    BehaviorVLA learns long-horizon behavioral representations via a causal Mamba encoder and a phase-conditioned decoder, reporting SOTA results of 58% on RoboTwin 2.0, 98% on LIBERO, 4.36 on CALVIN, and matching OpenVLA-OFT...