
arxiv: 2605.14598 · v2 · pith:UTWOCQGKnew · submitted 2026-05-14 · 💻 cs.RO

DSSP: Diffusion State Space Policy with Full-History Encoding

Pith reviewed 2026-05-22 10:14 UTC · model grok-4.3

classification 💻 cs.RO
keywords: diffusion policy, state space model, robot manipulation, imitation learning, full history conditioning, auxiliary dynamics objective, hierarchical conditioning

The pith

DSSP conditions diffusion policies on full robot observation history via state space model compression to resolve long-horizon ambiguities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces DSSP, a diffusion-based imitation learning policy for robot manipulation that conditions actions on the entire history of observations rather than a short recent window. It uses a state space model as a history encoder to compress the full observation stream into a compact context representation. The encoder is trained with a dynamics-aware auxiliary objective to preserve information about future state evolution. This context is fused hierarchically with recent observations to condition a diffusion model for action generation, and the diffusion backbone itself is also built from state space models for consistency and efficiency. Experiments on simulation benchmarks and real-world tasks show state-of-the-art performance with a significantly smaller model size, with the efficiency advantage growing as history length increases.

Core claim

The central claim is that an SSM-based history encoder optimized by a dynamics-aware auxiliary objective, when fused hierarchically with recent states to condition an SSM diffusion backbone, enables efficient full-history conditioning that outperforms short-window baselines and scales better with longer histories while using fewer parameters.

What carries the argument

The dynamics-aware SSM history encoder that compresses the full observation stream into a compact context representation, which is then hierarchically fused with recent observations for diffusion-based action generation.
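
As a rough illustration of the compression idea, the sketch below runs a diagonal linear recurrence over an observation stream and returns a fixed-size state. It is not the paper's encoder: the function name `ssm_compress`, the diagonal parameterisation, and the assumption that observations have the same width as the state are all choices made here to keep the example short.

```python
from typing import List, Sequence


def ssm_compress(observations: Sequence[Sequence[float]],
                 decay: Sequence[float],
                 input_gain: Sequence[float]) -> List[float]:
    """Run the diagonal recurrence h_t = a * h_{t-1} + b * x_t over a history.

    Channel i uses its own decay a[i] and gain b[i]. The returned state has
    len(decay) entries regardless of how many observations were consumed.
    """
    state = [0.0] * len(decay)
    for obs in observations:
        if len(obs) != len(state):
            raise ValueError("observation width must match state width")
        state = [a * h + b * x
                 for a, h, b, x in zip(decay, state, input_gain, obs)]
    return state


if __name__ == "__main__":
    short_history = [[0.1, 0.2]] * 3
    long_history = [[0.1, 0.2]] * 3000
    # Hypothetical decay and gain values, chosen only for the demonstration.
    print(len(ssm_compress(short_history, [0.9, 0.8], [1.0, 1.0])))
    print(len(ssm_compress(long_history, [0.9, 0.8], [1.0, 1.0])))
```

Whether the history holds 3 or 3,000 observations, the returned context has `len(decay)` entries, which is the property the argument relies on.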

If this is right

  • Full-history conditioning resolves history-dependent ambiguities that short observation windows cannot address in long-horizon tasks.
  • The auxiliary objective keeps the compressed context predictive of future states without requiring the full history at inference time.
  • Using SSMs for both the history encoder and the diffusion backbone maintains architectural consistency while reducing model size and GPU memory.
  • Performance remains strong or improves as history length grows without proportional increases in parameter count.
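
A back-of-the-envelope count shows why a recurrent summary can keep inference-time conditioning memory flat as the history grows. The sketch counts stored floats only; it says nothing about parameter counts or measured GPU memory, and the dimensions used in the demonstration are hypothetical.

```python
def window_floats(window: int, feature_dim: int) -> int:
    """Floats held when conditioning on the last `window` observations."""
    return window * feature_dim


def full_buffer_floats(history_len: int, feature_dim: int) -> int:
    """Floats held when every past observation is kept as a separate token."""
    return history_len * feature_dim


def recurrent_state_floats(state_dim: int) -> int:
    """Floats held by a recurrent summary; independent of history length."""
    return state_dim


if __name__ == "__main__":
    # feature_dim=64 and state_dim=256 are placeholder sizes for illustration.
    for history_len in (10, 100, 1000):
        print(history_len,
              full_buffer_floats(history_len, feature_dim=64),
              recurrent_state_floats(state_dim=256))
```

The buffered count grows linearly with the history length, while the recurrent state stays at `state_dim`.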

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hierarchical compression approach could be tested with other sequence models such as transformers to see if the same efficiency pattern holds.
  • Resource-constrained robots might benefit from deploying these smaller models on tasks that previously required larger history-aware policies.
  • The method suggests exploring similar auxiliary objectives for compressing other sensor streams like vision or tactile data in manipulation.

Load-bearing premise

The dynamics-aware auxiliary training objective ensures the compressed context representation from the SSM history encoder preserves critical information regarding future state evolution.
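
One common way to write such an objective is as a weighted sum of the diffusion loss and a next-state prediction term. The form below is an assumption for illustration only: this summary does not give the paper's exact loss, and the symbols $c_t$ (context), $g_\phi$ (prediction head), $s_{t+1}$ (next state), and $\lambda$ (loss weighting) are introduced here.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{diff}} + \lambda \, \mathcal{L}_{\mathrm{dyn}},
\qquad
\mathcal{L}_{\mathrm{dyn}} = \mathbb{E}_t \left[ \left\| g_\phi(c_t) - s_{t+1} \right\|_2^2 \right]
```

Under this reading, the premise amounts to saying that a small $\mathcal{L}_{\mathrm{dyn}}$ implies $c_t$ retains what is needed to predict how the state evolves.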

What would settle it

Measure success rates on long-horizon manipulation tasks with increasing history lengths when the auxiliary objective is ablated; if the performance gap over short-window baselines disappears or reverses, the central claim would be falsified.
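
The comparison described above reduces to a per-history-length gap in success rate between a policy variant and a short-window baseline. A minimal sketch of that bookkeeping follows; the function names are hypothetical, and the episode outcomes in the demonstration are placeholders, not results from the paper.

```python
from typing import Dict, List, Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful episodes."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(outcomes) / len(outcomes)


def gap_by_history_length(policy: Dict[int, Sequence[bool]],
                          baseline: Dict[int, Sequence[bool]]) -> Dict[int, float]:
    """Success-rate gap (policy minus baseline) for each history length."""
    return {length: success_rate(policy[length]) - success_rate(baseline[length])
            for length in sorted(policy)}


def lengths_where_gap_closed(gaps: Dict[int, float]) -> List[int]:
    """History lengths at which the gap has disappeared or reversed."""
    return [length for length, gap in gaps.items() if gap <= 0.0]


if __name__ == "__main__":
    # Placeholder outcomes keyed by history length; not measured data.
    policy = {10: [True, True, False, True], 100: [True, False, False, False]}
    baseline = {10: [True, False, False, True], 100: [True, True, False, False]}
    gaps = gap_by_history_length(policy, baseline)
    print(gaps, lengths_where_gap_closed(gaps))
```

In practice each gap would also need an uncertainty estimate across seeds before a closed gap is read as evidence either way.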

Figures

Figures reproduced from arXiv: 2605.14598 by Han Fang, Jianshu Hu, Shujia Li, Xiao Li, Yize Huang, Yunpeng Jiang, Yutong Ban, Zhiyuan Guan.

Figure 1: The proposed DSSP leverages full-history context to resolve visual aliasing when history-blind baselines lose track of task progress, enabling consistent execution in long-horizon tasks. DSSP achieves superior success rates across both simulation tasks and real-world experiments.
Figure 2: Overview of DSSP. DSSP summarizes past multi-modal observations into a compact context token using a state-space history encoder. A dynamics-aware auxiliary loss encourages this token to retain historical information predictive of future state evolution. The learned context token is then combined with recent state tokens as a hierarchical prefix condition for a state-space diffusion denoiser to generate fu…
Figure 3: Overview of Experimental Environments. The figure summarizes representative environments from both simulation and real-world experiments. In each row, the left three panels visualize the RoboTwin tasks, the middle three columns present representative MetaWorld tasks, the next column shows Adroit tasks, and the rightmost column shows our real-world tasks.
Original abstract

Diffusion-based imitation learning has shown strong promise for robot manipulation. However, most existing policies condition only on the current observation or a short window of recent observations, limiting their ability to resolve history-dependent ambiguities in long-horizon tasks. To address this, we introduce DSSP, a history-conditioned Diffusion State Space Policy that enables efficient, full-history conditioning for robot manipulation. Leveraging the continuous sequence modeling properties of State Space Models (SSMs), our history encoder effectively compresses the entire observation stream into a compact context representation. To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective. This high-level context representation is then seamlessly fused with recent state observations to form a hierarchical conditioning mechanism for action generation. Furthermore, to maintain architectural consistency and minimize GPU memory overhead, we also instantiate the diffusion backbone itself using an SSM. Extensive experiments across simulation benchmarks and real-world manipulation tasks show that DSSP achieves state-of-the-art performance with a significantly smaller model size, demonstrating superior efficiency of the hierarchical conditioning in capturing crucial information as the history length increases.
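
The hierarchical conditioning can be pictured as a prefix sequence: one context token summarising the full history, followed by the tokens for the most recent observations. The sketch below assembles such a prefix with plain lists. `build_prefix` and `recent_k` are hypothetical names, and the way the paper actually fuses the two levels is not specified in this summary.

```python
from typing import List, Sequence


def build_prefix(context_token: Sequence[float],
                 observation_tokens: Sequence[Sequence[float]],
                 recent_k: int) -> List[List[float]]:
    """Return [context_token, last recent_k observation tokens] as one prefix."""
    if recent_k < 0:
        raise ValueError("recent_k must be non-negative")
    # Slicing with [-0:] would return the whole list, so handle zero explicitly.
    recent = list(observation_tokens[-recent_k:]) if recent_k > 0 else []
    return [list(context_token)] + [list(token) for token in recent]


if __name__ == "__main__":
    tokens = [[1.0], [2.0], [3.0]]
    print(build_prefix([9.0], tokens, recent_k=2))
```

The prefix length is `1 + recent_k` however long the underlying history is, so the denoiser's conditioning input stays a fixed size.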

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DSSP, a Diffusion State Space Policy for robot manipulation that conditions on full observation history via an SSM-based encoder. The encoder compresses the entire history into a compact context vector, trained with a dynamics-aware auxiliary objective to preserve future state evolution information. This context is hierarchically fused with recent observations to condition a diffusion policy (also SSM-based) for action generation. Experiments across simulation benchmarks and real-world tasks claim state-of-the-art performance with significantly smaller model sizes, with gains attributed to superior efficiency of the hierarchical conditioning as history length increases.

Significance. If the empirical claims hold, the work offers a practical advance in efficient long-horizon imitation learning by showing that SSM-based full-history encoding can outperform short-window baselines while reducing model size. The combination of dynamics-aware auxiliary training and hierarchical conditioning provides a concrete mechanism for resolving history-dependent ambiguities in manipulation, with potential applicability to other sequential decision tasks where memory efficiency matters.

major comments (2)
  1. [Abstract and §3 (Method)] The central claim that the dynamics-aware auxiliary training objective ensures the compressed context representation from the SSM history encoder preserves critical information regarding future state evolution lacks direct quantitative validation. No next-state prediction error, mutual information, or reconstruction metrics between the context vector and future trajectories are reported to confirm that the auxiliary loss actually achieves the asserted preservation; observed performance gains could therefore be attributable to the SSM diffusion backbone or fusion mechanism instead.
  2. [§4 (Experiments)] The SOTA claims and the assertion of superior efficiency as history length increases rest on comparisons whose exact baselines, metrics, statistical significance tests, and ablation controls are not fully detailed in the provided abstract and summary. Without these, it is difficult to assess whether the hierarchical conditioning is the load-bearing factor or whether gains are driven by other architectural choices.
minor comments (2)
  1. [§3] Notation for the context representation dimension and the auxiliary loss weighting should be introduced explicitly with symbols rather than descriptive phrases to improve reproducibility.
  2. [§4] Figure captions for the architecture diagram and history-length scaling plots should include axis labels, legend details, and error bars to clarify the efficiency claims.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we intend to make to strengthen the presentation and validation of our claims.

  1. Referee: [Abstract and §3 (Method)] The central claim that the dynamics-aware auxiliary training objective ensures the compressed context representation from the SSM history encoder preserves critical information regarding future state evolution lacks direct quantitative validation. No next-state prediction error, mutual information, or reconstruction metrics between the context vector and future trajectories are reported to confirm that the auxiliary loss actually achieves the asserted preservation; observed performance gains could therefore be attributable to the SSM diffusion backbone or fusion mechanism instead.

    Authors: We agree that direct quantitative validation of the information preserved by the context vector would provide stronger support for the role of the auxiliary objective. The current manuscript relies on downstream task performance to demonstrate the benefits of the dynamics-aware training, but does not report explicit metrics such as next-state prediction error or mutual information between the context and future states. In the revised version, we will add these evaluations, including next-state prediction accuracy computed from the context vector and an ablation comparing performance with and without the auxiliary loss, to isolate its contribution. revision: yes

  2. Referee: [§4 (Experiments)] The SOTA claims and the assertion of superior efficiency as history length increases rest on comparisons whose exact baselines, metrics, statistical significance tests, and ablation controls are not fully detailed in the provided abstract and summary. Without these, it is difficult to assess whether the hierarchical conditioning is the load-bearing factor or whether gains are driven by other architectural choices.

    Authors: We thank the referee for highlighting the need for greater experimental transparency. While the full manuscript contains the relevant experimental details, we acknowledge that the description of baselines, exact metrics, statistical tests, and controls could be expanded for clarity. In the revision, we will update §4 to explicitly enumerate all baseline configurations, report means and standard deviations across seeds with statistical significance tests, and provide additional ablations that isolate the hierarchical fusion mechanism from other architectural elements. revision: yes
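
Reporting of the kind promised here can be sketched with the Python standard library: a mean and sample standard deviation per method across seeds, plus Welch's t statistic for a two-method comparison. The function names are hypothetical, the choice of Welch's test is an assumption (the rebuttal does not name a test), and the numbers in the demonstration are placeholders rather than results.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(success_rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed success rates.

    Requires at least two seeds, since the sample standard deviation is
    undefined for a single value.
    """
    return statistics.mean(success_rates), statistics.stdev(success_rates)


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two groups with possibly unequal variances."""
    standard_error = math.sqrt(statistics.variance(a) / len(a)
                               + statistics.variance(b) / len(b))
    if standard_error == 0.0:
        raise ValueError("both groups have zero variance")
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


if __name__ == "__main__":
    # Placeholder per-seed success rates; not taken from the paper.
    method_a = [0.80, 0.85, 0.90]
    method_b = [0.70, 0.72, 0.78]
    print(summarize(method_a), summarize(method_b), welch_t(method_a, method_b))
```

Turning the statistic into a p-value additionally needs the Welch-Satterthwaite degrees of freedom and a t-distribution CDF, which the standard library does not provide.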

Circularity Check

0 steps flagged

No circularity: architectural design with empirical validation


The paper presents a new policy architecture (DSSP) that combines SSM-based history encoding with diffusion and a dynamics-aware auxiliary loss. The abstract and provided text describe the auxiliary objective as a training choice to encourage preservation of future-state information, but no equations, derivations, or self-citations reduce the claimed performance gains or information-preservation property to a fitted parameter or input quantity by construction. Claims of superior efficiency with longer histories rest on benchmark experiments rather than on any self-referential definition or renaming of known results. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about the sequence modeling capabilities of SSMs and the ability of an auxiliary dynamics objective to preserve predictive information in the context vector; no new entities are postulated.

free parameters (1)
  • context representation dimension
    The size of the compact context vector is a tunable hyperparameter whose value affects information preservation and is likely selected via validation.
axioms (1)
  • domain assumption: State space models can efficiently compress long observation sequences while preserving information relevant to future state evolution when trained with a dynamics-aware objective.
    This underpins the history encoder design and is invoked to justify full-history conditioning without prohibitive memory cost.




Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 21 internal anchors

  1. [1]

    Openpi Comet: Competition Solution For 2025 BEHA VIOR Challenge, December 2025

    Junjie Bai, Yu-Wei Chao, Qizhi Chen, Jinwei Gu, Moo Jin Kim, Zhaoshuo Li, Xuan Li, Tsung- Yi Lin, Ming-Yu Liu, Nic Ma, Kaichun Mo, Delin Qu, Shangkun Sun, Hongchi Xia, Fangyin Wei, and Xiaohui Zeng. Openpi Comet: Competition Solution For 2025 BEHA VIOR Challenge, December 2025. URLhttp://arxiv.org/abs/2512.10071. arXiv:2512.10071 [cs]

  2. [2]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalash- n...

  3. [3]

    Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models, June 2025

    Jiahang Cao, Qiang Zhang, Jingkai Sun, Jiaxu Wang, Hao Cheng, Yulin Li, Jun Ma, Kun Wu, Zhiyuan Xu, Yecheng Shao, Wen Zhao, Gang Han, Yijie Guo, and Renjing Xu. Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models, June 2025. URL http://arxiv.org/abs/2409.07163. arXiv:2409.07163 [cs]

  4. [4]

    Gaf: Gaussian action field as a 4d representation for dynamic world modeling in robotic manipulation, 2025

    Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, and Yebin Liu. Gaf: Gaussian action field as a 4d representation for dynamic world modeling in robotic manipulation, 2025. URL https://arxiv.org/abs/2506.14135

  5. [5]

    History-Aware Visuomotor Policy Learning via Point Tracking, March 2026

    Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, and Cewu Lu. History-Aware Visuomotor Policy Learning via Point Tracking, March 2026. URL http://arxiv.org/abs/ 2509.17141. arXiv:2509.17141 [cs]

  6. [6]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. RoboTwin 2.0: A Scalable D...

  7. [7]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, March 2024. URLhttp://arxiv.org/abs/2303.04137. arXiv:2303.04137 [cs]

  8. [8]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality, May 2024. URL http://arxiv.org/abs/2405. 21060. arXiv:2405.21060 [cs]

  9. [9]

    Dream2flow: Bridging video generation and open-world manipulation with 3d object flow, 2025

    Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, and Ruohan Zhang. Dream2flow: Bridging video generation and open-world manipulation with 3d object flow, 2025. URL https://arxiv.org/abs/2512.24766

  10. [10]

    Omp: One-step meanflow policy with directional alignment, 2026

    Han Fang, Yize Huang, Yuheng Zhao, Paul Weng, Xiao Li, and Yutong Ban. Omp: One-step meanflow policy with directional alignment, 2026. URL https://arxiv.org/abs/2512. 19347

  11. [11]

    Learning video generation for robotic manipulation with collaborative trajectory control,

    Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, and Dahua Lin. Learning video generation for robotic manipulation with collaborative trajectory control,

  12. [12]

    URLhttps://arxiv.org/abs/2506.01943

  13. [13]

    Vita: Vision-to-action flow matching policy, 2026

    Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, and Iman Soltani. Vita: Vision-to-action flow matching policy, 2026. URL https://arxiv.org/abs/2507.13231. 10

  14. [14]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces, May 2024. URLhttp://arxiv.org/abs/2312.00752. arXiv:2312.00752 [cs]

  15. [15]

    SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

    Youqiang Gui, Yuxuan Zhou, Shen Cheng, Xinyang Yuan, Haoqiang Fan, Peng Cheng, and Shuaicheng Liu. SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation, March 2026. URL http://arxiv.org/abs/2603.05117. arXiv:2603.05117 [cs]

  16. [16]

    Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025

    Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, and Qing Li. Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025. URL https://arxiv.org/abs/2505.10075

  17. [17]

    Ctrl-world: A controllable generative world model for robot manipulation, 2026

    Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation, 2026. URL https://arxiv.org/abs/2510. 10125

  18. [18]

    Causal Confusion in Imitation Learning, November 2019

    Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal Confusion in Imitation Learning, November 2019. URLhttp://arxiv.org/abs/1905.11979. arXiv:1905.11979 [cs]

  19. [19]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. URLhttps://arxiv.org/abs/2006.11239

  20. [20]

    AdaFlow: Imitation Learning with Variance- Adaptive Flow-Based Policies, November 2024

    Xixi Hu, Bo Liu, Xingchao Liu, and Qiang Liu. AdaFlow: Imitation Learning with Variance- Adaptive Flow-Based Policies, November 2024. URLhttp://arxiv.org/abs/2402.04292. arXiv:2402.04292 [cs]

  21. [21]

    Graphcot- vla: A 3d spatial-aware reasoning vision-language-action model for robotic manipulation with ambiguous instructions, 2025

    Helong Huang, Min Cen, Kai Tan, Xingyue Quan, Guowei Huang, and Hong Zhang. Graphcot- vla: A 3d spatial-aware reasoning vision-language-action model for robotic manipulation with ambiguous instructions, 2025. URLhttps://arxiv.org/abs/2508.07650

  22. [22]

    ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context, October

    Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, and Jinwoo Shin. ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context, October

  23. [23]

    arXiv:2510.04246 [cs]

    URLhttp://arxiv.org/abs/2510.04246. arXiv:2510.04246 [cs]

  24. [24]

    arXiv preprint arXiv:2406.08234 (2024)

    Xiaogang Jia, Qian Wang, Atalay Donat, Bowen Xing, Ge Li, Hongyi Zhou, Onur Celik, Denis Blessing, Rudolf Lioutikov, and Gerhard Neumann. MaIL: Improving Imitation Learning with Mamba, November 2024. URL http://arxiv.org/abs/2406.08234. arXiv:2406.08234 [cs]

  25. [25]

    Wang, Hanyi Zhang, Qian Wang, Rudolf Lioutikov, and Gerhard Neumann

    Xiaogang Jia, Atalay Donat, Xi Huang, Xuan Zhao, Denis Blessing, Hongyi Zhou, Han A. Wang, Hanyi Zhang, Qian Wang, Rudolf Lioutikov, and Gerhard Neumann. X-IL: Exploring the Design Space of Imitation Learning Policies, February 2025. URL http://arxiv.org/ abs/2502.12330. arXiv:2502.12330 [cs]

  26. [26]

    AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

    Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, and Biqing Qi. Asyncvla: Asynchronous flow matching for vision-language-action models, 2025. URL https://arxiv.org/abs/ 2511.14148

  27. [27]

    World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation, 2026

    Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation, 2026. URL https://arxiv.org/abs/2509. 19080

  28. [28]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An Open-Source Vision-Language-Action Model, September 2024. URL http://arxiv....

  29. [29]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URLhttps://arxiv.org/abs/2502.19645. 11

  30. [30]

    HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

    Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, and Jinwoo Shin. HAMLET: Switch your Vision-Language-Action Model into a History- Aware Policy, April 2026. URL http://arxiv.org/abs/2510.00695. arXiv:2510.00695 [cs]

  31. [31]

    Li, Berlin Chen, Caitlin Wang, Aviv Bick, J

    Aakash Lahoti, Kevin Y . Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved Sequence Modeling using State Space Principles, March 2026. URLhttp://arxiv.org/abs/2603.15569. arXiv:2603.15569 [cs]

  32. [32]

    Behavior generation with latent actions

    Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior Generation with Latent Actions, June 2024. URL http://arxiv. org/abs/2403.03181. arXiv:2403.03181 [cs]

  33. [33]

    End-to-End Training of Deep Visuomotor Policies

    Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies, April 2016. URL http://arxiv.org/abs/1504.00702. arXiv:1504.00702 [cs]

  34. [34]

    CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling, October 2025

    Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, and Jiangmiao Pang. CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling, October 2025. URL http://arxiv.org/abs/2506.19816. arXiv:2506.19816 [cs]

  35. [35]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation, March 2025. URLhttp://arxiv.org/abs/2410.07864. arXiv:2410.07864 [cs]

  36. [36]

    Gwm: Towards scalable gaussian world models for robotic manipulation, 2025

    Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Yansong Tang, and Siyuan Huang. Gwm: Towards scalable gaussian world models for robotic manipulation, 2025. URL https://arxiv.org/abs/2508.17600

  37. [37]

    H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning, May

    Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, and Huazhe Xu. H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning, May

  38. [38]

    arXiv:2505.07819 [cs] version: 1

    URLhttp://arxiv.org/abs/2505.07819. arXiv:2505.07819 [cs] version: 1

  39. [39]

    CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion, August 2025

    Jiahua Ma, Yiran Qin, Yixiong Li, Xuanqi Liao, Yulan Guo, and Ruimao Zhang. CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion, August 2025. URL http://arxiv.org/abs/2506.14769. arXiv:2506.14769 [cs]

  40. [40]

    BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames, February 2026

    Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, and Aviral Kumar. BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames, February 2026. URL http://arxiv.org/abs/2602.15010. arXiv:2602.15010 [cs]

  41. [41]

    Dispo: Diffusion-ssm based policy learning for coarse-to-fine action discretization, 2026

    Nayoung Oh, Jaehyeong Jang, Moonkyeong Jung, and Daehyung Park. Dispo: Diffusion-ssm based policy learning for coarse-to-fine action discretization, 2026. URL https://arxiv. org/abs/2409.14719

  42. [42]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable Diffusion Models with Transformers, March 2023. URLhttp://arxiv.org/abs/2212.09748. arXiv:2212.09748 [cs]

  43. [43]

    Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation, June 2024

    Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, and Jeannette Bohg. Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation, June 2024. URL http://arxiv. org/abs/2405.07503. arXiv:2405.07503 [cs]

  44. [44]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action model, 2025. URLhttps://arxiv.org/abs/2501.15830

  45. [45]

    Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

    Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforce- ment Learning and Demonstrations, June 2018. URL http://arxiv.org/abs/1709.10087. arXiv:1709.10087 [cs]. 12

  46. [46]

    Behavior Transformers: Cloning $k$ modes with one stone, October 2022

    Nur Muhammad Mahi Shafiullah, Zichen Jeff Cui, Ariuntuya Altanzaya, and Lerrel Pinto. Behavior Transformers: Cloning $k$ modes with one stone, October 2022. URL http: //arxiv.org/abs/2206.11251. arXiv:2206.11251 [cs]

  47. [47]

    MP1: MeanFlow Tames Policy Learning in 1-step for Robotic Manipulation, October 2025

    Juyi Sheng, Ziyi Wang, Peiming Li, and Mengyuan Liu. MP1: MeanFlow Tames Policy Learning in 1-step for Robotic Manipulation, October 2025. URL http://arxiv.org/abs/ 2507.10543. arXiv:2507.10543 [cs]

  48. [48]

    Andrew Bagnell, and Zhiwei Steven Wu

    Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, and Zhiwei Steven Wu. Sequence Model Imitation Learning with Unobserved Contexts, January 2023. URL http://arxiv. org/abs/2208.02225. arXiv:2208.02225 [cs]

  49. [49]

    Vlash: Real-time vlas via future-state-aware asynchronous inference, 2025

    Jiaming Tang, Yufei Sun, Yilong Zhao, Shang Yang, Yujun Lin, Zhuoyang Zhang, James Hou, Yao Lu, Zhijian Liu, and Song Han. Vlash: Real-time vlas via future-state-aware asynchronous inference, 2025. URLhttps://arxiv.org/abs/2512.01031

  50. [50]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213

  51. [51]

    Learning Long-Context Diffusion Policies via Past-Token Prediction, May 2025

    Marcel Torne, Andy Tang, Yuejiang Liu, and Chelsea Finn. Learning Long-Context Diffusion Policies via Past-Token Prediction, May 2025. URL http://arxiv.org/abs/2505.09561. arXiv:2505.09561 [cs]

  52. [52]

    Mamba as a Motion Encoder for Robotic Imitation Learning.IEEE Access, 13:69941–69949, 2025

    Toshiaki Tsuji. Mamba as a Motion Encoder for Robotic Imitation Learning.IEEE Access, 13:69941–69949, 2025. ISSN 2169-3536. doi: 10.1109/ACCESS.2025.3561283. URL https://ieeexplore.ieee.org/document/10966860/

  53. [53]

    Cyclemanip: Enabling cyclic task manipulation via effective historical perception and understanding.arXiv preprint arXiv:2512.01022, 2025

    Yi-Lin Wei, Haoran Liao, Yuhao Lin, Pengyue Wang, Zhizhao Liang, Guiliang Liu, and Wei-Shi Zheng. Cyclemanip: Enabling cyclic task manipulation via effective historical perception and understanding.arXiv preprint arXiv:2512.01022, 2025

  54. [54]

    Fighting Copycat Agents in Behavioral Cloning from Observation Histories, October 2020

    Chuan Wen, Jierui Lin, Trevor Darrell, Dinesh Jayaraman, and Yang Gao. Fighting Copycat Agents in Behavioral Cloning from Observation Histories, October 2020. URL http://arxiv.org/abs/2010.14876. arXiv:2010.14876 [cs]

  55. [55]

    Keyframe-Focused Visual Imitation Learning, June 2021

    Chuan Wen, Jierui Lin, Jianing Qian, Yang Gao, and Dinesh Jayaraman. Keyframe-Focused Visual Imitation Learning, June 2021. URL http://arxiv.org/abs/2106.06452. arXiv:2106.06452 [cs]

  56. [56]

    In-context adaptation for generalizable imitation learning

    Junlin Xie, Xu Luo, Hao Wu, Ji Zhang, Youguang Xing, Lianli Gao, and Jingkuan Song. In-context adaptation for generalizable imitation learning. In CoRL 2025 Workshop RemembeRL

  57. [57]

    ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training, September 2025

    Ge Yan, Jiyue Zhu, Yuquan Deng, Shiqi Yang, Ri-Zhao Qiu, Xuxin Cheng, Marius Memmel, Ranjay Krishna, Ankit Goyal, Xiaolong Wang, and Dieter Fox. ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training, September 2025. URL http://arxiv.org/abs/2509.01819. arXiv:2509.01819 [cs]

  58. [58]

    PlayWorld: Learning Robot World Models from Autonomous Play

    Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M. Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, and Anirudha Majumdar. Playworld: Learning robot world models from autonomous play, 2026. URL https://arxiv.org/abs/2603.09030

  59. [59]

    RoboSSM: Scalable In-context Imitation Learning via State-Space Models, September

    Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, and Peter Stone. RoboSSM: Scalable In-context Imitation Learning via State-Space Models, September. URL http://arxiv.org/abs/2509.19658. arXiv:2509.19658 [cs]

  61. [61]

    Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning, June 2021

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shively, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning, June 2021. URL http://arxiv.org/abs/1910.10897. arXiv:1910.10897 [cs]

  62. [62]

    3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations, September 2024. URL http://arxiv.org/abs/2403.03954. arXiv:2403.03954 [cs]

  63. [63]

    Transporter Networks: Rearranging the Visual World for Robotic Manipulation, January 2022

    Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Ayzaan Wahid, Vikas Sindhwani, and Johnny Lee. Transporter Networks: Rearranging the Visual World for Robotic Manipulation, January 2022. URL http://arxiv.org/abs/2010.14406. arXiv:2010.14406 [cs]

  64. [64]

    Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation

    Qinglun Zhang, Zhen Liu, Haoqiang Fan, Guanghui Liu, Bing Zeng, and Shuaicheng Liu. Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14754–14762, 2025

  65. [65]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, April 2023. URL http://arxiv.org/abs/2304.13705. arXiv:2304.13705 [cs]

  66. [66]

    TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024

  67. [67]

    MTIL: Encoding Full History With Mamba for Temporal Imitation Learning. IEEE Robotics and Automation Letters, 10(11):11761–11767, November 2025

    Yulin Zhou, Yuankai Lin, Fanzhe Peng, Jiahui Chen, Kaiji Huang, Hua Yang, and Zhouping Yin. MTIL: Encoding Full History With Mamba for Temporal Imitation Learning. IEEE Robotics and Automation Letters, 10(11):11761–11767, November 2025. ISSN 2377-3766, 2377-3774. doi: 10.1109/LRA.2025.3615520. URL https://ieeexplore.ieee.org/document/11184145/

  68. [68]

    Irasim: A fine-grained world model for robot manipulation, 2025

    Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. Irasim: A fine-grained world model for robot manipulation, 2025. URL https://arxiv.org/abs/2406.14540.

    A Preliminaries

    Diffusion Policy. Diffusion Policy [7] adapts Denoising Diffusion Probabilistic Models (DDPMs) [18] to action generation. The policy treats a future action sequen...

  69. [69]

    For a conditioning variable $X$, the minimum achievable Mean Squared Error (MSE) loss is the expected conditional variance of the expert action $a_t$ calculated over the expert dataset $\mathcal{D}_E$: $L^*(X) = \mathbb{E}_{(X, a_t) \sim \mathcal{D}_E}[\mathrm{Var}(a_t \mid X)]$ (17). Specifically, we denote the optimal losses for reactive and history-conditioned policies as: $L^*(o_t) = \mathbb{E}_{o_t \sim \mathcal{D}_E}[\mathrm{Var}(a_t \mid o_t)]$ and $L^*(h_t) = \mathbb{E}_{h}$...