DSSP: Diffusion State Space Policy with Full-History Encoding
Pith reviewed 2026-05-22 10:14 UTC · model grok-4.3
The pith
DSSP conditions diffusion policies on full robot observation history via state space model compression to resolve long-horizon ambiguities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that full-history conditioning can be made efficient. An SSM-based history encoder, trained with a dynamics-aware auxiliary objective and fused hierarchically with recent states, conditions an SSM diffusion backbone. The paper reports that this design outperforms short-window baselines and scales better with longer histories while using fewer parameters.
What carries the argument
A dynamics-aware SSM history encoder compresses the full observation stream into a compact context representation, which is then hierarchically fused with recent observations for diffusion-based action generation.
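To make the compression idea concrete, here is a minimal sketch of how a state space recurrence can fold a history of any length into a fixed-size context and then join it with the most recent observations. It assumes a plain diagonal, time-invariant SSM rather than the selective Mamba encoder the paper uses; `ssm_compress`, `hierarchical_condition`, and all sizes are hypothetical names chosen for illustration.

```python
# Sketch only: a diagonal linear SSM, not the paper's Mamba encoder.
# All function names, weights, and dimensions are illustrative assumptions.

def ssm_compress(observations, decay, in_proj):
    """Run h_t = decay * h_{t-1} + in_proj @ o_t and return the final state."""
    state = [0.0] * len(decay)
    for obs in observations:
        for i in range(len(state)):
            drive = sum(w * x for w, x in zip(in_proj[i], obs))
            state[i] = decay[i] * state[i] + drive
    return state


def hierarchical_condition(history, recent_window, decay, in_proj):
    """Concatenate the compressed full-history context with the last raw observations."""
    context = ssm_compress(history, decay, in_proj)
    recent = [x for obs in history[-recent_window:] for x in obs]
    return context + recent


decay = [0.9, 0.5, 0.99]                        # per-channel memory timescales
in_proj = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 state channels, 2-D observations
short = [[1.0, 0.0]] * 10
long = [[1.0, 0.0]] * 1000

cond_short = hierarchical_condition(short, 2, decay, in_proj)
cond_long = hierarchical_condition(long, 2, decay, in_proj)
print(len(cond_short), len(cond_long))  # prints: 7 7
```

The conditioning vector has the same length for 10 and 1000 steps of history, which is the property that lets the downstream policy stay the same size as the history grows.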
If this is right
- Full-history conditioning resolves history-dependent ambiguities that short observation windows cannot address in long-horizon tasks.
- The auxiliary objective keeps the compressed context predictive of future states without requiring the full history at inference time.
- Using SSMs for both the history encoder and the diffusion backbone maintains architectural consistency while reducing model size and GPU memory.
- Performance remains strong or improves as history length grows without proportional increases in parameter count.
Where Pith is reading between the lines
- The hierarchical compression approach could be tested with other sequence models such as transformers to see if the same efficiency pattern holds.
- Resource-constrained robots might benefit from deploying these smaller models on tasks that previously required larger history-aware policies.
- The method suggests exploring similar auxiliary objectives for compressing other sensor streams like vision or tactile data in manipulation.
Load-bearing premise
The dynamics-aware auxiliary training objective ensures the compressed context representation from the SSM history encoder preserves critical information regarding future state evolution.
What would settle it
Measure success rates on long-horizon manipulation tasks at increasing history lengths, with and without the auxiliary objective. If the full model's gap over short-window baselines disappears or reverses, the central claim would be falsified; if the gap persists unchanged once the auxiliary objective is ablated, the objective is not what carries the result.
Original abstract
Diffusion-based imitation learning has shown strong promise for robot manipulation. However, most existing policies condition only on the current observation or a short window of recent observations, limiting their ability to resolve history-dependent ambiguities in long-horizon tasks. To address this, we introduce DSSP, a history-conditioned Diffusion State Space Policy that enables efficient, full-history conditioning for robot manipulation. Leveraging the continuous sequence modeling properties of State Space Models (SSMs), our history encoder effectively compresses the entire observation stream into a compact context representation. To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective. This high-level context representation is then seamlessly fused with recent state observations to form a hierarchical conditioning mechanism for action generation. Furthermore, to maintain architectural consistency and minimize GPU memory overhead, we also instantiate the diffusion backbone itself using an SSM. Extensive experiments across simulation benchmarks and real-world manipulation tasks show that DSSP achieves state-of-the-art performance with a significantly smaller model size, demonstrating superior efficiency of the hierarchical conditioning in capturing crucial information as the history length increases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DSSP, a Diffusion State Space Policy for robot manipulation that conditions on full observation history via an SSM-based encoder. The encoder compresses the entire history into a compact context vector, trained with a dynamics-aware auxiliary objective to preserve future state evolution information. This context is hierarchically fused with recent observations to condition a diffusion policy (also SSM-based) for action generation. Experiments across simulation benchmarks and real-world tasks claim state-of-the-art performance with significantly smaller model sizes, with gains attributed to superior efficiency of the hierarchical conditioning as history length increases.
Significance. If the empirical claims hold, the work offers a practical advance in efficient long-horizon imitation learning by showing that SSM-based full-history encoding can outperform short-window baselines while reducing model size. The combination of dynamics-aware auxiliary training and hierarchical conditioning provides a concrete mechanism for resolving history-dependent ambiguities in manipulation, with potential applicability to other sequential decision tasks where memory efficiency matters.
major comments (2)
- [Abstract and §3] Abstract and §3 (Method): The central claim that the dynamics-aware auxiliary training objective ensures the compressed context representation from the SSM history encoder preserves critical information regarding future state evolution lacks direct quantitative validation. No next-state prediction error, mutual information, or reconstruction metrics between the context vector and future trajectories are reported to confirm that the auxiliary loss actually achieves the asserted preservation; observed performance gains could therefore be attributable to the SSM diffusion backbone or fusion mechanism instead.
- [§4] §4 (Experiments): The SOTA claims and the assertion of superior efficiency as history length increases rest on comparisons whose exact baselines, metrics, statistical significance tests, and ablation controls are not fully detailed in the provided abstract and summary. Without these, it is difficult to assess whether the hierarchical conditioning is the load-bearing factor or whether gains are driven by other architectural choices.
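One common way to produce the next-state prediction metric that the first major comment asks for is a linear probe: fit a simple regressor from the frozen context to the next state and report held-out error. The sketch below uses scalar synthetic data as a stand-in; it is an assumption about how such a check could look, not the paper's evaluation, and every name and number in it is hypothetical.

```python
import random


def fit_linear_probe(contexts, targets):
    """Ordinary least squares for target ~ slope * context + intercept (scalar case)."""
    n = len(contexts)
    mean_c = sum(contexts) / n
    mean_t = sum(targets) / n
    cov = sum((c - mean_c) * (t - mean_t) for c, t in zip(contexts, targets))
    var = sum((c - mean_c) ** 2 for c in contexts)
    slope = cov / var
    return slope, mean_t - slope * mean_c


def probe_mse(contexts, targets, slope, intercept):
    """Mean squared error of the fitted probe on the given pairs."""
    errors = [(slope * c + intercept - t) ** 2 for c, t in zip(contexts, targets)]
    return sum(errors) / len(errors)


def heldout_error(contexts, targets, n_train=400):
    """Fit on the first n_train pairs, evaluate on the remainder."""
    slope, intercept = fit_linear_probe(contexts[:n_train], targets[:n_train])
    return probe_mse(contexts[n_train:], targets[n_train:], slope, intercept)


rng = random.Random(0)
# Synthetic stand-ins: one context that encodes the next state up to noise,
# and one that is independent of it.
next_states = [rng.gauss(0.0, 1.0) for _ in range(500)]
informative = [2.0 * s + rng.gauss(0.0, 0.1) for s in next_states]
uninformative = [rng.gauss(0.0, 1.0) for _ in next_states]

err_informative = heldout_error(informative, next_states)
err_uninformative = heldout_error(uninformative, next_states)
print(err_informative < err_uninformative)
```

Comparing the probe error of a context trained with and without the auxiliary loss would separate the effect of that loss from the effect of the backbone or the fusion step.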
minor comments (2)
- [§3] Notation for the context representation dimension and the auxiliary loss weighting should be introduced explicitly with symbols rather than descriptive phrases to improve reproducibility.
- [§4] Figure captions for the architecture diagram and history-length scaling plots should include axis labels, legend details, and error bars to clarify the efficiency claims.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we intend to make to strengthen the presentation and validation of our claims.
-
Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that the dynamics-aware auxiliary training objective ensures the compressed context representation from the SSM history encoder preserves critical information regarding future state evolution lacks direct quantitative validation. No next-state prediction error, mutual information, or reconstruction metrics between the context vector and future trajectories are reported to confirm that the auxiliary loss actually achieves the asserted preservation; observed performance gains could therefore be attributable to the SSM diffusion backbone or fusion mechanism instead.
Authors: We agree that direct quantitative validation of the information preserved by the context vector would provide stronger support for the role of the auxiliary objective. The current manuscript relies on downstream task performance to demonstrate the benefits of the dynamics-aware training, but does not report explicit metrics such as next-state prediction error or mutual information between the context and future states. In the revised version, we will add these evaluations, including next-state prediction accuracy computed from the context vector and an ablation comparing performance with and without the auxiliary loss, to isolate its contribution.
revision: yes
-
Referee: [§4] §4 (Experiments): The SOTA claims and the assertion of superior efficiency as history length increases rest on comparisons whose exact baselines, metrics, statistical significance tests, and ablation controls are not fully detailed in the provided abstract and summary. Without these, it is difficult to assess whether the hierarchical conditioning is the load-bearing factor or whether gains are driven by other architectural choices.
Authors: We thank the referee for highlighting the need for greater experimental transparency. While the full manuscript contains the relevant experimental details, we acknowledge that the description of baselines, exact metrics, statistical tests, and controls could be expanded for clarity. In the revision, we will update §4 to explicitly enumerate all baseline configurations, report means and standard deviations across seeds with statistical significance tests, and provide additional ablations that isolate the hierarchical fusion mechanism from other architectural elements.
revision: yes
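The reporting promised in the second response (means and standard deviations across seeds, plus a significance test) could be sketched as follows. Welch's t-test is one reasonable choice and is an assumption here, since the rebuttal does not name a test. The success rates are made-up placeholders, not results from the paper.

```python
import math
import statistics


def summarize(success_rates):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(success_rates), statistics.stdev(success_rates)


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof


# Placeholder per-seed success rates (hypothetical, for illustration only).
full_history_runs = [0.80, 0.84, 0.78, 0.82, 0.81]
short_window_runs = [0.70, 0.75, 0.72, 0.69, 0.74]

mean_full, std_full = summarize(full_history_runs)
mean_short, std_short = summarize(short_window_runs)
t_stat, dof = welch_t(full_history_runs, short_window_runs)
print(f"full: {mean_full:.3f} +/- {std_full:.3f}")
print(f"short: {mean_short:.3f} +/- {std_short:.3f}")
print(f"Welch t = {t_stat:.2f}, dof = {dof:.2f}")
```

Turning the statistic into a p-value needs the Student t distribution, which the standard library does not provide; a statistics package would be used for that step.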
Circularity Check
No circularity: architectural design with empirical validation
The paper presents a new policy architecture (DSSP) that combines SSM-based history encoding with diffusion and a dynamics-aware auxiliary loss. The abstract and provided text describe the auxiliary objective as a training choice to encourage preservation of future-state information, but no equations, derivations, or self-citations reduce the claimed performance gains or information-preservation property to a fitted parameter or input quantity by construction. Claims of superior efficiency with longer histories rest on benchmark experiments rather than on any self-referential definition or renaming of known results. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- context representation dimension
axioms (1)
- domain assumption: State space models can efficiently compress long observation sequences while preserving information relevant to future state evolution when trained with a dynamics-aware objective.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear.
To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective... $\mathcal{L}_{\mathrm{dyn}}(\psi, \phi) = \mathbb{E}\left[1 - \cos\left(g_\phi(c_t, a_t), \mathrm{sg}(z_{t+1})\right)\right]$
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat
Relation between the paper passage and the cited Recognition theorem: unclear.
We instantiate the history backbone using a State-Space Model (SSM) and define the context representation ct as the final output token... Mamba as the history encoder
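The dynamics loss quoted in the first passage above, one minus the cosine similarity between a prediction from (c_t, a_t) and a stop-gradient target z_{t+1}, can be illustrated with a forward-only sketch. The linear `predictor` is a hypothetical stand-in for g_phi, and because plain Python has no autograd, the stop-gradient appears only as a comment; in a framework with automatic differentiation the target would be detached.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predictor(context, action, weights):
    """Hypothetical linear stand-in for g_phi acting on concat(c_t, a_t)."""
    features = context + action
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def dynamics_loss(batch, weights):
    """Batch mean of 1 - cos(g_phi(c_t, a_t), z_{t+1})."""
    losses = []
    for context, action, z_next in batch:
        # Forward value of sg(z_{t+1}); an autograd framework would also
        # block gradients from flowing into the target here.
        target = list(z_next)
        prediction = predictor(context, action, weights)
        losses.append(1.0 - cosine_similarity(prediction, target))
    return sum(losses) / len(losses)


weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # illustrative 2x3 projection
batch = [
    ([1.0, 0.0], [0.5], [2.0, 0.0]),   # prediction aligned with target
    ([0.0, 1.0], [0.5], [0.0, -3.0]),  # prediction opposite to target
]
print(dynamics_loss(batch, weights))  # prints: 1.0
```

The loss depends only on direction: the aligned pair contributes 0 despite the differing magnitudes, and the opposed pair contributes the maximum value of 2.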
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Openpi Comet: Competition Solution For 2025 BEHAVIOR Challenge, December 2025
Junjie Bai, Yu-Wei Chao, Qizhi Chen, Jinwei Gu, Moo Jin Kim, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin, Ming-Yu Liu, Nic Ma, Kaichun Mo, Delin Qu, Shangkun Sun, Hongchi Xia, Fangyin Wei, and Xiaohui Zeng. Openpi Comet: Competition Solution For 2025 BEHAVIOR Challenge, December 2025. URL http://arxiv.org/abs/2512.10071. arXiv:2512.10071 [cs]
-
[2]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalash- n...
-
[3]
Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models, June 2025
Jiahang Cao, Qiang Zhang, Jingkai Sun, Jiaxu Wang, Hao Cheng, Yulin Li, Jun Ma, Kun Wu, Zhiyuan Xu, Yecheng Shao, Wen Zhao, Gang Han, Yijie Guo, and Renjing Xu. Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models, June 2025. URL http://arxiv.org/abs/2409.07163. arXiv:2409.07163 [cs]
-
[4]
Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, and Yebin Liu. Gaf: Gaussian action field as a 4d representation for dynamic world modeling in robotic manipulation, 2025. URL https://arxiv.org/abs/2506.14135
-
[5]
History-Aware Visuomotor Policy Learning via Point Tracking, March 2026
Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, and Cewu Lu. History-Aware Visuomotor Policy Learning via Point Tracking, March 2026. URL http://arxiv.org/abs/2509.17141. arXiv:2509.17141 [cs]
-
[6]
Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. RoboTwin 2.0: A Scalable D...
-
[7]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, March 2024. URL http://arxiv.org/abs/2303.04137. arXiv:2303.04137 [cs]
-
[8]
Tri Dao and Albert Gu. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality, May 2024. URL http://arxiv.org/abs/2405.21060. arXiv:2405.21060 [cs]
-
[9]
Dream2flow: Bridging video generation and open-world manipulation with 3d object flow, 2025
Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, and Ruohan Zhang. Dream2flow: Bridging video generation and open-world manipulation with 3d object flow, 2025. URL https://arxiv.org/abs/2512.24766
-
[10]
Omp: One-step meanflow policy with directional alignment, 2026
Han Fang, Yize Huang, Yuheng Zhao, Paul Weng, Xiao Li, and Yutong Ban. Omp: One-step meanflow policy with directional alignment, 2026. URL https://arxiv.org/abs/2512.19347
-
[11]
Learning video generation for robotic manipulation with collaborative trajectory control,
Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, and Dahua Lin. Learning video generation for robotic manipulation with collaborative trajectory control,
- [12]
-
[13]
Vita: Vision-to-action flow matching policy, 2026
Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, and Iman Soltani. Vita: Vision-to-action flow matching policy, 2026. URL https://arxiv.org/abs/2507.13231.
-
[14]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces, May 2024. URL http://arxiv.org/abs/2312.00752. arXiv:2312.00752 [cs]
-
[15]
SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation
Youqiang Gui, Yuxuan Zhou, Shen Cheng, Xinyang Yuan, Haoqiang Fan, Peng Cheng, and Shuaicheng Liu. SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation, March 2026. URL http://arxiv.org/abs/2603.05117. arXiv:2603.05117 [cs]
-
[16]
Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025
Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, and Qing Li. Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025. URL https://arxiv.org/abs/2505.10075
-
[17]
Ctrl-world: A controllable generative world model for robot manipulation, 2026
Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation, 2026. URL https://arxiv.org/abs/2510.10125
-
[18]
Causal Confusion in Imitation Learning, November 2019
Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal Confusion in Imitation Learning, November 2019. URL http://arxiv.org/abs/1905.11979. arXiv:1905.11979 [cs]
-
[19]
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. URL https://arxiv.org/abs/2006.11239
-
[20]
AdaFlow: Imitation Learning with Variance-Adaptive Flow-Based Policies, November 2024
Xixi Hu, Bo Liu, Xingchao Liu, and Qiang Liu. AdaFlow: Imitation Learning with Variance-Adaptive Flow-Based Policies, November 2024. URL http://arxiv.org/abs/2402.04292. arXiv:2402.04292 [cs]
-
[21]
Helong Huang, Min Cen, Kai Tan, Xingyue Quan, Guowei Huang, and Hong Zhang. Graphcot-vla: A 3d spatial-aware reasoning vision-language-action model for robotic manipulation with ambiguous instructions, 2025. URL https://arxiv.org/abs/2508.07650
-
[22]
ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context, October
Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, and Jinwoo Shin. ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context, October
- [23]
-
[24]
Xiaogang Jia, Qian Wang, Atalay Donat, Bowen Xing, Ge Li, Hongyi Zhou, Onur Celik, Denis Blessing, Rudolf Lioutikov, and Gerhard Neumann. MaIL: Improving Imitation Learning with Mamba, November 2024. URL http://arxiv.org/abs/2406.08234. arXiv:2406.08234 [cs]
-
[25]
Xiaogang Jia, Atalay Donat, Xi Huang, Xuan Zhao, Denis Blessing, Hongyi Zhou, Han A. Wang, Hanyi Zhang, Qian Wang, Rudolf Lioutikov, and Gerhard Neumann. X-IL: Exploring the Design Space of Imitation Learning Policies, February 2025. URL http://arxiv.org/abs/2502.12330. arXiv:2502.12330 [cs]
-
[26]
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, and Biqing Qi. Asyncvla: Asynchronous flow matching for vision-language-action models, 2025. URL https://arxiv.org/abs/2511.14148
-
[27]
Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation, 2026. URL https://arxiv.org/abs/2509.19080
-
[28]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An Open-Source Vision-Language-Action Model, September 2024. URL http://arxiv....
-
[29]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502.19645.
-
[30]
HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy
Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, and Jinwoo Shin. HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy, April 2026. URL http://arxiv.org/abs/2510.00695. arXiv:2510.00695 [cs]
-
[31]
Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved Sequence Modeling using State Space Principles, March 2026. URL http://arxiv.org/abs/2603.15569. arXiv:2603.15569 [cs]
-
[32]
Behavior generation with latent actions
Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior Generation with Latent Actions, June 2024. URL http://arxiv.org/abs/2403.03181. arXiv:2403.03181 [cs]
-
[33]
End-to-End Training of Deep Visuomotor Policies
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies, April 2016. URL http://arxiv.org/abs/1504.00702. arXiv:1504.00702 [cs]
-
[34]
Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, and Jiangmiao Pang. CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling, October 2025. URL http://arxiv.org/abs/2506.19816. arXiv:2506.19816 [cs]
-
[35]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation, March 2025. URL http://arxiv.org/abs/2410.07864. arXiv:2410.07864 [cs]
-
[36]
Gwm: Towards scalable gaussian world models for robotic manipulation, 2025
Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Yansong Tang, and Siyuan Huang. Gwm: Towards scalable gaussian world models for robotic manipulation, 2025. URL https://arxiv.org/abs/2508.17600
-
[37]
H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning, May
Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, and Huazhe Xu. H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning, May
-
[38]
URL http://arxiv.org/abs/2505.07819. arXiv:2505.07819 [cs] version: 1
-
[39]
CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion, August 2025
Jiahua Ma, Yiran Qin, Yixiong Li, Xuanqi Liao, Yulan Guo, and Ruimao Zhang. CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion, August 2025. URL http://arxiv.org/abs/2506.14769. arXiv:2506.14769 [cs]
-
[40]
BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames, February 2026
Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, and Aviral Kumar. BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames, February 2026. URL http://arxiv.org/abs/2602.15010. arXiv:2602.15010 [cs]
-
[41]
Dispo: Diffusion-ssm based policy learning for coarse-to-fine action discretization, 2026
Nayoung Oh, Jaehyeong Jang, Moonkyeong Jung, and Daehyung Park. Dispo: Diffusion-ssm based policy learning for coarse-to-fine action discretization, 2026. URL https://arxiv.org/abs/2409.14719
-
[42]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable Diffusion Models with Transformers, March 2023. URL http://arxiv.org/abs/2212.09748. arXiv:2212.09748 [cs]
-
[43]
Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation, June 2024
Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, and Jeannette Bohg. Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation, June 2024. URL http://arxiv.org/abs/2405.07503. arXiv:2405.07503 [cs]
-
[44]
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action model, 2025. URL https://arxiv.org/abs/2501.15830
-
[45]
Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations
Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations, June 2018. URL http://arxiv.org/abs/1709.10087. arXiv:1709.10087 [cs].
-
[46]
Behavior Transformers: Cloning $k$ modes with one stone, October 2022
Nur Muhammad Mahi Shafiullah, Zichen Jeff Cui, Ariuntuya Altanzaya, and Lerrel Pinto. Behavior Transformers: Cloning $k$ modes with one stone, October 2022. URL http://arxiv.org/abs/2206.11251. arXiv:2206.11251 [cs]
-
[47]
MP1: MeanFlow Tames Policy Learning in 1-step for Robotic Manipulation, October 2025
Juyi Sheng, Ziyi Wang, Peiming Li, and Mengyuan Liu. MP1: MeanFlow Tames Policy Learning in 1-step for Robotic Manipulation, October 2025. URL http://arxiv.org/abs/2507.10543. arXiv:2507.10543 [cs]
-
[48]
Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, and Zhiwei Steven Wu. Sequence Model Imitation Learning with Unobserved Contexts, January 2023. URL http://arxiv.org/abs/2208.02225. arXiv:2208.02225 [cs]
-
[49]
Vlash: Real-time vlas via future-state-aware asynchronous inference, 2025
Jiaming Tang, Yufei Sun, Yilong Zhao, Shang Yang, Yujun Lin, Zhuoyang Zhang, James Hou, Yao Lu, Zhijian Liu, and Song Han. Vlash: Real-time vlas via future-state-aware asynchronous inference, 2025. URL https://arxiv.org/abs/2512.01031
-
[50]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213
-
[51]
Learning Long-Context Diffusion Policies via Past-Token Prediction, May 2025
Marcel Torne, Andy Tang, Yuejiang Liu, and Chelsea Finn. Learning Long-Context Diffusion Policies via Past-Token Prediction, May 2025. URL http://arxiv.org/abs/2505.09561. arXiv:2505.09561 [cs]
-
[52]
Mamba as a Motion Encoder for Robotic Imitation Learning. IEEE Access, 13:69941–69949, 2025
Toshiaki Tsuji. Mamba as a Motion Encoder for Robotic Imitation Learning. IEEE Access, 13:69941–69949, 2025. ISSN 2169-3536. doi: 10.1109/ACCESS.2025.3561283. URL https://ieeexplore.ieee.org/document/10966860/
-
[53]
Yi-Lin Wei, Haoran Liao, Yuhao Lin, Pengyue Wang, Zhizhao Liang, Guiliang Liu, and Wei-Shi Zheng. Cyclemanip: Enabling cyclic task manipulation via effective historical perception and understanding. arXiv preprint arXiv:2512.01022, 2025
-
[54]
Fighting Copycat Agents in Behavioral Cloning from Observation Histories, October 2020
Chuan Wen, Jierui Lin, Trevor Darrell, Dinesh Jayaraman, and Yang Gao. Fighting Copycat Agents in Behavioral Cloning from Observation Histories, October 2020. URL http://arxiv.org/abs/2010.14876. arXiv:2010.14876 [cs]
-
[55]
Keyframe-Focused Visual Imitation Learning, June 2021
Chuan Wen, Jierui Lin, Jianing Qian, Yang Gao, and Dinesh Jayaraman. Keyframe-Focused Visual Imitation Learning, June 2021. URL http://arxiv.org/abs/2106.06452. arXiv:2106.06452 [cs]
-
[56]
In-context adaptation for generalizable imitation learning
Junlin Xie, Xu Luo, Hao Wu, Ji Zhang, Youguang Xing, Lianli Gao, and Jingkuan Song. In-context adaptation for generalizable imitation learning. In CoRL 2025 Workshop RemembeRL
-
[57]
ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training, September 2025
Ge Yan, Jiyue Zhu, Yuquan Deng, Shiqi Yang, Ri-Zhao Qiu, Xuxin Cheng, Marius Memmel, Ranjay Krishna, Ankit Goyal, Xiaolong Wang, and Dieter Fox. ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training, September 2025. URL http://arxiv.org/abs/2509.01819. arXiv:2509.01819 [cs]
-
[58]
PlayWorld: Learning Robot World Models from Autonomous Play
Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M. Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, and Anirudha Majumdar. Playworld: Learning robot world models from autonomous play, 2026. URL https://arxiv.org/abs/2603.09030
-
[59]
RoboSSM: Scalable In-context Imitation Learning via State-Space Models, September
Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, and Peter Stone. RoboSSM: Scalable In-context Imitation Learning via State-Space Models, September
- [60]
-
[61]
Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning, June 2021
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shively, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning, June 2021. URL http://arxiv.org/abs/1910.10897. arXiv:1910.10897 [cs].
-
[62]
3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations, September 2024. URL http://arxiv.org/abs/2403.03954. arXiv:2403.03954 [cs]
-
[63]
Transporter Networks: Rearranging the Visual World for Robotic Manipulation, January 2022
Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Ayzaan Wahid, Vikas Sindhwani, and Johnny Lee. Transporter Networks: Rearranging the Visual World for Robotic Manipulation, January 2022. URL http://arxiv.org/abs/2010.14406. arXiv:2010.14406 [cs]
-
[64]
Qinglun Zhang, Zhen Liu, Haoqiang Fan, Guanghui Liu, Bing Zeng, and Shuaicheng Liu. Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14754–14762, 2025
-
[65]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, April 2023. URL http://arxiv.org/abs/2304.13705. arXiv:2304.13705 [cs]
-
[66]
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024
-
[67]
Yulin Zhou, Yuankai Lin, Fanzhe Peng, Jiahui Chen, Kaiji Huang, Hua Yang, and Zhouping Yin. MTIL: Encoding Full History With Mamba for Temporal Imitation Learning. IEEE Robotics and Automation Letters, 10(11):11761–11767, November 2025. ISSN 2377-3766, 2377-3774. doi: 10.1109/LRA.2025.3615520. URL https://ieeexplore.ieee.org/document/11184145/
-
[68]
Irasim: A fine-grained world model for robot manipulation, 2025
Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. Irasim: A fine-grained world model for robot manipulation, 2025. URL https://arxiv.org/abs/2406.14540.
A Preliminaries
Diffusion Policy. Diffusion Policy [7] adapts Denoising Diffusion Probabilistic Models (DDPMs) [18] to action generation. The policy treats a future action sequen...
For a conditioning variable $X$, the minimum achievable Mean Squared Error (MSE) loss is the expected conditional variance of the expert action $a_t$ calculated over the expert dataset $\mathcal{D}_E$: $\mathcal{L}^*(X) = \mathbb{E}_{(X, a_t) \sim \mathcal{D}_E}[\mathrm{Var}(a_t \mid X)]$ (17). Specifically, we denote the optimal losses for reactive and history-conditioned policies as: $\mathcal{L}^*(o_t) = \mathbb{E}_{o_t \sim \mathcal{D}_E}[\mathrm{Var}(a_t \mid o_t)]$ and $\mathcal{L}^*(h_t)$ = E h...
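The identity above, that the best achievable MSE equals the expected conditional variance of the expert action, can be checked on a toy dataset. The sketch below is an illustration under stated assumptions: a single ambiguous observation ("junction") whose correct action depends on which side the robot came from. The dataset, field names, and `expected_conditional_variance` are hypothetical.

```python
from collections import defaultdict


def expected_conditional_variance(samples, key):
    """Estimate E[Var(a | X)], where X = key(sample) and a = sample['action']."""
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample["action"])
    total = len(samples)
    loss = 0.0
    for actions in groups.values():
        mean = sum(actions) / len(actions)
        variance = sum((a - mean) ** 2 for a in actions) / len(actions)
        loss += (len(actions) / total) * variance
    return loss


# Toy expert data: identical current observation, action determined by history.
samples = (
    [{"history": "from_left", "obs": "junction", "action": 1.0}] * 50
    + [{"history": "from_right", "obs": "junction", "action": -1.0}] * 50
)

loss_obs = expected_conditional_variance(samples, key=lambda s: s["obs"])
loss_hist = expected_conditional_variance(
    samples, key=lambda s: (s["history"], s["obs"])
)
print(loss_obs, loss_hist)  # prints: 1.0 0.0
```

Conditioning on the observation alone leaves an irreducible error of 1.0, because the best reactive prediction is the average of the two opposite actions; adding the history removes it entirely in this toy case.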