DSSP: Diffusion State Space Policy with Full-History Encoding

Han Fang; Jianshu Hu; Shujia Li; Xiao Li; Yize Huang; Yunpeng Jiang; Yutong Ban; Zhiyuan Guan

arxiv: 2605.14598 · v2 · pith:UTWOCQGKnew · submitted 2026-05-14 · 💻 cs.RO

Pith reviewed 2026-05-22 10:14 UTC · model grok-4.3

classification 💻 cs.RO

keywords: diffusion policy, state space model, robot manipulation, imitation learning, full history conditioning, auxiliary dynamics objective, hierarchical conditioning

The pith

DSSP conditions diffusion policies on full robot observation history via state space model compression to resolve long-horizon ambiguities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces DSSP, a diffusion-based imitation learning policy for robot manipulation that conditions actions on the entire history of observations rather than a short recent window. It uses a state space model as a history encoder to compress the full observation stream into a compact context representation. The encoder is trained with a dynamics-aware auxiliary objective to preserve information about future state evolution. This context is fused hierarchically with recent observations to condition a diffusion model for action generation, and the diffusion backbone itself is also built from state space models for consistency and efficiency. Experiments on simulation benchmarks and real-world tasks show state-of-the-art performance with a significantly smaller model size, with the efficiency advantage growing as history length increases.
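
The compression step described above can be illustrated with a minimal sketch. It assumes a diagonal linear recurrence `h_t = decay * h_{t-1} + gain * x_t`, which is a simplification and not the selective SSM used in the paper. The function name `encode_history` and all numeric values are hypothetical and serve only to show that the context size does not depend on history length.

```python
def encode_history(observations, decay, input_gain):
    """Compress a sequence of observation vectors into one fixed-size context.

    Runs the per-channel recurrence h_t = decay * h_{t-1} + input_gain * x_t
    over the whole sequence and returns the final state h_T.
    (Assumed toy recurrence, not the paper's encoder.)
    """
    dim = len(decay)
    state = [0.0] * dim
    for obs in observations:
        state = [decay[i] * state[i] + input_gain[i] * obs[i] for i in range(dim)]
    return state


# Made-up 2-channel observation streams of very different lengths.
short_history = [[1.0, 0.0]] * 4
long_history = [[1.0, 0.0]] * 400

decay = [0.9, 0.5]
gain = [0.1, 0.5]

ctx_short = encode_history(short_history, decay, gain)
ctx_long = encode_history(long_history, decay, gain)

# Both contexts have the same dimensionality, whatever the history length.
print(len(ctx_short), len(ctx_long))
```

Because only the running state is kept, memory during encoding stays constant in this toy version; what information survives the compression depends on the recurrence parameters, which is the role the paper assigns to the auxiliary objective.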

Core claim

The central claim is that an SSM-based history encoder optimized by a dynamics-aware auxiliary objective, when fused hierarchically with recent states to condition an SSM diffusion backbone, enables efficient full-history conditioning that outperforms short-window baselines and scales better with longer histories while using fewer parameters.

What carries the argument

The dynamics-aware SSM history encoder that compresses the full observation stream into a compact context representation, which is then hierarchically fused with recent observations for diffusion-based action generation.
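
One way to picture the hierarchical fusion is as a conditioning prefix made of the single context token followed by the most recent state tokens. The sketch below is an assumption about layout only: the ordering (context first), the helper name `build_condition_prefix`, and the parameter `num_recent` are hypothetical and are not taken from the paper.

```python
def build_condition_prefix(context_token, state_tokens, num_recent):
    """Return [context_token] followed by the last `num_recent` state tokens.

    The prefix length is at most 1 + num_recent, independent of how many
    state tokens the full history contains. (Illustrative layout only.)
    """
    if num_recent < 0:
        raise ValueError("num_recent must be non-negative")
    recent = state_tokens[-num_recent:] if num_recent > 0 else []
    return [context_token] + recent


# Made-up 1-dimensional tokens for a history of 100 steps.
history_tokens = [[float(t)] for t in range(100)]
context = [0.5]  # stands in for the output of a history encoder

prefix = build_condition_prefix(context, history_tokens, num_recent=2)
print(len(prefix))  # 1 context token + 2 recent state tokens
```

Under this layout the denoiser would see a prefix of fixed length as the history grows, which is consistent with the efficiency argument made above, although the actual token arrangement in DSSP may differ.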

If this is right

Full-history conditioning resolves history-dependent ambiguities that short observation windows cannot address in long-horizon tasks.
The auxiliary objective keeps the compressed context predictive of future states without requiring the full history at inference time.
Using SSMs for both the history encoder and the diffusion backbone maintains architectural consistency while reducing model size and GPU memory.
Performance remains strong or improves as history length grows without proportional increases in parameter count.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The hierarchical compression approach could be tested with other sequence models such as transformers to see if the same efficiency pattern holds.
Resource-constrained robots might benefit from deploying these smaller models on tasks that previously required larger history-aware policies.
The method suggests exploring similar auxiliary objectives for compressing other sensor streams like vision or tactile data in manipulation.

Load-bearing premise

The dynamics-aware auxiliary training objective ensures the compressed context representation from the SSM history encoder preserves critical information regarding future state evolution.
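
The paper passage quoted further down this page gives the objective as an expected cosine distance between a prediction `g(c_t, a_t)` and a stop-gradient target `z_{t+1}`. A minimal sketch of that quantity follows, assuming the predictions and targets are already available as plain vectors. The function names are hypothetical, and because pure Python has no autograd, the stop-gradient is represented only by treating the targets as constants.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)  # eps guards against zero vectors


def dynamics_loss(predicted_next, target_next):
    """Batch mean of 1 - cos(prediction, target).

    `predicted_next` plays the role of g(c_t, a_t); `target_next` plays the
    role of sg(z_{t+1}) and is treated as a constant.
    """
    terms = [1.0 - cosine_similarity(p, t) for p, t in zip(predicted_next, target_next)]
    return sum(terms) / len(terms)


# Made-up batch of two samples: one aligned prediction, one orthogonal.
predictions = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.0], [1.0, 0.0]]

loss = dynamics_loss(predictions, targets)
print(loss)
```

The loss is 0 when every prediction points in the same direction as its target and grows as they diverge; it is insensitive to vector magnitude, so by itself it says nothing about scale.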

What would settle it

Measure success rates on long-horizon manipulation tasks with increasing history lengths when the auxiliary objective is ablated; if the performance gap over short-window baselines disappears or reverses, the central claim would be falsified.
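
The check described above could be organized roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, and the rollout outcomes are PLACEHOLDER booleans invented to exercise the code. They are not results from the paper.

```python
def success_rate(outcomes):
    """Fraction of successful rollouts in a list of booleans."""
    return sum(outcomes) / len(outcomes)


def gap_over_baseline(policy_runs, baseline_outcomes):
    """Per history length: policy success rate minus baseline success rate."""
    base = success_rate(baseline_outcomes)
    return {length: success_rate(policy_runs[length]) - base
            for length in sorted(policy_runs)}


def lengths_without_advantage(gaps, tolerance=0.0):
    """History lengths where the gap has disappeared or reversed."""
    return [length for length, gap in gaps.items() if gap <= tolerance]


# PLACEHOLDER outcomes for a full-history policy trained WITHOUT the
# auxiliary objective, keyed by history length (not real data).
ablated_runs = {
    16: [True, True, False, True],
    64: [True, False, False, True],
    256: [False, False, True, False],
}
# PLACEHOLDER outcomes for a short-window baseline (not real data).
baseline_outcomes = [True, False, False, True]

gaps = gap_over_baseline(ablated_runs, baseline_outcomes)
flagged = lengths_without_advantage(gaps)
print(gaps, flagged)
```

In a real study each list would hold many rollouts per seed, and the comparison would need a significance test rather than a raw threshold.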

Figures

Figures reproduced from arXiv: 2605.14598 by Han Fang, Jianshu Hu, Shujia Li, Xiao Li, Yize Huang, Yunpeng Jiang, Yutong Ban, Zhiyuan Guan.

**Figure 1.** The proposed DSSP leverages full-history context to resolve visual aliasing when history-blind baselines lose track of task progress, enabling consistent execution in long-horizon tasks. DSSP achieves superior success rates across both simulation tasks and real-world experiments.

**Figure 2.** Overview of DSSP. DSSP summarizes past multi-modal observations into a compact context token using a state-space history encoder. A dynamics-aware auxiliary loss encourages this token to retain historical information predictive of future state evolution. The learned context token is then combined with recent state tokens as a hierarchical prefix condition for a state-space diffusion denoiser to generate fu…

**Figure 3.** Overview of Experimental Environments. The figure summarizes representative environments from both simulation and real-world experiments. In each row, the left three panels visualize the RoboTwin tasks, the middle three columns present representative MetaWorld tasks, the next column shows Adroit tasks, and the rightmost column shows our real-world tasks.

Original abstract

Diffusion-based imitation learning has shown strong promise for robot manipulation. However, most existing policies condition only on the current observation or a short window of recent observations, limiting their ability to resolve history-dependent ambiguities in long-horizon tasks. To address this, we introduce DSSP, a history-conditioned Diffusion State Space Policy that enables efficient, full-history conditioning for robot manipulation. Leveraging the continuous sequence modeling properties of State Space Models (SSMs), our history encoder effectively compresses the entire observation stream into a compact context representation. To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective. This high-level context representation is then seamlessly fused with recent state observations to form a hierarchical conditioning mechanism for action generation. Furthermore, to maintain architectural consistency and minimize GPU memory overhead, we also instantiate the diffusion backbone itself using an SSM. Extensive experiments across simulation benchmarks and real-world manipulation tasks show that DSSP achieves state-of-the-art performance with a significantly smaller model size, demonstrating superior efficiency of the hierarchical conditioning in capturing crucial information as the history length increases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DSSP shows a workable way to add full-history conditioning to diffusion robot policies using SSM compression plus a dynamics auxiliary loss, but the paper offers no direct check that this loss actually retains future-state information.

The main point is that this work replaces short observation windows in diffusion imitation learning with an SSM-based encoder that compresses the entire history into a compact vector, then fuses it hierarchically with recent states while also swapping the diffusion network itself to an SSM for lower memory use. They add a dynamics-aware auxiliary objective during training to try to keep the compressed context useful for predicting future states. That combination is the concrete novelty here, not just another SSM plug-in.

The experiments report SOTA numbers on both simulation benchmarks and real manipulation tasks, with smaller overall model size and better scaling as history length grows. Those results are the strongest part of the paper and give a practical reason to look at the architecture.

The soft spot is exactly the one the stress test flags: the auxiliary objective is presented as the mechanism that preserves critical dynamics information in the SSM context, yet the write-up gives no quantitative verification such as next-state prediction error from the context vector or mutual information with future trajectories. Without that, the performance gains could be driven more by the SSM diffusion backbone or the fusion step than by successful dynamics preservation. The abstract also skips details on exact baselines, metrics, and ablations, though the full paper presumably supplies them.

This is aimed at robotics researchers who need history-dependent policies for long-horizon tasks and who are open to SSM alternatives to transformers. A reader working on efficient sequence modeling for control would get direct value from the scaling results and the architectural consistency. The idea is grounded enough and the empirical claims sharp enough that it deserves peer review rather than a desk reject; reviewers will likely ask for the missing validation on the auxiliary loss and component ablations. I would send it out.

Referee Report

2 major / 2 minor

Summary. The paper introduces DSSP, a Diffusion State Space Policy for robot manipulation that conditions on full observation history via an SSM-based encoder. The encoder compresses the entire history into a compact context vector, trained with a dynamics-aware auxiliary objective to preserve future state evolution information. This context is hierarchically fused with recent observations to condition a diffusion policy (also SSM-based) for action generation. Experiments across simulation benchmarks and real-world tasks claim state-of-the-art performance with significantly smaller model sizes, with gains attributed to superior efficiency of the hierarchical conditioning as history length increases.

Significance. If the empirical claims hold, the work offers a practical advance in efficient long-horizon imitation learning by showing that SSM-based full-history encoding can outperform short-window baselines while reducing model size. The combination of dynamics-aware auxiliary training and hierarchical conditioning provides a concrete mechanism for resolving history-dependent ambiguities in manipulation, with potential applicability to other sequential decision tasks where memory efficiency matters.

major comments (2)

[Abstract and §3] Abstract and §3 (Method): The central claim that the dynamics-aware auxiliary training objective ensures the compressed context representation from the SSM history encoder preserves critical information regarding future state evolution lacks direct quantitative validation. No next-state prediction error, mutual information, or reconstruction metrics between the context vector and future trajectories are reported to confirm that the auxiliary loss actually achieves the asserted preservation; observed performance gains could therefore be attributable to the SSM diffusion backbone or fusion mechanism instead.
[§4] §4 (Experiments): The SOTA claims and the assertion of superior efficiency as history length increases rest on comparisons whose exact baselines, metrics, statistical significance tests, and ablation controls are not fully detailed in the provided abstract and summary. Without these, it is difficult to assess whether the hierarchical conditioning is the load-bearing factor or whether gains are driven by other architectural choices.

minor comments (2)

[§3] Notation for the context representation dimension and the auxiliary loss weighting should be introduced explicitly with symbols rather than descriptive phrases to improve reproducibility.
[§4] Figure captions for the architecture diagram and history-length scaling plots should include axis labels, legend details, and error bars to clarify the efficiency claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we intend to make to strengthen the presentation and validation of our claims.

Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that the dynamics-aware auxiliary training objective ensures the compressed context representation from the SSM history encoder preserves critical information regarding future state evolution lacks direct quantitative validation. No next-state prediction error, mutual information, or reconstruction metrics between the context vector and future trajectories are reported to confirm that the auxiliary loss actually achieves the asserted preservation; observed performance gains could therefore be attributable to the SSM diffusion backbone or fusion mechanism instead.

Authors: We agree that direct quantitative validation of the information preserved by the context vector would provide stronger support for the role of the auxiliary objective. The current manuscript relies on downstream task performance to demonstrate the benefits of the dynamics-aware training, but does not report explicit metrics such as next-state prediction error or mutual information between the context and future states. In the revised version, we will add these evaluations, including next-state prediction accuracy computed from the context vector and an ablation comparing performance with and without the auxiliary loss, to isolate its contribution.

revision: yes

Referee: [§4] §4 (Experiments): The SOTA claims and the assertion of superior efficiency as history length increases rest on comparisons whose exact baselines, metrics, statistical significance tests, and ablation controls are not fully detailed in the provided abstract and summary. Without these, it is difficult to assess whether the hierarchical conditioning is the load-bearing factor or whether gains are driven by other architectural choices.

Authors: We thank the referee for highlighting the need for greater experimental transparency. While the full manuscript contains the relevant experimental details, we acknowledge that the description of baselines, exact metrics, statistical tests, and controls could be expanded for clarity. In the revision, we will update §4 to explicitly enumerate all baseline configurations, report means and standard deviations across seeds with statistical significance tests, and provide additional ablations that isolate the hierarchical fusion mechanism from other architectural elements.

revision: yes
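
The per-seed reporting promised in this response could look roughly like the sketch below, which uses only Python's standard `statistics` module. The helper names are hypothetical, the success rates are PLACEHOLDER values rather than results from the paper, and Welch's t statistic is one possible choice of test that the authors do not name.

```python
import math
import statistics


def summarize(rates):
    """Return (mean, sample standard deviation) of per-seed success rates."""
    return statistics.mean(rates), statistics.stdev(rates)


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# PLACEHOLDER per-seed success rates for two methods (not real data).
method_a = [0.8, 0.9, 1.0]
method_b = [0.5, 0.6, 0.7]

mean_a, std_a = summarize(method_a)
mean_b, std_b = summarize(method_b)
t_stat = welch_t(method_a, method_b)
print(mean_a, std_a, mean_b, std_b, t_stat)
```

Turning the statistic into a p-value requires the t distribution with Welch-Satterthwaite degrees of freedom, which the standard library does not provide; with only a few seeds, the resulting test would have limited power in any case.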

Circularity Check

0 steps flagged

No circularity: architectural design with empirical validation

The paper presents a new policy architecture (DSSP) that combines SSM-based history encoding with diffusion and a dynamics-aware auxiliary loss. The abstract and provided text describe the auxiliary objective as a training choice to encourage preservation of future-state information, but no equations, derivations, or self-citations reduce the claimed performance gains or information-preservation property to a fitted parameter or input quantity by construction. Claims of superior efficiency with longer histories rest on benchmark experiments rather than on any self-referential definition or renaming of known results. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about the sequence modeling capabilities of SSMs and the ability of an auxiliary dynamics objective to preserve predictive information in the context vector; no new entities are postulated.

free parameters (1)

context representation dimension
The size of the compact context vector is a tunable hyperparameter whose value affects information preservation and is likely selected via validation.

axioms (1)

domain assumption: State space models can efficiently compress long observation sequences while preserving information relevant to future state evolution when trained with a dynamics-aware objective.
This underpins the history encoder design and is invoked to justify full-history conditioning without prohibitive memory cost.

pith-pipeline@v0.9.0 · 5741 in / 1357 out tokens · 49827 ms · 2026-05-22T10:14:57.329043+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective... $L_{\mathrm{dyn}}(\psi, \phi) = \mathbb{E}\left[1 - \cos\left(g_\phi(c_t, a_t), \mathrm{sg}(z_{t+1})\right)\right]$
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat unclear

We instantiate the history backbone using a State-Space Model (SSM) and define the context representation $c_t$ as the final output token... Mamba as the history encoder

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 21 internal anchors

[1]

Openpi Comet: Competition Solution For 2025 BEHA VIOR Challenge, December 2025

Junjie Bai, Yu-Wei Chao, Qizhi Chen, Jinwei Gu, Moo Jin Kim, Zhaoshuo Li, Xuan Li, Tsung- Yi Lin, Ming-Yu Liu, Nic Ma, Kaichun Mo, Delin Qu, Shangkun Sun, Hongchi Xia, Fangyin Wei, and Xiaohui Zeng. Openpi Comet: Competition Solution For 2025 BEHA VIOR Challenge, December 2025. URLhttp://arxiv.org/abs/2512.10071. arXiv:2512.10071 [cs]

work page arXiv 2025
[2]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalash- n...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models, June 2025

Jiahang Cao, Qiang Zhang, Jingkai Sun, Jiaxu Wang, Hao Cheng, Yulin Li, Jun Ma, Kun Wu, Zhiyuan Xu, Yecheng Shao, Wen Zhao, Gang Han, Yijie Guo, and Renjing Xu. Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models, June 2025. URL http://arxiv.org/abs/2409.07163. arXiv:2409.07163 [cs]

work page arXiv 2025
[4]

Gaf: Gaussian action field as a 4d representation for dynamic world modeling in robotic manipulation, 2025

Ying Chai, Litao Deng, Ruizhi Shao, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Xiang Li, Hongwen Zhang, and Yebin Liu. Gaf: Gaussian action field as a 4d representation for dynamic world modeling in robotic manipulation, 2025. URL https://arxiv.org/abs/2506.14135

work page arXiv 2025
[5]

History-Aware Visuomotor Policy Learning via Point Tracking, March 2026

Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, and Cewu Lu. History-Aware Visuomotor Policy Learning via Point Tracking, March 2026. URL http://arxiv.org/abs/ 2509.17141. arXiv:2509.17141 [cs]

work page arXiv 2026
[6]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. RoboTwin 2.0: A Scalable D...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, March 2024. URLhttp://arxiv.org/abs/2303.04137. arXiv:2303.04137 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality, May 2024. URL http://arxiv.org/abs/2405. 21060. arXiv:2405.21060 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Dream2flow: Bridging video generation and open-world manipulation with 3d object flow, 2025

Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, and Ruohan Zhang. Dream2flow: Bridging video generation and open-world manipulation with 3d object flow, 2025. URL https://arxiv.org/abs/2512.24766

work page arXiv 2025
[10]

Omp: One-step meanflow policy with directional alignment, 2026

Han Fang, Yize Huang, Yuheng Zhao, Paul Weng, Xiao Li, and Yutong Ban. Omp: One-step meanflow policy with directional alignment, 2026. URL https://arxiv.org/abs/2512. 19347

work page 2026
[11]

Learning video generation for robotic manipulation with collaborative trajectory control,

Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, and Dahua Lin. Learning video generation for robotic manipulation with collaborative trajectory control,

work page
[12]

URLhttps://arxiv.org/abs/2506.01943

work page arXiv
[13]

Vita: Vision-to-action flow matching policy, 2026

Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, and Iman Soltani. Vita: Vision-to-action flow matching policy, 2026. URL https://arxiv.org/abs/2507.13231. 10

work page arXiv 2026
[14]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces, May 2024. URLhttp://arxiv.org/abs/2312.00752. arXiv:2312.00752 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

Youqiang Gui, Yuxuan Zhou, Shen Cheng, Xinyang Yuan, Haoqiang Fan, Peng Cheng, and Shuaicheng Liu. SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation, March 2026. URL http://arxiv.org/abs/2603.05117. arXiv:2603.05117 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025

Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, and Qing Li. Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025. URL https://arxiv.org/abs/2505.10075

work page arXiv 2025
[17]

Ctrl-world: A controllable generative world model for robot manipulation, 2026

Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation, 2026. URL https://arxiv.org/abs/2510. 10125

work page 2026
[18]

Causal Confusion in Imitation Learning, November 2019

Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal Confusion in Imitation Learning, November 2019. URLhttp://arxiv.org/abs/1905.11979. arXiv:1905.11979 [cs]

work page arXiv 2019
[19]

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. URLhttps://arxiv.org/abs/2006.11239

work page internal anchor Pith review Pith/arXiv arXiv 2020
[20]

AdaFlow: Imitation Learning with Variance- Adaptive Flow-Based Policies, November 2024

Xixi Hu, Bo Liu, Xingchao Liu, and Qiang Liu. AdaFlow: Imitation Learning with Variance- Adaptive Flow-Based Policies, November 2024. URLhttp://arxiv.org/abs/2402.04292. arXiv:2402.04292 [cs]

work page arXiv 2024
[21]

Graphcot- vla: A 3d spatial-aware reasoning vision-language-action model for robotic manipulation with ambiguous instructions, 2025

Helong Huang, Min Cen, Kai Tan, Xingyue Quan, Guowei Huang, and Hong Zhang. Graphcot- vla: A 3d spatial-aware reasoning vision-language-action model for robotic manipulation with ambiguous instructions, 2025. URLhttps://arxiv.org/abs/2508.07650

work page arXiv 2025
[22]

ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context, October

Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, and Jinwoo Shin. ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context, October

work page
[23]

arXiv:2510.04246 [cs]

URLhttp://arxiv.org/abs/2510.04246. arXiv:2510.04246 [cs]

work page arXiv
[24]

arXiv preprint arXiv:2406.08234 (2024)

Xiaogang Jia, Qian Wang, Atalay Donat, Bowen Xing, Ge Li, Hongyi Zhou, Onur Celik, Denis Blessing, Rudolf Lioutikov, and Gerhard Neumann. MaIL: Improving Imitation Learning with Mamba, November 2024. URL http://arxiv.org/abs/2406.08234. arXiv:2406.08234 [cs]

work page arXiv 2024
[25]

Wang, Hanyi Zhang, Qian Wang, Rudolf Lioutikov, and Gerhard Neumann

Xiaogang Jia, Atalay Donat, Xi Huang, Xuan Zhao, Denis Blessing, Hongyi Zhou, Han A. Wang, Hanyi Zhang, Qian Wang, Rudolf Lioutikov, and Gerhard Neumann. X-IL: Exploring the Design Space of Imitation Learning Policies, February 2025. URL http://arxiv.org/ abs/2502.12330. arXiv:2502.12330 [cs]

work page arXiv 2025
[26]

AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, and Biqing Qi. Asyncvla: Asynchronous flow matching for vision-language-action models, 2025. URL https://arxiv.org/abs/ 2511.14148

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation, 2026

Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dongbin Zhao. World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation, 2026. URL https://arxiv.org/abs/2509. 19080

work page 2026
[28]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An Open-Source Vision-Language-Action Model, September 2024. URL http://arxiv....

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URLhttps://arxiv.org/abs/2502.19645. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, and Jinwoo Shin. HAMLET: Switch your Vision-Language-Action Model into a History- Aware Policy, April 2026. URL http://arxiv.org/abs/2510.00695. arXiv:2510.00695 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Li, Berlin Chen, Caitlin Wang, Aviv Bick, J

Aakash Lahoti, Kevin Y . Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved Sequence Modeling using State Space Principles, March 2026. URLhttp://arxiv.org/abs/2603.15569. arXiv:2603.15569 [cs]

work page arXiv 2026
[32]

Behavior generation with latent actions

Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior Generation with Latent Actions, June 2024. URL http://arxiv. org/abs/2403.03181. arXiv:2403.03181 [cs]

work page arXiv 2024
[33]

End-to-End Training of Deep Visuomotor Policies

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies, April 2016. URL http://arxiv.org/abs/1504.00702. arXiv:1504.00702 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2016
[34]

CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling, October 2025

Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, and Jiangmiao Pang. CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling, October 2025. URL http://arxiv.org/abs/2506.19816. arXiv:2506.19816 [cs]

work page arXiv 2025
[35]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation, March 2025. URLhttp://arxiv.org/abs/2410.07864. arXiv:2410.07864 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Gwm: Towards scalable gaussian world models for robotic manipulation, 2025

Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Yansong Tang, and Siyuan Huang. Gwm: Towards scalable gaussian world models for robotic manipulation, 2025. URL https://arxiv.org/abs/2508.17600

work page arXiv 2025
[37]

H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning, May

Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, and Huazhe Xu. H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning, May

work page
[38]

arXiv:2505.07819 [cs] version: 1

URLhttp://arxiv.org/abs/2505.07819. arXiv:2505.07819 [cs] version: 1

work page arXiv
[39]

CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion, August 2025

Jiahua Ma, Yiran Qin, Yixiong Li, Xuanqi Liao, Yulan Guo, and Ruimao Zhang. CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion, August 2025. URL http://arxiv.org/abs/2506.14769. arXiv:2506.14769 [cs]

work page arXiv 2025
[40]

BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames, February 2026

Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, and Aviral Kumar. BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames, February 2026. URL http://arxiv.org/abs/2602.15010. arXiv:2602.15010 [cs]

work page arXiv 2026
[41]

Dispo: Diffusion-ssm based policy learning for coarse-to-fine action discretization, 2026

Nayoung Oh, Jaehyeong Jang, Moonkyeong Jung, and Daehyung Park. Dispo: Diffusion-ssm based policy learning for coarse-to-fine action discretization, 2026. URL https://arxiv. org/abs/2409.14719

work page arXiv 2026
[42]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable Diffusion Models with Transformers, March 2023. URLhttp://arxiv.org/abs/2212.09748. arXiv:2212.09748 [cs]

[43] Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, and Jeannette Bohg. Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation, June 2024. URL http://arxiv.org/abs/2405.07503. arXiv:2405.07503 [cs]

[44] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action model, 2025. URL https://arxiv.org/abs/2501.15830

[45] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations, June 2018. URL http://arxiv.org/abs/1709.10087. arXiv:1709.10087 [cs]

[46] Nur Muhammad Mahi Shafiullah, Zichen Jeff Cui, Ariuntuya Altanzaya, and Lerrel Pinto. Behavior Transformers: Cloning $k$ modes with one stone, October 2022. URL http://arxiv.org/abs/2206.11251. arXiv:2206.11251 [cs]

[47] Juyi Sheng, Ziyi Wang, Peiming Li, and Mengyuan Liu. MP1: MeanFlow Tames Policy Learning in 1-step for Robotic Manipulation, October 2025. URL http://arxiv.org/abs/2507.10543. arXiv:2507.10543 [cs]

[48] Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, and Zhiwei Steven Wu. Sequence Model Imitation Learning with Unobserved Contexts, January 2023. URL http://arxiv.org/abs/2208.02225. arXiv:2208.02225 [cs]

[49] Jiaming Tang, Yufei Sun, Yilong Zhao, Shang Yang, Yujun Lin, Zhuoyang Zhang, James Hou, Yao Lu, Zhijian Liu, and Song Han. Vlash: Real-time vlas via future-state-aware asynchronous inference, 2025. URL https://arxiv.org/abs/2512.01031

[50] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213

[51] Marcel Torne, Andy Tang, Yuejiang Liu, and Chelsea Finn. Learning Long-Context Diffusion Policies via Past-Token Prediction, May 2025. URL http://arxiv.org/abs/2505.09561. arXiv:2505.09561 [cs]

[52] Toshiaki Tsuji. Mamba as a Motion Encoder for Robotic Imitation Learning. IEEE Access, 13:69941–69949, 2025. ISSN 2169-3536. doi: 10.1109/ACCESS.2025.3561283. URL https://ieeexplore.ieee.org/document/10966860/

[53] Yi-Lin Wei, Haoran Liao, Yuhao Lin, Pengyue Wang, Zhizhao Liang, Guiliang Liu, and Wei-Shi Zheng. Cyclemanip: Enabling cyclic task manipulation via effective historical perception and understanding. arXiv preprint arXiv:2512.01022, 2025

[54] Chuan Wen, Jierui Lin, Trevor Darrell, Dinesh Jayaraman, and Yang Gao. Fighting Copycat Agents in Behavioral Cloning from Observation Histories, October 2020. URL http://arxiv.org/abs/2010.14876. arXiv:2010.14876 [cs]

[55] Chuan Wen, Jierui Lin, Jianing Qian, Yang Gao, and Dinesh Jayaraman. Keyframe-Focused Visual Imitation Learning, June 2021. URL http://arxiv.org/abs/2106.06452. arXiv:2106.06452 [cs]

[56] Junlin Xie, Xu Luo, Hao Wu, Ji Zhang, Youguang Xing, Lianli Gao, and Jingkuan Song. In-context adaptation for generalizable imitation learning. In CoRL 2025 Workshop RemembeRL

[57] Ge Yan, Jiyue Zhu, Yuquan Deng, Shiqi Yang, Ri-Zhao Qiu, Xuxin Cheng, Marius Memmel, Ranjay Krishna, Ankit Goyal, Xiaolong Wang, and Dieter Fox. ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training, September 2025. URL http://arxiv.org/abs/2509.01819. arXiv:2509.01819 [cs]

[58] Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M. Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, and Anirudha Majumdar. Playworld: Learning robot world models from autonomous play, 2026. URL https://arxiv.org/abs/2603.09030

[59] Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, and Peter Stone. RoboSSM: Scalable In-context Imitation Learning via State-Space Models, September 2025. URL http://arxiv.org/abs/2509.19658. arXiv:2509.19658 [cs]

[61] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shively, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning, June 2021. URL http://arxiv.org/abs/1910.10897. arXiv:1910.10897 [cs]

[62] Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations, September 2024. URL http://arxiv.org/abs/2403.03954. arXiv:2403.03954 [cs]

[63] Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Ayzaan Wahid, Vikas Sindhwani, and Johnny Lee. Transporter Networks: Rearranging the Visual World for Robotic Manipulation, January 2022. URL http://arxiv.org/abs/2010.14406. arXiv:2010.14406 [cs]

[64] Qinglun Zhang, Zhen Liu, Haoqiang Fan, Guanghui Liu, Bing Zeng, and Shuaicheng Liu. Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14754–14762, 2025

[65] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, April 2023. URL http://arxiv.org/abs/2304.13705. arXiv:2304.13705 [cs]

[66] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024

[67] Yulin Zhou, Yuankai Lin, Fanzhe Peng, Jiahui Chen, Kaiji Huang, Hua Yang, and Zhouping Yin. MTIL: Encoding Full History With Mamba for Temporal Imitation Learning. IEEE Robotics and Automation Letters, 10(11):11761–11767, November 2025. ISSN 2377-3766, 2377-3774. doi: 10.1109/LRA.2025.3615520. URL https://ieeexplore.ieee.org/document/11184145/

[68] Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. Irasim: A fine-grained world model for robot manipulation, 2025. URL https://arxiv.org/abs/2406.14540.

A Preliminaries

Diffusion Policy. Diffusion Policy [7] adapts Denoising Diffusion Probabilistic Models (DDPMs) [18] to action generation. The policy treats a future action sequen...

For a conditioning variable $X$, the minimum achievable Mean Squared Error (MSE) loss is the expected conditional variance of the expert action $a_t$ calculated over the expert dataset $D_E$:

$$L^*(X) = \mathbb{E}_{(X, a_t) \sim D_E}\left[\mathrm{Var}(a_t \mid X)\right]. \tag{17}$$

Specifically, we denote the optimal losses for reactive and history-conditioned policies as: $L^*(o_t) = \mathbb{E}_{o_t \sim D_E}\left[\mathrm{Var}(a_t \mid o_t)\right]$ and $L^*(h_t) = \mathbb{E}_{h\ldots}$
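As a hedged illustration of Eq. (17), the minimal Python sketch below computes the empirical expected conditional variance on a toy dataset. The `optimal_mse` helper, the tuple layout, and the `going`/`returning` phase labels are assumptions made up for this example and do not come from the paper. It only shows that the same observation can carry irreducible action variance that disappears once a history-dependent variable is added to the conditioning set.

```python
from collections import defaultdict


def optimal_mse(samples, key):
    """Empirical E[Var(a | X)] for a scalar action.

    samples: list of (history, observation, action) tuples (hypothetical layout).
    key: function mapping one sample to its conditioning variable X.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample[2])

    total_sq_dev = 0.0
    for actions in groups.values():
        mean = sum(actions) / len(actions)
        # Sum of squared deviations equals len(actions) * Var(a | X = x).
        total_sq_dev += sum((a - mean) ** 2 for a in actions)
    # Dividing by the dataset size weights each group by its empirical probability.
    return total_sq_dev / len(samples)


# Toy expert data: the observation "at_door" occurs in two different task phases.
samples = [
    ("going", "at_door", 1.0),
    ("going", "at_door", 1.0),
    ("returning", "at_door", -1.0),
    ("returning", "at_door", -1.0),
]

loss_reactive = optimal_mse(samples, key=lambda s: s[1])         # X = o_t
loss_history = optimal_mse(samples, key=lambda s: (s[0], s[1]))  # X = h_t
print(loss_reactive, loss_history)  # prints: 1.0 0.0
```

In this toy case the reactive conditioning leaves a loss floor of 1.0 because the two phases demand opposite actions for the same observation, while conditioning on the phase removes it entirely. Real datasets would involve continuous, high-dimensional variables, so the grouping step here is only a stand-in for the conditional expectation.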


[36] [36]

Gwm: Towards scalable gaussian world models for robotic manipulation, 2025

Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Yansong Tang, and Siyuan Huang. Gwm: Towards scalable gaussian world models for robotic manipulation, 2025. URL https://arxiv.org/abs/2508.17600

work page arXiv 2025

[37] [37]

H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning, May

Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, and Huazhe Xu. H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning, May

work page

[38] [38]

arXiv:2505.07819 [cs] version: 1

URLhttp://arxiv.org/abs/2505.07819. arXiv:2505.07819 [cs] version: 1

work page arXiv

[39] [39]

CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion, August 2025

Jiahua Ma, Yiran Qin, Yixiong Li, Xuanqi Liao, Yulan Guo, and Ruimao Zhang. CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion, August 2025. URL http://arxiv.org/abs/2506.14769. arXiv:2506.14769 [cs]

work page arXiv 2025

[40] [40]

BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames, February 2026

Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, and Aviral Kumar. BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames, February 2026. URL http://arxiv.org/abs/2602.15010. arXiv:2602.15010 [cs]

work page arXiv 2026

[41] [41]

Dispo: Diffusion-ssm based policy learning for coarse-to-fine action discretization, 2026

Nayoung Oh, Jaehyeong Jang, Moonkyeong Jung, and Daehyung Park. Dispo: Diffusion-ssm based policy learning for coarse-to-fine action discretization, 2026. URL https://arxiv. org/abs/2409.14719

work page arXiv 2026

[42] [42]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable Diffusion Models with Transformers, March 2023. URLhttp://arxiv.org/abs/2212.09748. arXiv:2212.09748 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation, June 2024

Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, and Jeannette Bohg. Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation, June 2024. URL http://arxiv. org/abs/2405.07503. arXiv:2405.07503 [cs]

work page arXiv 2024

[44] [44]

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action model, 2025. URLhttps://arxiv.org/abs/2501.15830

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforce- ment Learning and Demonstrations, June 2018. URL http://arxiv.org/abs/1709.10087. arXiv:1709.10087 [cs]. 12

work page internal anchor Pith review Pith/arXiv arXiv 2018

[46] [46]

Behavior Transformers: Cloning $k$ modes with one stone, October 2022

Nur Muhammad Mahi Shafiullah, Zichen Jeff Cui, Ariuntuya Altanzaya, and Lerrel Pinto. Behavior Transformers: Cloning $k$ modes with one stone, October 2022. URL http: //arxiv.org/abs/2206.11251. arXiv:2206.11251 [cs]

work page arXiv 2022

[47] [47]

MP1: MeanFlow Tames Policy Learning in 1-step for Robotic Manipulation, October 2025

Juyi Sheng, Ziyi Wang, Peiming Li, and Mengyuan Liu. MP1: MeanFlow Tames Policy Learning in 1-step for Robotic Manipulation, October 2025. URL http://arxiv.org/abs/ 2507.10543. arXiv:2507.10543 [cs]

work page arXiv 2025

[48] [48]

Andrew Bagnell, and Zhiwei Steven Wu

Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, and Zhiwei Steven Wu. Sequence Model Imitation Learning with Unobserved Contexts, January 2023. URL http://arxiv. org/abs/2208.02225. arXiv:2208.02225 [cs]

work page arXiv 2023

[49] [49]

Vlash: Real-time vlas via future-state-aware asynchronous inference, 2025

Jiaming Tang, Yufei Sun, Yilong Zhao, Shang Yang, Yujun Lin, Zhuoyang Zhang, James Hou, Yao Lu, Zhijian Liu, and Song Han. Vlash: Real-time vlas via future-state-aware asynchronous inference, 2025. URLhttps://arxiv.org/abs/2512.01031

work page arXiv 2025

[50] [50]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

Learning Long-Context Diffusion Policies via Past-Token Prediction, May 2025

Marcel Torne, Andy Tang, Yuejiang Liu, and Chelsea Finn. Learning Long-Context Diffusion Policies via Past-Token Prediction, May 2025. URL http://arxiv.org/abs/2505.09561. arXiv:2505.09561 [cs]

work page arXiv 2025

[52] [52]

Mamba as a Motion Encoder for Robotic Imitation Learning.IEEE Access, 13:69941–69949, 2025

Toshiaki Tsuji. Mamba as a Motion Encoder for Robotic Imitation Learning.IEEE Access, 13:69941–69949, 2025. ISSN 2169-3536. doi: 10.1109/ACCESS.2025.3561283. URL https://ieeexplore.ieee.org/document/10966860/

work page doi:10.1109/access.2025.3561283 2025

[53] [53]

Cyclemanip: Enabling cyclic task manipulation via effective historical perception and understanding.arXiv preprint arXiv:2512.01022, 2025

Yi-Lin Wei, Haoran Liao, Yuhao Lin, Pengyue Wang, Zhizhao Liang, Guiliang Liu, and Wei-Shi Zheng. Cyclemanip: Enabling cyclic task manipulation via effective historical perception and understanding.arXiv preprint arXiv:2512.01022, 2025

work page arXiv 2025

[54] [54]

Fighting Copycat Agents in Behavioral Cloning from Observation Histories, October 2020

Chuan Wen, Jierui Lin, Trevor Darrell, Dinesh Jayaraman, and Yang Gao. Fighting Copycat Agents in Behavioral Cloning from Observation Histories, October 2020. URL http://arxiv. org/abs/2010.14876. arXiv:2010.14876 [cs]

work page arXiv 2020

[55] [55]

Keyframe- Focused Visual Imitation Learning, June 2021

Chuan Wen, Jierui Lin, Jianing Qian, Yang Gao, and Dinesh Jayaraman. Keyframe- Focused Visual Imitation Learning, June 2021. URL http://arxiv.org/abs/2106.06452. arXiv:2106.06452 [cs]

work page arXiv 2021

[56] [56]

In- context adaptation for generalizable imitation learning

Junlin Xie, Xu Luo, Hao Wu, Ji Zhang, Youguang Xing, Lianli Gao, and Jingkuan Song. In- context adaptation for generalizable imitation learning. InCoRL 2025 Workshop RemembeRL

work page 2025

[57] [57]

ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training, September 2025

Ge Yan, Jiyue Zhu, Yuquan Deng, Shiqi Yang, Ri-Zhao Qiu, Xuxin Cheng, Marius Memmel, Ranjay Krishna, Ankit Goyal, Xiaolong Wang, and Dieter Fox. ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training, September 2025. URL http://arxiv. org/abs/2509.01819. arXiv:2509.01819 [cs]

work page arXiv 2025

[58] [58]

PlayWorld: Learning Robot World Models from Autonomous Play

Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M. Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, and Anirudha Majumdar. Playworld: Learning robot world models from autonomous play, 2026. URL https://arxiv.org/abs/ 2603.09030

work page internal anchor Pith review Pith/arXiv arXiv 2026

[59] [59]

RoboSSM: Scalable In-context Imitation Learning via State-Space Models, September

Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, and Peter Stone. RoboSSM: Scalable In-context Imitation Learning via State-Space Models, September

work page

[60] [60]

arXiv:2509.19658 [cs]

URLhttp://arxiv.org/abs/2509.19658. arXiv:2509.19658 [cs]

work page arXiv

[61] [61]

Meta-World: A Bench- mark and Evaluation for Multi-Task and Meta Reinforcement Learning, June 2021

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shively, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A Bench- mark and Evaluation for Multi-Task and Meta Reinforcement Learning, June 2021. URL http://arxiv.org/abs/1910.10897. arXiv:1910.10897 [cs]. 13

work page arXiv 2021

[62] [62]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations, September 2024. URLhttp://arxiv.org/abs/2403.03954. arXiv:2403.03954 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[63] [63]

Transporter Networks: Rearranging the Visual World for Robotic Manipulation, January 2022

Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Ayzaan Wahid, Vikas Sindhwani, and Johnny Lee. Transporter Networks: Rearranging the Visual World for Robotic Manipulation, January 2022. URLhttp://arxiv.org/abs/2010.14406. arXiv:2010.14406 [cs]

work page arXiv 2022

[64] [64]

Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation

Qinglun Zhang, Zhen Liu, Haoqiang Fan, Guanghui Liu, Bing Zeng, and Shuaicheng Liu. Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14754–14762, 2025

work page 2025

[65] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, April 2023. URL http://arxiv.org/abs/2304.13705. arXiv:2304.13705 [cs]


[66] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024


[67] Yulin Zhou, Yuankai Lin, Fanzhe Peng, Jiahui Chen, Kaiji Huang, Hua Yang, and Zhouping Yin. MTIL: Encoding Full History With Mamba for Temporal Imitation Learning. IEEE Robotics and Automation Letters, 10(11):11761–11767, November 2025. ISSN 2377-3766, 2377-3774. doi: 10.1109/LRA.2025.3615520. URL https://ieeexplore.ieee.org/document/11184145/


[68] Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. Irasim: A fine-grained world model for robot manipulation, 2025. URL https://arxiv.org/abs/2406.14540.

A Preliminaries

Diffusion Policy. Diffusion Policy [7] adapts Denoising Diffusion Probabilistic Models (DDPMs) [18] to action generation. The policy treats a future action sequen...



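For a conditioning variable $X$, the minimum achievable Mean Squared Error (MSE) loss is the expected conditional variance of the expert action $a_t$ calculated over the expert dataset $\mathcal{D}_E$:

$$\mathcal{L}^*(X) = \mathbb{E}_{(X, a_t) \sim \mathcal{D}_E}\left[\mathrm{Var}(a_t \mid X)\right]. \tag{17}$$

Specifically, we denote the optimal losses for reactive and history-conditioned policies as: $\mathcal{L}^*(o_t) = \mathbb{E}_{o_t \sim \mathcal{D}_E}\left[\mathrm{Var}(a_t \mid o_t)\right]$ and $\mathcal{L}^*(h_t) = \mathbb{E}$ h...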
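The bound in Eq. (17) can be checked numerically on a small example. The sketch below is illustrative only: the toy records, the observation labels (`from_left`, `center`, `goal`), and the helper functions are hypothetical and do not come from the paper. It computes the expected conditional variance of the action under two conditioning variables, the current observation alone and a two-step history, and confirms that the conditional-mean predictor attains exactly that MSE.

```python
import math
from collections import defaultdict
from statistics import fmean, pvariance


def group_actions(samples):
    """Group scalar actions by their (hashable) conditioning variable."""
    groups = defaultdict(list)
    for x, a in samples:
        groups[x].append(a)
    return groups


def expected_conditional_variance(samples):
    """Empirical E_x[Var(a | x)]: the lowest MSE reachable when only x is seen."""
    groups = group_actions(samples)
    n = len(samples)
    return sum(len(acts) / n * pvariance(acts) for acts in groups.values())


def mse_of_conditional_mean(samples):
    """MSE of the predictor that outputs the mean action for each x."""
    means = {x: fmean(acts) for x, acts in group_actions(samples).items()}
    return fmean([(a - means[x]) ** 2 for x, a in samples])


# Hypothetical expert records: (previous observation, current observation, action).
# At "center" the correct action depends on where the robot came from.
records = [
    ("from_left", "center", 1.0),
    ("from_left", "center", 1.0),
    ("from_right", "center", -1.0),
    ("from_right", "center", -1.0),
    ("from_left", "goal", 0.0),
    ("from_right", "goal", 0.0),
]

# Reactive conditioning sees only the current observation.
reactive = [(obs, act) for _, obs, act in records]
# History conditioning sees the previous and current observation together.
history = [((prev, obs), act) for prev, obs, act in records]

loss_reactive = expected_conditional_variance(reactive)
loss_history = expected_conditional_variance(history)

print("reactive optimum:", loss_reactive)
print("history optimum:", loss_history)
print("conditional-mean MSE (reactive):", mse_of_conditional_mean(reactive))
```

In this toy dataset the reactive optimum is 2/3, because four of the six samples share the ambiguous observation `center` with actions of unit variance, while the history-conditioned optimum is 0, since the previous observation resolves the ambiguity.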
