CoLA-Flow Policy: Temporally Coherent Imitation Learning via Continuous Latent Action Flow Matching for Robotic Manipulation

Jiang Zhiduo; Liu Hong; Liu Yang; Sun Wandong; Wu Songwei; Xie Guanghu; Zhao Rui

arxiv: 2601.23087 · v4 · pith:OZB7D6Z7new · submitted 2026-01-30 · 💻 cs.RO

Wu Songwei, Jiang Zhiduo, Sun Wandong, Xie Guanghu, Zhao Rui, Liu Hong, Liu Yang

Pith reviewed 2026-05-21 14:34 UTC · model grok-4.3

classification 💻 cs.RO

keywords robotic manipulation · imitation learning · flow matching · latent space · trajectory generation · robotic policy

The pith

Performing flow matching in a continuous latent action space produces smooth, stable robotic trajectories with near-single-step speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CoLA-Flow Policy to address challenges in long-horizon robotic manipulation by combining expressive modeling, real-time inference, and stable execution. It encodes action sequences into temporally coherent latent trajectories and performs flow matching in that latent space instead of raw action space. This decouples global motion from low-level noise, leading to better smoothness and success rates. The method also incorporates geometry-aware point cloud conditioning for real-world robustness. Experiments demonstrate significant improvements over baselines.

Core claim

CoLA-Flow Policy is a trajectory-level imitation learning framework that performs flow matching in a continuous latent action space. By encoding action sequences into temporally coherent latent trajectories and learning an explicit latent-space flow, it decouples global motion structure from low-level control noise, enabling smooth and reliable long-horizon execution while achieving near-single-step inference.

What carries the argument

Continuous latent action flow matching, which operates flow matching on encoded temporally coherent latent trajectories rather than directly on raw actions.
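A minimal sketch of what flow matching on a latent trajectory can look like, assuming the common linear-path (conditional) flow matching formulation. This page does not state the paper's actual path, loss, or network, so the function names, the latent shape (8 steps by 16 dimensions), and the oracle velocity field below are illustrative assumptions only.

```python
import numpy as np

def fm_training_pair(z0, z1, t):
    """Linear-path flow matching (an assumed formulation, not confirmed by the paper).

    z0: noise sample, z1: encoded latent trajectory, t: scalar in [0, 1].
    Returns the interpolated point z_t and the regression target z1 - z0.
    """
    zt = (1.0 - t) * z0 + t * z1
    v_target = z1 - z0
    return zt, v_target

def fm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_fn, z0, num_steps=1):
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = z0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity_fn(z, k * dt)
    return z

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 16))  # noise; hypothetical latent shape
z1 = rng.standard_normal((8, 16))  # stand-in for an encoded demonstration

zt, v_target = fm_training_pair(z0, z1, t=0.3)

# Oracle field for this single pair: constant velocity along the straight path.
# A trained network would replace this and would also take the observation as input.
oracle = lambda z, t: z1 - z0
z_one_step = euler_sample(oracle, z0, num_steps=1)
```

With an exactly straight path, a single Euler step lands on the target, which is the intuition behind near-single-step inference. A learned field is only approximately straight, so a few steps may still be needed in practice, and a decoder (not shown) would map the sampled latent back to actions.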

If this is right

Improves trajectory smoothness by up to 93.7% compared to raw action-space flow baselines.
Boosts task success rates by up to 25 percentage points.
Achieves near-single-step inference while being significantly faster than diffusion-based policies.
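The two headline numbers use different units: the smoothness figure is a relative change in percent, while the success figure is an absolute difference in percentage points. The sketch below illustrates the arithmetic. Mean squared jerk is used as the smoothness metric purely as an assumption, since this page does not say which metric the paper uses, and all input values are hypothetical.

```python
import numpy as np

def mean_squared_jerk(traj, dt):
    """Mean squared third finite difference of a (T, D) trajectory; lower is smoother.

    Assumed metric for illustration; the paper's metric is not given on this page.
    """
    jerk = np.diff(traj, n=3, axis=0) / dt ** 3
    return float(np.mean(jerk ** 2))

def relative_reduction_percent(baseline, ours):
    """Relative improvement of a lower-is-better metric, in percent."""
    return 100.0 * (baseline - ours) / baseline

def percentage_point_gain(baseline_rate, our_rate):
    """Absolute difference between two rates in [0, 1], in percentage points."""
    return 100.0 * (our_rate - baseline_rate)

# Hypothetical trajectories: a clean arc and the same arc with small jitter.
t = np.linspace(0.0, 1.0, 200)
dt = t[1] - t[0]
smooth = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
rng = np.random.default_rng(0)
jittery = smooth + 0.01 * rng.standard_normal(smooth.shape)

msj_smooth = mean_squared_jerk(smooth, dt)
msj_jittery = mean_squared_jerk(jittery, dt)

# Hypothetical numbers chosen only to reproduce the shape of the claims.
smoothness_gain = relative_reduction_percent(100.0, 6.3)  # relative, in percent
success_gain = percentage_point_gain(0.60, 0.85)          # absolute, in points
```

A 25 percentage point gain from 60% to 85% is a relative gain of roughly 42%, so the two units should not be compared directly.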

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Applying similar latent flow techniques could improve stability in other generative control methods beyond robotics.
Testing the approach on a wider range of manipulation tasks might reveal its scalability to more complex environments.

Load-bearing premise

Encoding action sequences into temporally coherent latent trajectories successfully decouples global motion structure from low-level control noise to enable stable execution.

What would settle it

Observing no significant improvement in trajectory smoothness or task success when using the latent flow compared to direct raw action flow matching in long-horizon robotic tasks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2601.23087 by Jiang Zhiduo, Liu Hong, Liu Yang, Sun Wandong, Wu Songwei, Xie Guanghu, Zhao Rui.

**Figure 1.** Overall architecture of the proposed CoLA-Flow Policy. The system first encodes point cloud observations into geometry-aware scene features, then …

**Figure 2.** Trajectory-level latent action representation with recurrent encoding and …

**Figure 3.** Geometry-aware point cloud encoder. Local neighborhoods around …

**Figure 4.** Trajectory smoothness comparison across simulated manipulation tasks.

**Figure 5.** Real-world experimental setup and observations. Left: Franka Emika Panda robot with a LEAP Hand and the visual sensing setup (global L515 and …

**Figure 6.** Trajectory smoothness comparison across real-world manipulation tasks.

**Figure 7.** Comparison of real-world joint trajectories under identical initial con…

**Figure 8.** Ablation study on trajectory smoothness and task success rate in real …

Original abstract

Learning long-horizon robotic manipulation requires jointly achieving expressive behavior modeling, real-time inference, and stable execution, which remains challenging for existing generative policies. Diffusion-based approaches offer strong modeling capacity but incur high inference latency, while flow matching enables fast, near-single-step generation yet often suffers from unstable execution when operating directly in the raw action space. We propose Continuous Latent Action Flow Policy (CoLA-Flow Policy), a trajectory-level imitation learning framework that performs flow matching in a continuous latent action space. By encoding action sequences into temporally coherent latent trajectories and learning an explicit latent-space flow, CoLA-Flow Policy decouples global motion structure from low-level control noise, enabling smooth and reliable long-horizon execution. The framework further integrates geometry-aware point cloud conditioning and execution-time multimodal modulation, using visual cues as a representative modality to enhance real-world robustness. Experiments in simulation and on real robots show that CoLA-Flow Policy achieves near-single-step inference, improves trajectory smoothness by up to 93.7% and task success by up to 25 percentage points over raw action-space flow baselines, while remaining significantly faster than diffusion-based policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoLA-Flow moves flow matching into a continuous latent action space with point cloud conditioning to get faster inference and smoother robot trajectories, but the reported gains are hard to pin on the latent step alone.

The main thing to know about this paper is that it presents a flow-matching policy that operates in a continuous latent action space rather than directly on raw actions, combined with geometry-aware conditioning from point clouds. This setup is meant to deliver fast inference and smoother long-horizon trajectories for robotic manipulation tasks.

What the work does well is identify a practical issue with existing flow-based policies: they can be unstable when generating actions directly. By encoding sequences into latent trajectories that capture global motion structure, the model can focus the flow on coherent paths while handling noise separately. Adding visual conditioning and execution-time multimodal modulation helps bridge the sim-to-real gap. The reported results show clear advantages in speed over diffusion policies and better smoothness and success rates than raw flow baselines in both simulation and real-robot experiments. The framework appears to use standard techniques for the autoencoder and flow matching, which keeps it grounded. If the full paper includes proper implementation details and code, that would strengthen it.

On the downside, the abstract lacks specifics on error bars, exact dataset sizes, and ablation studies, making it hard to assess the robustness of the gains. The central hypothesis about decoupling global structure from low-level noise is plausible, but the stress-test concern is valid: the raw action-space baselines may not have received the same point cloud conditioning or modulation. Without those controls, it's difficult to attribute the improvements specifically to the latent space rather than the added conditioning elements. This is a moderate issue that could be fixed with targeted experiments.

Overall, the paper shows clear thinking on combining latent representations with modern generative methods for robotics. There are no major internal contradictions visible from the description.

This paper is aimed at researchers in robot learning who focus on imitation learning and efficient policy generation. Someone looking for ways to make flow-based methods more reliable for real-world deployment would find value here. I would recommend sending it for peer review. The idea is worth testing with fuller evidence, and a referee could help clarify the contributions.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CoLA-Flow Policy, a trajectory-level imitation learning framework for robotic manipulation that performs flow matching inside a continuous latent action space obtained by encoding action sequences. The core idea is that this latent-space flow decouples global motion structure from low-level control noise, yielding temporally coherent trajectories. The framework also integrates geometry-aware point cloud conditioning and execution-time multimodal modulation. Experiments in simulation and on real robots are reported to achieve near-single-step inference, up to 93.7% improvement in trajectory smoothness, and up to 25 percentage points higher task success relative to raw action-space flow baselines, while remaining faster than diffusion-based policies.

Significance. If the central claims are substantiated, the work would offer a practical route to combining the inference speed of flow matching with the execution stability needed for long-horizon robotic tasks. The explicit separation of latent motion structure from noise, together with visual conditioning, addresses a recognized tension between generative capacity and real-time reliability in imitation learning.

major comments (2)

[Abstract] The reported gains (93.7% smoothness, +25 pp success) are attributed to performing flow matching in the continuous latent action space that 'decouples global motion structure from low-level control noise.' However, the same paragraph states that geometry-aware point cloud conditioning and multimodal modulation are integral parts of the framework. Without an explicit statement or ablation confirming that the raw action-space flow baselines also receive identical conditioning, the performance delta cannot be isolated to the latent encoding step; the decoupling hypothesis therefore remains untested.
[Abstract] Quantitative results are presented without error bars, dataset sizes, number of evaluation trials, or any ablation isolating the latent-flow component from the conditioning modules. This absence prevents verification of whether the claimed improvements are statistically reliable or reproducible across the simulation and real-robot tasks.

minor comments (1)

The phrase 'up to' is used for the maximum reported improvements; indicating the specific tasks or conditions under which these peak values occur would aid interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results and experimental details.

Referee: [Abstract] The reported gains (93.7% smoothness, +25 pp success) are attributed to performing flow matching in the continuous latent action space that 'decouples global motion structure from low-level control noise.' However, the same paragraph states that geometry-aware point cloud conditioning and multimodal modulation are integral parts of the framework. Without an explicit statement or ablation confirming that the raw action-space flow baselines also receive identical conditioning, the performance delta cannot be isolated to the latent encoding step; the decoupling hypothesis therefore remains untested.

Authors: We thank the referee for highlighting this important point. In our experiments, the raw action-space flow baselines were implemented using the exact same geometry-aware point cloud conditioning and multimodal modulation as CoLA-Flow Policy to ensure a fair comparison. However, we agree that this equivalence was not stated with sufficient clarity in the abstract. To directly test the decoupling hypothesis, we will add an explicit ablation study in the revised manuscript that isolates the effect of the continuous latent action flow while holding all conditioning modules fixed. This will include quantitative comparisons of smoothness and task success for the latent versus raw-action variants under identical conditioning. revision: yes

Referee: [Abstract] Quantitative results are presented without error bars, dataset sizes, number of evaluation trials, or any ablation isolating the latent-flow component from the conditioning modules. This absence prevents verification of whether the claimed improvements are statistically reliable or reproducible across the simulation and real-robot tasks.

Authors: We acknowledge that the current abstract and some result summaries omit explicit error bars, precise dataset sizes, and trial counts. The full experimental section reports results averaged over multiple random seeds with standard deviations, using 100-500 demonstrations per task and 50-100 evaluation episodes in simulation (20-30 on the real robot). To improve verifiability, we will revise the abstract to reference these details and add error bars to key figures and tables. We will also incorporate the ablation isolating the latent-flow component (as described in the response to the first comment) to demonstrate statistical reliability and reproducibility. revision: yes
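To give a sense of why trial counts matter here, the sketch below computes a Wilson score interval for a binomial success rate. The counts (12 of 20 versus 17 of 20) are hypothetical and chosen only to match a 25 percentage point gap at the low end of the 20-30 real-robot episodes quoted in the response; they are not results from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half

# Hypothetical counts: baseline 12/20 (60%) vs. latent-flow policy 17/20 (85%).
lo_base, hi_base = wilson_interval(12, 20)
lo_ours, hi_ours = wilson_interval(17, 20)

# The gap is 25 percentage points, yet the two intervals still overlap at n=20.
intervals_overlap = hi_base > lo_ours
```

Overlapping intervals do not prove the gain is spurious, but they show why per-task trial counts and error bars are needed before a 25-point difference on a real robot can be read as reliable.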

Circularity Check

0 steps flagged

No circularity: standard latent flow-matching construction with empirical claims

The paper introduces CoLA-Flow as an encoding of action sequences into a continuous latent space followed by flow matching, plus point-cloud conditioning and multimodal modulation. These are presented as architectural choices whose benefits are measured empirically against baselines. No equations, uniqueness theorems, or self-citations are shown that would make the reported smoothness or success gains equivalent to the inputs by construction. The derivation chain remains independent of the claimed performance deltas.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that a learned latent space can separate motion structure from noise and that flow matching in that space yields stable decoded trajectories. No explicit free parameters or invented physical entities are named in the abstract.

free parameters (1)

latent dimension
Dimensionality of the continuous latent action space must be chosen to balance expressiveness and temporal coherence.

axioms (1)

domain assumption: Action sequences can be encoded into temporally coherent latent trajectories that decouple global structure from low-level noise.
This premise is invoked to justify performing flow matching in latent rather than raw action space.

invented entities (1)

Continuous Latent Action Space (no independent evidence)
purpose: Provides a smooth manifold on which flow matching produces temporally coherent trajectories.
New representational space introduced by the framework; no independent falsifiable prediction outside the paper is stated.

pith-pipeline@v0.9.0 · 5752 in / 1346 out tokens · 51465 ms · 2026-05-21T14:34:04.421903+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "performs flow matching in a continuous latent action space... decouples global motion structure from low-level control noise"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

work page 2025

[2] [2]

Vo-dp: Semantic-geometric adaptive diffusion policy for vision- only robotic manipulation,

Z. Ni, Y. He, L. Qian, J. Mao, F. Fu, W. Sui, H. Su, J. Peng, Z. Wang, and B. He, “Vo-dp: Semantic-geometric adaptive diffusion policy for vision- only robotic manipulation,”arXiv preprint arXiv:2510.15530, 2025

work page arXiv 2025

[3] [3]

3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,

Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,” inProceedings of Robotics: Science and Systems (RSS), 2024

work page 2024

[4] [4]

Dexcap: Scalable and portable mocap data collection system for dexterous manipulation,

C. Wang, H. Shi, W. Wang, R. Zhang, L. Fei-Fei, and C. K. Liu, “Dexcap: Scalable and portable mocap data collection system for dexterous manipulation,” inProceedings of Robotics: Science and Systems (RSS), 2024

work page 2024

[5] [5]

Hierarchical diffusion policy: ma- nipulation trajectory generation via contact guidance,

D. Wang, C. Liu, F. Chang, and Y. Xu, “Hierarchical diffusion policy: ma- nipulation trajectory generation via contact guidance,”IEEE Transactions on Robotics, 2025

work page 2025

[6] [6]

Flow Matching for Generative Modeling

Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,”arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[7] [7]

Flow matching on general geometries.arXiv preprint arXiv:2302.03660, 2023

R. T. Chen and Y. Lipman, “Flow matching on general geometries,”arXiv preprint arXiv:2302.03660, 2023

work page arXiv 2023

[8] [8]

Adaptive flow matching for resolving small- scale physics,

S. Fotiadis, N. D. Brenowitz, T. Geffner, Y. Cohen, M. Pritchard, A. Vahdat, and M. Mardani, “Adaptive flow matching for resolving small- scale physics,” inForty-second International Conference on Machine Learning, 2025

work page 2025

[9] [9]

Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation,

Q. Zhang, Z. Liu, H. Fan, G. Liu, B. Zeng, and S. Liu, “Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 14, 2025, pp. 14 754–14 762

work page 2025

[10] [10]

Fast and robust visuomotor riemannian flow matching policy,

H. Ding, N. Jaquier, J. Peters, and L. Rozo, “Fast and robust visuomotor riemannian flow matching policy,”IEEE Transactions on robotics, 2025

work page 2025

[11] [11]

Riemannian flow matching policy for robot motion learning,

M. Braun, N. Jaquier, L. D. Rozo, and T. Asfour, “Riemannian flow matching policy for robot motion learning,”2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5144–5151, 2024

work page 2024

[12] [12]

Generalizable humanoid manipulation with 3d diffusion policies,

Y. Ze, Z. Chen, W. Wang, T. Chen, X. He, Y. Yuan, X. B. Peng, and J. Wu, “Generalizable humanoid manipulation with 3d diffusion policies,” in2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 2873–2880

work page 2025

[13] [13]

Reactive diffusion policy: Slow-fast visual-tactile policy learning for contact-rich manipulation,

H. Xue, J. Ren, W. Chen, G. Zhang, Y. Fang, G. Gu, H. Xu, and C. Lu, “Reactive diffusion policy: Slow-fast visual-tactile policy learning for contact-rich manipulation,” inProceedings of Robotics: Science and Systems (RSS), 2025

work page 2025

[14] [14]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840– 6851, 2020

work page 2020

[15] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in International Conference on Learning Representations, 2021

[16] P. Florence, C. Lynch, A. Zeng, O. A. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson, “Implicit behavioral cloning,” in Conference on robot learning. PMLR, 2022, pp. 158–168

[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

[18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” ICLR, 2021

[19] R. Wang, J. Zhang, J. Chen, Y. Xu, P. Li, T. Liu, and H. Wang, “Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation,” arXiv preprint arXiv:2210.02697, 2022

[20] J. Zhang, H. Liu, D. Li, X. Yu, H. Geng, Y. Ding, J. Chen, and H. Wang, “Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes,” in 8th Annual Conference on Robot Learning, 2024

[21] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660

[22] D. Turpin, T. Zhong, S. Zhang, G. Zhu, E. Heiden, M. Macklin, S. Tsogkas, S. Dickinson, and A. Garg, “Fast-grasp’d: Dexterous multi-finger grasp generation through differentiable simulation,” in ICRA, 2023

[23] A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017

[24] J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps et al., “Genie: Generative interactive environments,” in Forty-first International Conference on Machine Learning, 2024

[25] Q. Bu, J. Cai, L. Chen, X. Cui, Y. Ding, S. Feng, X. He, X. Huang et al., “Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025

[26] S. Gao, S. Zhou, Y. Du, J. Zhang, and C. Gan, “Adaworld: Learning adaptable world models with latent actions,” in International Conference on Machine Learning (ICML), 2025

[27] J. Yang, Y. Shi, H. Zhu, M. Liu, K. Ma, Y. Wang, G. Wu, T. He, and L. Wang, “Como: Learning continuous latent motion from internet videos for scalable robot learning,” arXiv preprint arXiv:2505.17006, 2025

[28] A. Nikulin, I. Zisman, D. Tarasov, N. Lyubaykin, A. Polubarov, I. Kiselev, and V. Kurenkov, “Latent action learning requires supervision in the presence of distractors,” in International Conference on Machine Learning (ICML), 2025

[29] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013

[30] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “Film: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

[31] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738

[32] L. Yang, Z. Zhang, Z. Zhang, X. Liu, M. Xu, W. Zhang, C. Meng, S. Ermon, and B. Cui, “Consistency flow matching: Defining straight flows with velocity consistency,” arXiv preprint arXiv:2407.02398, 2024

[33] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations,” arXiv preprint arXiv:1709.10087, 2017

[34] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine, “Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,” in Conference on robot learning. PMLR, 2020, pp. 1094–1100

[35] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2012, pp. 5026–5033