Action-to-Action Flow Matching
Pith reviewed 2026-05-16 06:50 UTC · model grok-4.3
The pith
Flow matching generates robot actions from prior states in one step
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By initializing the flow matching process with embedded historical proprioceptive sequences rather than random Gaussian noise, Action-to-Action flow matching (A2A) produces clean actions in a single step and better captures the robot's physical dynamics and temporal continuity.
What carries the argument
Action-to-Action flow matching (A2A), a policy that embeds historical proprioceptive action sequences into a high-dimensional latent space to serve as the informed starting point for flow-based action generation.
If this is right
- High-quality actions can be generated with minimal inference latency suitable for real-time control.
- Improved robustness to visual perturbations compared to standard methods.
- Enhanced generalization to unseen robot configurations.
- Versatility shown by extension to video generation tasks.
Where Pith is reading between the lines
- This approach may reduce the need for multiple denoising steps in other sequential prediction tasks beyond robotics.
- Integrating proprioceptive history could lead to more stable policies in dynamic environments where visual input is unreliable.
- Single-step generation might enable higher control frequencies in hardware-limited settings.
Load-bearing premise
Embedding historical proprioceptive sequences into a high-dimensional latent space provides an effective starting point that captures physical dynamics and temporal continuity without iterative denoising.
What would settle it
A direct comparison showing that multi-step denoising from random noise outperforms single-step A2A on standard robotic benchmarks would falsify the claim of superior performance.
read the original abstract
Diffusion-based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process. However, the standard practice of sampling from random Gaussian noise often requires multiple iterative steps to produce clean actions, leading to high inference latency that incurs a major bottleneck for real-time control. In this paper, we challenge the necessity of uninformed noise sampling and propose Action-to-Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by the previous proprioceptive action. Unlike existing methods that treat proprioceptive action feedback as static conditions, A2A leverages historical proprioceptive sequences, embedding them into a high-dimensional latent space as the starting point for action generation. This design bypasses costly iterative denoising while effectively capturing the robot's physical dynamics and temporal continuity. Extensive experiments demonstrate that A2A exhibits high training efficiency, fast inference speed, and improved generalization. Notably, A2A enables high-quality action generation in as few as a single inference step, and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations. Lastly, we also extend A2A to video generation, demonstrating its broader versatility in temporal modeling. Project site: https://lorenzo-0-0.github.io/A2A_Flow_Matching.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Action-to-Action Flow Matching (A2A), a flow-matching policy for robotics that initializes the generative process from an embedding of historical proprioceptive action sequences rather than random Gaussian noise. This design is claimed to enable high-quality action generation in a single inference step while capturing physical dynamics and temporal continuity, yielding superior robustness to visual perturbations, enhanced generalization to unseen configurations, high training efficiency, and fast inference. The approach is also extended to video generation.
Significance. If the single-step and robustness claims hold, A2A would meaningfully reduce inference latency for generative robotic policies, addressing a key deployment bottleneck. The use of proprioceptive history as a structured starting point and the extension to video generation indicate potential for broader temporal modeling tasks. Reproducibility is supported by the linked project site.
major comments (2)
- [§3 (Method) and Experiments] The central single-step claim (abstract and §3) rests on the assumption that the proprioceptive embedding produces a flow starting point close enough for one Euler step to reach the target action distribution. This implicitly requires that consecutive actions lie on approximately straight paths in the learned metric and that the embedding encodes sufficient dynamics; neither is isolated by ablations comparing the embedding against a standard flow-matching baseline with random initialization.
- [Abstract and §4 (Experiments)] The abstract asserts 'extensive experiments' with performance gains, robustness, and generalization but supplies no quantitative metrics, baselines, or tables. Without these data the superiority claims cannot be evaluated against the stress-test concern that a non-recurrent embedding may yield an uninformed start.
minor comments (2)
- [§3.2] Specify the exact architecture (recurrent or otherwise) and dimensionality of the proprioceptive embedding, and clarify how historical sequences are tokenized or aggregated.
- [Related Work] Add explicit comparison to other few-step or single-step flow-matching or consistency-model baselines in robotics.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that strengthening the isolation of the proprioceptive initialization and ensuring quantitative results are clearly presented will improve the manuscript. We address each major comment below and will incorporate revisions as noted.
read point-by-point responses
-
Referee: [§3 (Method) and Experiments] The central single-step claim (abstract and §3) rests on the assumption that the proprioceptive embedding produces a flow starting point close enough for one Euler step to reach the target action distribution. This implicitly requires that consecutive actions lie on approximately straight paths in the learned metric and that the embedding encodes sufficient dynamics; neither is isolated by ablations comparing the embedding against a standard flow-matching baseline with random initialization.
Authors: We agree that an explicit ablation isolating the effect of the proprioceptive embedding versus random initialization is necessary to rigorously support the single-step claim. In the revised manuscript we will add a new ablation in §4 that directly compares A2A (proprioceptive latent initialization) to an otherwise identical flow-matching baseline initialized from standard Gaussian noise. We will also add a short discussion of the learned metric and the degree to which consecutive actions follow approximately straight trajectories under the trained vector field. revision: yes
-
Referee: [Abstract and §4 (Experiments)] The abstract asserts 'extensive experiments' with performance gains, robustness, and generalization but supplies no quantitative metrics, baselines, or tables. Without these data the superiority claims cannot be evaluated against the stress-test concern that a non-recurrent embedding may yield an uninformed start.
Authors: We apologize for any lack of clarity in the reviewed version. Section 4 of the full manuscript already contains quantitative tables and figures comparing A2A against diffusion-policy and flow-matching baselines on success rate, inference latency, robustness to visual perturbations, and generalization to unseen configurations. In the revision we will (i) update the abstract to cite specific metrics (e.g., “single-step inference with 15% higher success rate and 8× lower latency”), (ii) ensure all tables are referenced from the abstract and introduction, and (iii) incorporate the new random-initialization ablation to directly address the concern that the embedding could be uninformed. revision: yes
Circularity Check
No significant circularity; single-step claim presented as empirical outcome of proprioceptive initialization in flow matching
full rationale
The provided abstract and description introduce A2A by replacing random Gaussian noise with an embedding of historical proprioceptive sequences as the flow starting point. No equations appear that reduce the single-inference-step result to a fitted parameter renamed as prediction or to a self-referential definition. The claim that the embedding captures physical dynamics and temporal continuity is stated as a design choice whose effectiveness is asserted via experiments, not derived tautologically from the method's own inputs. No self-citations, uniqueness theorems, or ansatzes smuggled via prior work are invoked in the text to force the architecture or outcome. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Proprioceptive action sequences can be embedded into a high-dimensional latent space that serves as a sufficient starting point for flow matching.
Forward citations
Cited by 3 Pith papers
-
Feedback World Model Enables Precise Guidance of Diffusion Policy
Feedback world model closes the prediction-observation loop at inference time to correct errors and improve diffusion policy performance under distribution shift in robotics.
-
FLASH: Efficient Visuomotor Policy via Sparse Sampling
FLASH Policy uses sparse Legendre polynomial trajectory fitting and history-anchored flow matching to enable single-step inference for visuomotor control, reporting 31.4 ms per-episode latency and >=92% success on fiv...
-
WarmPrior: Straightening Flow-Matching Policies with Temporal Priors
Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths and improves success rates in robotic manipulation and prior-space RL.
Reference graph
Works this paper leans on
-
[1]
A noise is worth diffusion guidance.arXiv preprint arXiv:2412.03895, 2024
Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, et al. A noise is worth diffusion guidance.arXiv preprint arXiv:2412.03895,
-
[2]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Rethinking video generation model for the embodied world.arXiv preprint arXiv:2601.15282, 2026
Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world.arXiv preprint arXiv:2601.15282,
-
[6]
Vita: Vision-to-action flow matching policy, 2026
Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, and Iman Soltani. VITA: Vision-to-action flow matching policy.arXiv preprint arXiv:2507.13231,
-
[7]
Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning.arXiv preprint arXiv:2504.18904, 2025a. Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaimi...
-
[8]
$\pi^{*}_{0.6}$: a VLA That Learns From Experience
Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al.π∗ 0.6: A VLA that learns from experience.arXiv preprint arXiv:2511.14759, 2025a. Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Causal World Modeling for Robot Control
Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Rectified Flow: A Marginal Preserving Approach to Optimal Transport
Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport.arXiv preprint arXiv:2209.14577,
work page internal anchor Pith review arXiv
-
[12]
One-step Latent-free Image Generation with Pixel Mean Flows
Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, and Kaiming He. One-step latent-free image generation with pixel mean flows.arXiv preprint arXiv:2601.22158,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations,
Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. ManiSkill: Generalizable manipulation skill benchmark with large-scale demonstrations.arXiv preprint arXiv:2107.14483,
-
[14]
Chaoyi Pan, Giri Anantharaman, Nai-Chieh Huang, Claire Jin, Daniel Pfrommer, Chenyang Yuan, Frank Permenter, Guannan Qu, Nicholas Boffi, Guanya Shi, et al. Much ado about noising: Dispelling the myths of generative robotic control.arXiv preprint arXiv:2512.01809,
-
[15]
Warm starts accelerate conditional diffusion.arXiv preprint arXiv:2507.09212,
Jonas Scholz and Richard E Turner. Warm starts accelerate conditional diffusion.arXiv preprint arXiv:2507.09212,
-
[16]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020a. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020b. Ashish Vaswani, Noam ...
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[17]
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning.arXiv preprint arXiv:2506.15799,
work page internal anchor Pith review arXiv
-
[18]
Dreamvla: a vision-language-action model dreamed with comprehen- sive world knowledge
Juntu Zhao, Wenbo Lu, Di Zhang, Yufeng Liu, Yushen Liang, Tianluo Zhang, Yifeng Cao, Junyuan Xie, Yingdong Hu, Shengjie Wang, et al. Do you need proprioceptive states in visuomotor policies?arXiv preprint arXiv:2509.18644, 2025a. 13 Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, e...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.