DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-14 22:56 UTC · model grok-4.3
The pith
Discrete flow matching lets VLA models iteratively refine every action token instead of locking in early errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DFM-VLA models a token-level probability velocity field using discrete flow matching that dynamically updates the full action sequence across refinement iterations. Two formulations—an auxiliary velocity head and an action-embedding-guided version—are combined with a two-stage decoder consisting of iterative refinement followed by deterministic validation, allowing early token errors to be corrected rather than fixed permanently.
What carries the argument
Token-level probability velocity field constructed via discrete flow matching, which drives iterative updates to the entire action token sequence.
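As a rough sketch of what one refinement step could look like (shapes, names, and the update rule are illustrative assumptions, not the paper's implementation), each iteration nudges the per-position token distribution along a velocity field and then re-decodes every position, so earlier tokens remain revisable:

```python
import numpy as np

def refine_step(probs, velocity, step=0.5):
    """One illustrative refinement iteration: move the per-position token
    distribution along a (hypothetical) velocity field, renormalize, and
    re-decode every position so earlier tokens can still change."""
    updated = np.clip(probs + step * velocity, 1e-9, None)
    updated /= updated.sum(axis=-1, keepdims=True)  # back to a distribution
    return updated.argmax(axis=-1), updated          # greedy re-decode

# toy example: 4 action tokens over a vocabulary of 3
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=4)       # initial per-token distributions
velocity = rng.normal(scale=0.1, size=(4, 3))   # stand-in for a learned field
tokens = probs.argmax(axis=-1)
for _ in range(3):                               # iterative refinement loop
    tokens, probs = refine_step(probs, velocity)
```

The key property this sketch captures is that every position is re-decoded on every iteration, in contrast to autoregressive decoding where a token, once emitted, is frozen.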
If this is right
- Action sequences become correctable after initial generation, reducing the impact of early mistakes in long-horizon tasks.
- The same velocity-field approach can be built with either an auxiliary head or embedding guidance while preserving performance gains.
- Inference stays efficient because refinement iterations replace the need for full diffusion sampling.
- Two-stage decoding (refinement plus validation) produces stable outputs without sacrificing speed.
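The two-stage structure described above can be sketched as follows (a minimal sketch assuming some `refine` update is available; this is not the paper's exact procedure):

```python
import numpy as np

def two_stage_decode(probs, refine, num_iters=5):
    """Illustrative two-stage decoder: stage 1 runs iterative refinement of
    the full sequence distribution; stage 2 is a deterministic validation
    pass (here, a plain argmax) that freezes the final output."""
    for _ in range(num_iters):          # stage 1: iterative refinement
        probs = refine(probs)
    return probs.argmax(axis=-1)        # stage 2: deterministic validation

# usage with a trivial stand-in refine that sharpens each distribution
sharpen = lambda p: (p ** 2) / (p ** 2).sum(axis=-1, keepdims=True)
probs = np.array([[0.4, 0.35, 0.25], [0.2, 0.5, 0.3]])
actions = two_stage_decode(probs, sharpen)  # → array([0, 1])
```

Separating a stochastic refinement phase from a deterministic final pass is what gives stable outputs: the last stage cannot introduce new randomness.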
Where Pith is reading between the lines
- The iterative correction mechanism could transfer to other discrete-sequence tasks such as language generation or planning.
- If the velocity field can be conditioned on real-time sensor feedback, closed-loop error recovery might become feasible during execution.
- Extending the same refinement loop to continuous action representations would test whether the flow-matching principle generalizes beyond tokens.
Load-bearing premise
The learned velocity field reliably corrects early token errors over iterations without creating new instabilities that the final validation stage cannot remove.
What would settle it
Running DFM-VLA on CALVIN while artificially injecting uncorrectable early token errors and observing no gain over autoregressive baselines would falsify the refinement benefit.
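A hypothetical harness for that falsification test (function names and the vocabulary size are placeholders, not CALVIN's API) could corrupt the earliest action tokens before refinement begins and then check whether the refinement loop recovers them:

```python
import random

def inject_early_errors(tokens, k=2, vocab_size=256, seed=0):
    """Corrupt the first k action tokens with random replacements,
    simulating early token errors for the falsification experiment."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    for i in range(min(k, len(corrupted))):
        corrupted[i] = rng.randrange(vocab_size)
    return corrupted

seq = [10, 20, 30, 40]
bad = inject_early_errors(seq)   # first two tokens randomized, rest intact
```

If success rates under this corruption match autoregressive baselines, the refinement mechanism is not doing corrective work.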
Original abstract
Vision-Language-Action (VLA) models that encode actions using a discrete tokenization scheme are increasingly adopted for robotic manipulation, but existing decoding paradigms remain fundamentally limited. Whether actions are decoded sequentially by autoregressive VLAs or in parallel by discrete diffusion VLAs, once a token is generated, it is typically fixed and cannot be revised in subsequent iterations, so early token errors cannot be effectively corrected later. We propose DFM-VLA, a discrete flow matching VLA for iterative refinement of action tokens. DFM-VLA models a token-level probability velocity field that dynamically updates the full action sequence across refinement iterations. We investigate two ways to construct the velocity field: an auxiliary velocity-head formulation and an action-embedding-guided formulation. Our framework further adopts a two-stage decoding strategy with an iterative refinement stage followed by deterministic validation for stable convergence. Extensive experiments on CALVIN, LIBERO, and real-world manipulation tasks show that DFM-VLA consistently outperforms strong autoregressive, discrete diffusion, and continuous diffusion baselines in manipulation performance while retaining high inference efficiency. In particular, DFM-VLA achieves an average success length of 4.44 on CALVIN and an average success rate of 95.7% on LIBERO, highlighting the value of action refinement via discrete flow matching for robotic manipulation. Our project is available at https://chris1220313648.github.io/DFM-VLA/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DFM-VLA, a discrete flow matching VLA model for robotic manipulation that learns a token-level probability velocity field to enable iterative refinement of action token sequences. It introduces two formulations (auxiliary velocity head and action-embedding-guided) and a two-stage decoding process (iterative refinement followed by deterministic validation). Experiments on CALVIN, LIBERO, and real-world tasks report consistent outperformance over autoregressive, discrete diffusion, and continuous diffusion baselines, with specific gains of 4.44 average success length on CALVIN and 95.7% success rate on LIBERO, while preserving inference efficiency.
Significance. If the velocity field is shown to drive corrective refinement rather than merely enabling a stronger direct mapping, the work would meaningfully advance VLA decoding by addressing the fixed-token limitation of autoregressive and diffusion-based approaches. The reported benchmark gains and efficiency retention indicate practical relevance for manipulation tasks, and the availability of a project page with implied code supports reproducibility.
Major comments (2)
- [Abstract and §3] Abstract and §3 (method): the central claim that the learned token-level probability velocity field corrects early token errors across iterations is not supported by any reported loss terms, training details, token-change statistics, or per-iteration error-reduction diagnostics. Without these, the performance gains on CALVIN and LIBERO could equally be attributed to the validation stage or architectural differences alone.
- [§4] §4 (experiments): the manuscript provides no ablation isolating the iterative refinement mechanism (e.g., comparing single-pass vs. multi-iteration performance with the velocity field disabled), which is load-bearing for distinguishing the proposed contribution from stronger direct prediction baselines.
Minor comments (2)
- [§3] Notation for the velocity field (probability velocity vs. token update rule) should be clarified with an explicit equation relating the auxiliary head output to the sequence update step.
- [§4] Figure captions and axis labels in the experimental results should explicitly state the number of refinement iterations used for the reported metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript accordingly to strengthen the empirical support for the iterative refinement claims.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (method): the central claim that the learned token-level probability velocity field corrects early token errors across iterations is not supported by any reported loss terms, training details, token-change statistics, or per-iteration error-reduction diagnostics. Without these, the performance gains on CALVIN and LIBERO could equally be attributed to the validation stage or architectural differences alone.
Authors: We agree that the current manuscript lacks explicit diagnostics to isolate the corrective effect of the velocity field. In the revision we will expand §3 with the full training objective (including the velocity-matching loss), report per-iteration token-flip statistics on held-out sequences, and add plots of cumulative error reduction across refinement steps. These additions will clarify that gains are not solely due to the validation stage. revision: yes
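The promised per-iteration token-flip diagnostic is straightforward to compute; a minimal sketch, assuming decoded token sequences are logged at each refinement iteration:

```python
def token_flip_stats(iterates):
    """Fraction of positions that change between consecutive refinement
    iterations; a decreasing sequence suggests convergent correction."""
    flips = []
    for prev, curr in zip(iterates, iterates[1:]):
        changed = sum(a != b for a, b in zip(prev, curr))
        flips.append(changed / len(prev))
    return flips

# e.g. three logged iterations of a 4-token action sequence
stats = token_flip_stats([[1, 2, 3, 0], [1, 2, 0, 0], [1, 2, 0, 0]])  # → [0.25, 0.0]
```

Pairing these flip rates with per-iteration task error would directly separate the velocity field's corrective effect from the validation stage.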
-
Referee: [§4] §4 (experiments): the manuscript provides no ablation isolating the iterative refinement mechanism (e.g., comparing single-pass vs. multi-iteration performance with the velocity field disabled), which is load-bearing for distinguishing the proposed contribution from stronger direct prediction baselines.
Authors: We acknowledge this ablation is necessary. We will add a controlled comparison in the revised §4: a single-pass baseline that uses the same architecture but disables the velocity field after the first iteration, versus the full multi-iteration DFM-VLA. Results will be reported on both CALVIN and LIBERO to quantify the incremental benefit of iterative refinement. revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents DFM-VLA as a new discrete flow matching formulation for iterative token refinement in VLA models, with two explicit constructions for the velocity field (auxiliary head and embedding-guided) plus a two-stage decoding process. Performance claims rest on empirical results from CALVIN (4.44 success length) and LIBERO (95.7% success rate) against autoregressive, discrete diffusion, and continuous diffusion baselines. No equations, loss terms, or central claims reduce by construction to fitted parameters or self-citations; the velocity field is defined and trained as an independent modeling choice, and the iterative correction behavior is asserted via architecture rather than tautological redefinition. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: standard assumptions in discrete flow matching and VLA training hold.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
DFM-VLA models a token-level probability velocity field that dynamically updates the full action sequence across refinement iterations... ui_t(·|x_t, x_1) is the velocity field, a conditional rate function that governs the flow of probability
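For context, this is the generic discrete flow matching formulation (standard in the literature, not necessarily the paper's exact notation): the conditional velocity $u_t(\cdot \mid x_t, x_1)$ is a rate function whose marginalization generates the probability path via the Kolmogorov (master) equation:

```latex
\frac{\mathrm{d}}{\mathrm{d}t}\, p_t(x)
  = \sum_{x' \neq x} \Big[\, u_t(x \mid x')\, p_t(x') \;-\; u_t(x' \mid x)\, p_t(x) \,\Big]
```

Sampling discretizes this flow: at each refinement step every token position may jump to a new value at a probability governed by its rate, which is what lets earlier tokens keep changing.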
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors
Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...