
arxiv: 2606.08288 · v1 · pith:VY6ISXWNnew · submitted 2026-06-06 · 💻 cs.RO

MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model

Pith reviewed 2026-06-27 19:23 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action · robot manipulation · motion history · trajectory tokens · long-horizon tasks · geometric consistency · VLA memory

The pith

Vision-language-action models improve robot control by remembering motion trajectories instead of raw past frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that adding more history to VLA models often fails because inconsistent evidence creates drift and fragmented cues during long tasks. It proposes representing a short video window as compact trajectory-field tokens that carry physically coherent motion evidence. Current visual tokens then query these tokens to pull relevant motion data into the action stream. If this holds, robot policies gain stability by treating memory as motion links rather than isolated frames. The approach yields smoother executions on simulation benchmarks and early real-robot tests.

Core claim

MotionVLA converts a short past-only video window into compact, time-continuous trajectory-field tokens that represent recent observations as physically coherent motion evidence rather than independent frames. Current visual tokens query this history to retrieve task-relevant motion information, which is recoupled into the VLA stream under trajectory-grounded supervision. This produces better long-horizon manipulation with smoother and more direct action sequences than simply supplying additional 4D context.

What carries the argument

The motion-history interface that converts recent video observations into trajectory-field tokens which visual tokens can query for motion-consistent control signals.
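
As a rough illustration of the query step described above, the following minimal sketch shows single-head cross-attention in which current visual tokens act as queries over a set of motion-history tokens. It is not the paper's implementation: the function names (`retrieve_motion`, `recouple`), the absence of learned projections, the residual gate, and the toy token values are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_motion(visual_tokens, motion_tokens):
    """Each visual token queries the motion tokens with scaled dot-product attention.

    Assumption: the motion tokens serve as both keys and values, with no
    learned projection matrices, to keep the sketch short.
    """
    dim = len(motion_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    retrieved = []
    for query in visual_tokens:
        weights = softmax([dot(query, key) * scale for key in motion_tokens])
        mixed = [
            sum(w * value[i] for w, value in zip(weights, motion_tokens))
            for i in range(dim)
        ]
        retrieved.append(mixed)
    return retrieved


def recouple(visual_tokens, retrieved, gate=0.5):
    """Hypothetical recoupling: add gated motion evidence back onto each visual token."""
    return [
        [v + gate * r for v, r in zip(vis, ret)]
        for vis, ret in zip(visual_tokens, retrieved)
    ]


# Toy example: two visual tokens, three motion-history tokens, dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
motion = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
retrieved = retrieve_motion(visual, motion)
fused = recouple(visual, retrieved)
```

In this toy setup each visual token draws most of its retrieved vector from the motion token it is best aligned with, which is the behavior the "query" framing relies on. A real model would use learned query, key, and value projections and multiple heads.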

If this is right

  • Long-horizon tasks become more reliable because motion evidence stays consistent across steps.
  • Action generation avoids fragmentation by retrieving motion links directly from the token stream.
  • Real-robot rollouts exhibit smoother paths without separate post-processing for continuity.
  • VLA memory design shifts from accumulating raw 4D data to exposing queryable motion evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token conversion could be tested on non-manipulation sequences where temporal coherence matters, such as navigation.
  • If trajectory tokens scale, other memory-heavy models might replace frame buffers with motion-derived representations.
  • Extending the window length while keeping tokens compact might reveal a practical limit on how much past motion can be compressed without loss.

Load-bearing premise

Short video windows can be turned into trajectory-field tokens that remain physically coherent and free of geometric drift when the model queries them.

What would settle it

A controlled test showing that replacing standard history frames with these trajectory-field tokens produces no gain or increases drift and instability on the same long-horizon manipulation benchmarks.

Figures

Figures reproduced from arXiv: 2606.08288 by Li Yu, Shanglin Yuan, Weiheng Zhao, Wei Sui, Wenyu Liu, Xianda Guo, Xinggang Wang.

Figure 1: Discrete 4D evidence can be fragmented, while trajectory fields provide a more consistent
Figure 2: Overview of MotionVLA. MotionVLA builds a past-only motion history from trajectory-field tokens. Current visual features query this memory via cross-attention to retrieve task-relevant motion tokens, which are then recoupled into the VLA stream. An auxiliary trajectory reconstruction head (top right) grounds the retrieved tokens in control-relevant dynamics.
Figure 3: Training efficiency and Stage-I sensitivity. On stack blocks two, MotionVLA reaches higher early success than the baseline, suggesting that motion-history tokens provide a useful optimization prior. Stage-I alignment peaks around 60k steps in this setting; insufficient alignment underuses trajectory-field features, while overly long Stage-I training can slightly hurt downstream adaptation. MotionVLA conver…
Figure 4: Training curves during motion-history interface alignment. We plot the action loss and the auxiliary trajectory-grounding loss used in Stage-I alignment. The auxiliary trajectory-grounding loss is used only to shape the retrieved motion-conditioned tokens during alignment. In Stage II, the auxiliary trajectory head is detached, and the downstream policy is optimized for action generation. This separation k…
Figure 5: Demonstrations of the six tasks.
Figure 6: Qualitative examples on blocks touching rgb. Success (top): the gripper touches the blocks in the required order. Failure (bottom): the gripper misses the red block and perturbs the green and blue blocks during contact.
Figure 7: LIBERO trajectory-field visualization. Consecutive frames illustrate temporally structured motion cues extracted from the past observation window.
Figure 8: Real-world trajectory-field visualization. Consecutive Agilex Piper observations show the trajectory-field cues used by the motion-history interface.
Figure 9: Real-world rollouts. Example rollouts on Agilex Piper for the preliminary real-world validation tasks.
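
The Figure 4 caption describes a two-stage schedule: an action loss plus an auxiliary trajectory-grounding loss during Stage-I alignment, then action-only optimization in Stage II once the auxiliary head is detached. A minimal sketch of that switch follows. The mean-squared-error form of both losses, the `aux_weight` value, and the function names are placeholders chosen for illustration; the page does not state the paper's actual loss definitions or weighting.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(stage, action_pred, action_target,
                  traj_pred=None, traj_target=None, aux_weight=0.1):
    """Combine losses according to the training stage.

    Stage 1: action loss + aux_weight * trajectory-grounding loss.
    Stage 2: action loss only (the auxiliary head is no longer used).
    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    action_loss = mse(action_pred, action_target)
    if stage == 1:
        if traj_pred is None or traj_target is None:
            raise ValueError("Stage 1 needs trajectory predictions and targets")
        return action_loss + aux_weight * mse(traj_pred, traj_target)
    if stage == 2:
        return action_loss
    raise ValueError("stage must be 1 or 2")


# Toy values: the same action error is scored under both stages.
stage_one = training_loss(1, [0.0, 1.0], [0.0, 0.0], traj_pred=[1.0], traj_target=[0.0])
stage_two = training_loss(2, [0.0, 1.0], [0.0, 0.0])
```

Keeping the auxiliary term out of Stage II means the downstream policy is judged only on action generation, which matches the separation the caption describes.
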
read the original abstract

Vision-language-action (VLA) models increasingly condition robot policies on history, depth, or 4D features to resolve ambiguity in long-horizon manipulation. However, more spatiotemporal evidence is not necessarily better: when the injected evidence is not motion-consistent, it can introduce geometric drift, fragmented temporal cues, and unstable action generation. This raises a simple question: should a VLA remember past frames, or remember the motion that connects them? We introduce MotionVLA, a motion-history interface that converts a short past-only video window into compact, time-continuous trajectory-field tokens. Instead of treating history as a sparse set of ndependently lifted frames, MotionVLA represents recent observations as physically coherent motion evidence. Current visual tokens query this history to retrieve task-relevant motion information, which is then recoupled into the VLA stream under trajectory-grounded supervision. Experiments across simulation benchmarks and preliminary real-robot rollouts show that MotionVLA improves long-horizon manipulation while producing smoother and more direct executions. These results suggest that effective VLA memory is not just about providing more 4D context, but about exposing motion-consistent evidence that is usable for control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces MotionVLA, a motion-history interface for vision-language-action (VLA) models. It converts a short past-only video window into compact, time-continuous trajectory-field tokens that represent physically coherent motion evidence. Current visual tokens query this history to retrieve task-relevant motion information, which is recoupled into the VLA stream under trajectory-grounded supervision. Experiments across simulation benchmarks and preliminary real-robot rollouts demonstrate improvements in long-horizon manipulation along with smoother and more direct executions, suggesting that effective VLA memory requires motion-consistent evidence rather than simply more 4D context.

Significance. If the empirical results hold under detailed scrutiny, the work provides a concrete demonstration that the form of injected history matters for VLA stability and performance. By showing measurable gains from trajectory-field tokens over standard history conditioning, it offers a falsifiable test of the premise that motion-consistent evidence reduces geometric drift and fragmented cues, which could inform future designs of memory mechanisms in robot policies.

minor comments (2)
  1. Abstract: 'ndependently' appears to be a typographical error and should read 'independently'.
  2. The abstract states that experiments show improvements but does not specify the exact metrics, baselines, or statistical details used to quantify 'smoother and more direct executions'; adding these would strengthen the presentation even if the full methods section contains them.
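
To make the second comment concrete, here is a sketch of two generic ways an end-effector path could be scored for directness and smoothness. These are common choices, not the metrics reported in the paper, which this page does not specify; the function names and the finite-difference jerk estimate are assumptions.

```python
import math


def path_directness(points):
    """Start-to-end distance divided by travelled path length (1.0 is perfectly direct)."""
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if travelled == 0.0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled


def mean_squared_jerk(points, dt):
    """Mean squared third finite difference of position; lower means smoother."""
    if len(points) < 4:
        raise ValueError("need at least four points to estimate jerk")
    total = 0.0
    count = 0
    for p0, p1, p2, p3 in zip(points, points[1:], points[2:], points[3:]):
        jerk = [(d - 3 * c + 3 * b - a) / dt ** 3 for a, b, c, d in zip(p0, p1, p2, p3)]
        total += sum(j * j for j in jerk)
        count += 1
    return total / count


# Toy 2D paths sampled at unit time steps.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
zigzag = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (3.0, 1.0)]
```

Reporting numbers of this kind for each method and baseline, with variance across seeds, would let a reader check the "smoother and more direct" claim directly.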

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive summary and the recommendation of minor revision. The report correctly identifies the central claim that motion-consistent trajectory-field tokens provide more usable history for VLA policies than additional raw spatiotemporal context. No major comments were listed in the report, so we have no point-by-point rebuttals to provide at this stage.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces MotionVLA as an empirical architecture that converts short video windows into trajectory-field tokens and reports measurable gains on simulation benchmarks plus real-robot rollouts. No derivation chain, equations, or first-principles claims are present in the provided text that reduce to fitted inputs, self-definitions, or self-citation load-bearing premises. The central suggestion—that motion-consistent evidence improves control—is framed as a testable empirical outcome rather than a tautological restatement of inputs. This matches the default expectation of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5751 in / 1039 out tokens · 23985 ms · 2026-06-27T19:23:36.306520+00:00 · methodology

