pith. machine review for the scientific record.

arxiv: 2604.17880 · v1 · submitted 2026-04-20 · 💻 cs.RO · cs.CV

Recognition: unknown

ST-π: Structured SpatioTemporal VLA for Robotic Manipulation

Pith reviewed 2026-05-10 04:42 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords vision-language-action, robotic manipulation, spatiotemporal reasoning, structured planning, VLA models, chunk-level prompts, action generation

The pith

ST-π structures VLA models so a VLM generates chunk-level spatiotemporal prompts that condition an action expert to refine step-level robotic controls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing vision-language-action models embed spatiotemporal knowledge implicitly inside visual and action representations, then map directly to step-level predictions, which makes it hard to manage multiple sequential behaviors that have clear spatial and temporal boundaries. ST-π introduces an explicit split: a spatiotemporal VLM encodes 4D observations and task instructions, then uses an LLM to output a sequence of causally ordered chunk-level action prompts that name sub-tasks together with their spatial and temporal grounding. These prompts condition a spatiotemporal action expert whose structured dual-generator guidance jointly captures spatial dependencies and temporal causality to produce the actual step-level action parameters. A sympathetic reader would care because the separation turns hidden reasoning into an inspectable plan that can be refined locally, potentially letting robots execute coordinated manipulation sequences more reliably than implicit cross-modal methods allow.

Core claim

The paper proposes ST-π, a structured spatiotemporal VLA model in which the VLM explicitly plans global spatiotemporal behavior by generating causally ordered chunk-level action prompts consisting of sub-tasks, spatial grounding, and temporal grounding, while the action expert further refines local spatiotemporal control through structured dual-generator guidance that jointly models spatial dependencies and temporal causality to predict step-level action parameters.
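
As a rough illustration of the plan-then-refine split described above, the sketch below models a chunk-level prompt as a small record and runs a toy executor over a plan. The field names, the 3D-point spatial grounding, the step-index temporal grounding, and the linear-interpolation `refine` stand-in are all assumptions made for illustration; the abstract does not specify these formats, and nothing here reproduces the authors' VLM or dual-generator.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class ChunkPrompt:
    """One chunk-level action prompt (hypothetical field names)."""
    sub_task: str              # e.g. "reach the cup"
    spatial: Point             # assumed: a 3D target point
    temporal: Tuple[int, int]  # assumed: [start, end) step indices


def is_causally_ordered(plan: List[ChunkPrompt]) -> bool:
    """True if each chunk has positive length and starts no earlier
    than the previous chunk ends."""
    prev_end = 0
    for chunk in plan:
        start, end = chunk.temporal
        if start < prev_end or end <= start:
            return False
        prev_end = end
    return True


def refine(chunk: ChunkPrompt, current: Point) -> List[Point]:
    """Stand-in for the action expert: linear interpolation from the
    current position to the chunk's spatial target, one point per step."""
    start, end = chunk.temporal
    n = end - start
    return [
        tuple(c + (t - c) * (k + 1) / n for c, t in zip(current, chunk.spatial))
        for k in range(n)
    ]


def execute(plan: List[ChunkPrompt], start_pos: Point) -> List[Point]:
    """Global plan first, local refinement per chunk second."""
    if not is_causally_ordered(plan):
        raise ValueError("plan is not causally ordered")
    pos, actions = start_pos, []
    for chunk in plan:
        steps = refine(chunk, pos)
        actions.extend(steps)
        pos = steps[-1]
    return actions


plan = [
    ChunkPrompt("reach the cup", (0.4, 0.0, 0.2), (0, 4)),
    ChunkPrompt("lift the cup", (0.4, 0.0, 0.5), (4, 6)),
]
actions = execute(plan, (0.0, 0.0, 0.2))
```

The point of the sketch is only that the plan is an inspectable object: it can be validated (here, for temporal ordering) before any step-level action is produced.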

What carries the argument

The structured dual-generator guidance in the action expert, conditioned on chunk-level prompts from the spatiotemporal VLM, that jointly models spatial dependencies and temporal causality for step-level action prediction.

If this is right

  • The explicit chunk-level prompts allow the model to handle multiple sequential behaviors that have clear spatiotemporal boundaries.
  • Joint modeling of spatial dependencies and temporal causality in the action expert improves the accuracy of step-level action parameters.
  • Training on the proposed real-world dataset with structured spatiotemporal annotations supports effective fine-tuning for manipulation tasks.
  • Global planning by the VLM combined with local refinement by the action expert produces more coherent long-horizon robot behavior than implicit methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The chunk prompts could serve as an interpretable interface for debugging or human oversight of robot plans.
  • The same split between global prompt generation and local action refinement might transfer to other sequential control domains such as navigation or assembly.
  • Ablation studies that disable either the spatial or temporal component of the dual-generator would reveal which aspect drives the reported gains on timed tasks.
  • If the LLM prompt quality varies with task length, performance on longer sequences could be used to test the limits of the structured approach.

Load-bearing premise

The premise is that an LLM can reliably generate accurate, causally ordered chunk-level action prompts (sub-tasks, spatial grounding, and temporal grounding) from 4D observations and instructions, and that these prompts give the dual-generator enough conditioning to produce correct step-level actions.

What would settle it

Compare performance on complex sequential manipulation tasks when the chunk-level prompts are replaced by direct step-level prediction; a large drop in success rate on tasks with explicit temporal boundaries would support the claim, while no difference would falsify it.
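
One way such a comparison could be scored is sketched below: per-trial success outcomes for the two conditions are reduced to success rates and a pooled two-proportion z statistic. The function names are hypothetical and the outcome lists are made-up placeholders chosen only to exercise the code; they are not results from the paper.

```python
import math
from typing import List


def success_rate(outcomes: List[int]) -> float:
    """Fraction of trials marked as success (1) rather than failure (0)."""
    return sum(outcomes) / len(outcomes)


def two_proportion_z(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic for rate_a - rate_b."""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se if se > 0 else 0.0


# Placeholder outcomes (NOT from the paper): 20 trials per condition.
with_prompts = [1] * 16 + [0] * 4      # chunk-level prompts enabled
direct_steps = [1] * 10 + [0] * 10     # direct step-level prediction

rate_with = success_rate(with_prompts)
rate_direct = success_rate(direct_steps)
z = two_proportion_z(sum(with_prompts), len(with_prompts),
                     sum(direct_steps), len(direct_steps))
```

A real evaluation would also need matched tasks, seeds, and enough trials per task for the statistic to be meaningful.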

Figures

Figures reproduced from arXiv: 2604.17880 by Chuanhao Ma, Hanyu Zhou, Luxin Yan, Shihan Peng, Tao Gu, Yan Li.

Figure 1: Illustration of typical VLA paradigms for spatiotemporal robotic manipulation. (a) OpenVLA operates on single-frame …
Figure 2: Overview of ST-π. Our model consists of two key components: (1) ST-VLM, which constructs 4D observation representations and performs structured task decomposition; and (2) ST-AE, which generates executable action chunks conditioned on chunk-level action prompts for each sub-task, ensuring stable and coherent trajectories.
Figure 3: Architecture of ST-VLM. The ST-VLM takes 4D representations, language instructions, and query tokens as inputs to …
Figure 4: Overview of structured task decomposition with chunk-level action prompt. A long-horizon task (a) is decomposed …
Figure 5: Architecture of ST-AE. The ST-AE takes chunk-level …
Figure 6: Illustration of our real-world robotic platform and STAR dataset. Each task in the STAR dataset is segmented into …
Figure 7: Real-world performance comparison. We evaluate different VLA methods on real-world tasks with increasing …
Figure 8: Trajectory comparison of different motion generator variants. (a) Task illustration. (b) Only the spatial motion …
read the original abstract

Vision-language-action (VLA) models have achieved great success on general robotic tasks, but still face challenges in fine-grained spatiotemporal manipulation. Typically, existing methods mainly embed spatiotemporal knowledge into visual and action representations, and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST-$\pi$, a structured spatiotemporal VLA model for robotic manipulation. Our model is guided by two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces, and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts consisting of sub-tasks, spatial grounding and temporal grounding. 2) Spatiotemporal action expert. Conditioned on chunk-level action prompts, we design a structured dual-generator guidance to jointly model spatial dependencies and temporal causality, thus predicting step-level action parameters. Within this structured framework, the VLM explicitly plans global spatiotemporal behavior, and the action expert further refines local spatiotemporal control. In addition, we propose a real-world robotic dataset with structured spatiotemporal annotations for fine-tuning. Extensive experiments have been conducted to demonstrate the effectiveness of our model. Our code link: https://github.com/chuanhaoma/ST-pi.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ST-π, a structured spatiotemporal VLA model for robotic manipulation. It consists of (1) a Spatiotemporal VLM that encodes 4D observations and task instructions into latent spaces and feeds them to an LLM to produce a sequence of causally ordered chunk-level action prompts (sub-tasks plus spatial and temporal grounding), and (2) a Spatiotemporal action expert that conditions a dual-generator on these prompts to jointly model spatial dependencies and temporal causality for step-level action prediction. The authors introduce a new real-world dataset with structured spatiotemporal annotations and state that extensive experiments demonstrate the model's effectiveness.

Significance. If the empirical claims hold, the explicit separation of global spatiotemporal planning (via LLM-generated chunk prompts) from local control (via the dual-generator action expert) could meaningfully advance VLA models on fine-grained sequential manipulation tasks where implicit cross-modal mapping has been insufficient. The release of a structured annotation dataset would also be a useful community resource.

major comments (3)
  1. Abstract: the statement that 'extensive experiments have been conducted to demonstrate the effectiveness of our model' is unsupported by any reported metrics, baseline comparisons, ablation results, or error analysis. Without these data, the central claim that the structured VLM-plus-action-expert separation improves spatiotemporal manipulation cannot be evaluated.
  2. Abstract (Spatiotemporal VLM description): the assertion that the VLM 'explicitly plans global spatiotemporal behavior' rests on the unverified assumption that the LLM reliably produces accurate, causally ordered chunk-level prompts containing correct sub-task decomposition, spatial grounding, and temporal grounding. No independent metrics on prompt fidelity (e.g., grounding precision, ordering error rate) are supplied; if prompt generation is noisy, any observed gains could be attributable to the action expert or dataset alone rather than the proposed structure.
  3. Abstract (Spatiotemporal action expert): the 'structured dual-generator guidance' for jointly modeling spatial dependencies and temporal causality is described at a high level only, with no equations, architectural diagrams, or loss-function details. This makes it impossible to determine whether the claimed joint modeling is actually implemented in a way that differs from standard conditioning or whether it is load-bearing for the reported performance.
minor comments (2)
  1. The abstract mentions a code link but provides no details on the dataset size, task suite, or evaluation protocol; adding one sentence summarizing these would improve readability.
  2. Notation such as '4D observations' and 'chunk-level' is used without an initial definition; a brief parenthetical clarification in the abstract would help readers unfamiliar with the sub-field.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thoughtful and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we indicate the revisions we plan to make in the updated version.

read point-by-point responses
  1. Referee: Abstract: the statement that 'extensive experiments have been conducted to demonstrate the effectiveness of our model' is unsupported by any reported metrics, baseline comparisons, ablation results, or error analysis. Without these data, the central claim that the structured VLM-plus-action-expert separation improves spatiotemporal manipulation cannot be evaluated.

    Authors: We appreciate this observation. While the full manuscript includes comprehensive experimental results with quantitative metrics, baseline comparisons (e.g., against standard VLA models), ablation studies on the spatiotemporal components, and error analysis in Sections 4 and 5, the abstract does not summarize these findings. To address the referee's concern and strengthen the abstract, we will revise it to include key performance highlights, such as success rate improvements on the real-world dataset and comparisons showing the benefits of the structured approach. revision: yes

  2. Referee: Abstract (Spatiotemporal VLM description): the assertion that the VLM 'explicitly plans global spatiotemporal behavior' rests on the unverified assumption that the LLM reliably produces accurate, causally ordered chunk-level prompts containing correct sub-task decomposition, spatial grounding, and temporal grounding. No independent metrics on prompt fidelity (e.g., grounding precision, ordering error rate) are supplied; if prompt generation is noisy, any observed gains could be attributable to the action expert or dataset alone rather than the proposed structure.

    Authors: We acknowledge the importance of verifying the quality of the generated chunk-level prompts. The design of the Spatiotemporal VLM incorporates mechanisms to promote causal ordering and accurate grounding through the encoding of 4D observations and structured prompting of the LLM. Nevertheless, to provide direct validation and rule out alternative explanations for performance gains, we will add an evaluation of prompt fidelity in the revised manuscript. This will include metrics such as sub-task decomposition accuracy, spatial grounding precision, and temporal ordering error rates, assessed via human annotation on a subset of the data. revision: yes

  3. Referee: Abstract (Spatiotemporal action expert): the 'structured dual-generator guidance' for jointly modeling spatial dependencies and temporal causality is described at a high level only, with no equations, architectural diagrams, or loss-function details. This makes it impossible to determine whether the claimed joint modeling is actually implemented in a way that differs from standard conditioning or whether it is load-bearing for the reported performance.

    Authors: The abstract offers a concise description of the Spatiotemporal action expert. However, the full manuscript provides the necessary details: an architectural diagram in Figure 2 illustrating the dual-generator structure, mathematical formulations for the spatial and temporal generators in Section 3.2, and the combined loss function in Equation (4) that enforces joint modeling of dependencies and causality. To improve clarity, we will revise the abstract to briefly reference these elements and emphasize how the dual-generator differs from standard conditioning by explicitly separating and jointly optimizing spatial and temporal aspects. revision: partial
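
The prompt-fidelity metrics named in the second exchange (spatial grounding precision, temporal ordering error) could be computed along the following lines. This is a minimal sketch under assumptions the source does not confirm: spatial grounding is taken to be a 2D box `(x_min, y_min, x_max, y_max)`, temporal grounding a `(start, end)` interval, and ordering error the fraction of sub-task pairs whose relative order is flipped against a human reference. All function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def interval_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def ordering_error_rate(predicted: List[str], reference: List[str]) -> float:
    """Fraction of sub-task pairs ordered differently from the reference.

    Assumes both lists contain the same unique sub-task labels.
    """
    rank = {name: i for i, name in enumerate(predicted)}
    pairs = list(combinations(reference, 2))
    if not pairs:
        return 0.0
    flipped = sum(1 for first, second in pairs if rank[first] > rank[second])
    return flipped / len(pairs)


# Toy inputs for illustration only.
t_iou = interval_iou((0, 4), (2, 6))
s_iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
err = ordering_error_rate(["reach", "lift", "grasp"],
                          ["reach", "grasp", "lift"])
```

Reporting these separately from task success is what would let a reader tell noisy prompts apart from a weak action expert.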

Circularity Check

0 steps flagged

No circularity: architectural proposal without derivation chain

full rationale

The paper describes ST-π as a new VLA architecture consisting of a Spatiotemporal VLM that encodes 4D observations into LLM-generated chunk-level prompts (sub-tasks + spatial/temporal grounding) and a dual-generator action expert conditioned on those prompts. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim is an explicit separation of global planning from local control, presented as a design choice supported by a new dataset and experiments rather than any reduction of outputs to inputs by construction. This is a standard architectural proposal whose validity is empirical, not tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard transformer and LLM components plus the assumption that explicit chunk-level spatiotemporal annotations can be reliably produced and used; no new physical entities or ad-hoc constants are introduced in the abstract.

axioms (2)
  • domain assumption: LLMs can produce causally ordered sequences of sub-task, spatial, and temporal grounding from 4D visual observations and language instructions.
    Invoked in the description of the Spatiotemporal VLM component.
  • domain assumption: A dual-generator network can jointly model spatial dependencies and temporal causality when conditioned on chunk-level prompts.
    Invoked in the description of the Spatiotemporal action expert.

pith-pipeline@v0.9.0 · 5553 in / 1406 out tokens · 43021 ms · 2026-05-10T04:42:37.380990+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
