
arxiv: 2605.21862 · v1 · pith:SCYU6FMRnew · submitted 2026-05-21 · 💻 cs.RO · cs.AI

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

Pith reviewed 2026-05-22 06:16 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords vision-language-action · robot control · scene belief · recurrent prefix · chunked policies · geometric alignment · action decoder

The pith

Maintaining an action-updated scene state across control chunks improves performance in vision-language-action robot policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that chunked vision-language-action policies suffer when actions alter geometry before the next visual observation arrives. It introduces a recurrent scene prefix that carries a geometry-aware state forward, which the action decoder refreshes with a compact update after each chunk. At the next call the VLM merges this prior with fresh visual input to generate both the next actions and a corrected scene state. Auxiliary scene prediction and geometric alignment modules guide training but are removed at deployment. If the mechanism works, each decision begins from a scene belief that already reflects recent actions plus the latest evidence.
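The loop described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the names `run_chunked_control`, `vlm_fuse`, `decode`, and `ChunkOutput`, the list-of-floats scene state, and the toy 1-D world in `demo` are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class ChunkOutput:
    actions: List[Vector]  # the next action chunk
    scene_update: Vector   # compact scene update; becomes the next prior


def run_chunked_control(
    observe: Callable[[], Vector],
    vlm_fuse: Callable[[Vector, Vector], Vector],
    decode: Callable[[Vector], ChunkOutput],
    execute: Callable[[Vector], None],
    initial_prior: Vector,
    num_chunks: int,
) -> List[Vector]:
    """Run num_chunks control calls and return the prior used at each call."""
    prior = initial_prior
    priors_used = []
    for _ in range(num_chunks):
        priors_used.append(prior)
        obs = observe()                # fresh visual evidence
        belief = vlm_fuse(obs, prior)  # correct the prior against the observation
        out = decode(belief)           # one pass yields actions and a scene update
        for action in out.actions:     # no new observation arrives inside a chunk
            execute(action)
        prior = out.scene_update       # action-updated prior for the next call
    return priors_used


def demo():
    """Toy 1-D world (hypothetical): the scene state is one object position."""
    world = {"pos": 0.0}

    def observe():
        return [world["pos"]]

    def vlm_fuse(obs, prior):
        # Placeholder fusion: average the observation and the prior.
        return [0.5 * o + 0.5 * p for o, p in zip(obs, prior)]

    def decode(belief):
        actions = [[1.0], [1.0]]  # push the object by 1.0 twice per chunk
        predicted = [belief[0] + sum(a[0] for a in actions)]
        return ChunkOutput(actions=actions, scene_update=predicted)

    def execute(action):
        world["pos"] += action[0]

    priors = run_chunked_control(observe, vlm_fuse, decode, execute, [0.0], 3)
    return priors, world["pos"]
```

In this toy setting the decoder's update predicts the effect of its own actions, so each call starts from a prior that already matches the moved object. A real decoder's update would be learned and imperfect, which is why the fusion step against the new observation matters.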

Core claim

EvoScene-VLA carries a geometry-aware scene state across chunks through a recurrent prefix. The action decoder outputs both the next action chunk and a compact scene update; that update becomes the prior for the following step, which the VLM then corrects against the new observation. On 31 RoboTwin tasks the paper reports average success rising from 87.2 to 89.1 percent in fixed evaluation and from 86.1 to 88.5 percent in randomized evaluation, and it states that EvoScene-VLA outperforms all baselines on the Galaxea R1-Lite real robot.

What carries the argument

The recurrent scene prefix that carries a geometry-aware scene state across chunks, refreshed by the action decoder's compact scene output and corrected by the VLM against each new observation.

If this is right

  • Each control call begins with a scene prior that already incorporates action-induced changes since the previous observation.
  • The policy produces both actions and an updated scene state in a single decoder pass.
  • Training with future scene targets and 3D geometric anchors improves the internal state without changing the deployed model.
  • Gains hold in both fixed and randomized evaluation settings as well as on physical robot hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same recurrent update pattern could support longer action horizons where visual updates become even less frequent.
  • Comparable belief evolution inside the decoder might benefit other partially observable control problems such as navigation under occlusion.
  • The separation of training-only geometric teachers suggests the method can be adapted to new vision-language models by swapping only the VLM backbone.

Load-bearing premise

The compact scene update produced by the action decoder captures the essential geometry changes induced by actions so that the VLM can reliably correct it against the next observation.

What would settle it

Ablating the scene update output and measuring whether success rates on the 31 RoboTwin tasks return to the 87.2 percent fixed and 86.1 percent randomized baselines of standard chunked VLA policies.
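One way to read such an ablation is as the fraction of the reported gain that survives once the scene update is removed. The helper below is a hedged sketch: `gain_retained` and `reported_gain_pp` are hypothetical names, the baseline and full-model numbers are the ones quoted above, and no ablated result is supplied because the paper does not report one.

```python
# Average success rates (percent) quoted above: (baseline, EvoScene-VLA).
REPORTED = {"fixed": (87.2, 89.1), "randomized": (86.1, 88.5)}


def reported_gain_pp(setting: str) -> float:
    """Gain of the full model over the baseline, in percentage points."""
    baseline, full = REPORTED[setting]
    return round(full - baseline, 1)


def gain_retained(ablated: float, baseline: float, full: float) -> float:
    """Fraction of the full-model gain kept by an ablated variant.

    0.0 means the ablation falls back to the baseline (the scene update
    carried the gain); 1.0 means the ablation changes nothing.
    """
    if full == baseline:
        raise ValueError("full and baseline are equal; no gain to attribute")
    return (ablated - baseline) / (full - baseline)
```

A value near 0.0 would support the load-bearing premise. A value near 1.0 would suggest the improvement comes from other architectural or training choices.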

Figures

Figures reproduced from arXiv: 2605.21862 by Chushan Zhang, Hongdong Li, Jinguang Tong, Ruihan Lu, Xuesong Li, Yikai Wang.

Figure 1: Qualitative analysis of chunked-control failures. During one action chunk, the scene can drift from the planning observation, so an action-only baseline commits to a stale target. In three RoboTwin tasks (microwave handle, block stacking, cabinet placement), the baseline follows this stale target and stalls mid-chunk (×). EvoScene-VLA carries an action-updated scene prior into the chunk, so each predicted …
Figure 2: Pipeline Overview. The policy receives multi-view images, an instruction, and the robot state. The VLM prefix contains image and language tokens, per-view observation slots, and recurrent prior slots. An asymmetric attention mask lets observation slots read the current views and lets prior slots absorb this evidence while preserving the pretrained image and language pathways. During training, Geometric Anc…
Figure 3: Trajectory plots in 3D space.
Figure 4: Real-robot platform and experimental setup.
Figure 5: Additional 3D end-effector trajectory comparisons on four RoboTwin tasks, complementing …
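The asymmetric attention mask named in the Figure 2 caption can be illustrated with a small boolean matrix. The caption is truncated here, so the exact rules are an assumption: in this sketch, image and language tokens attend only to each other (leaving the pretrained pathway untouched), observation slots read those tokens, and prior slots read the observation slots. The function name and the single-block token layout are hypothetical.

```python
from typing import List


def build_asymmetric_mask(n_ctx: int, n_obs: int, n_prior: int) -> List[List[bool]]:
    """Return mask[q][k], True where query token q may attend to key token k.

    Assumed token layout: [image+language tokens | observation slots | prior slots].
    """
    n = n_ctx + n_obs + n_prior

    def group(i: int) -> str:
        if i < n_ctx:
            return "ctx"
        if i < n_ctx + n_obs:
            return "obs"
        return "prior"

    # Which key groups each query group may read (assumed rules, see lead-in).
    allowed = {
        "ctx": {"ctx"},             # pretrained pathway never sees the new slots
        "obs": {"ctx", "obs"},      # observation slots read the current views
        "prior": {"obs", "prior"},  # prior slots absorb the observation evidence
    }
    return [[group(k) in allowed[group(q)] for k in range(n)] for q in range(n)]
```

The mask is asymmetric because information flows from image tokens into the slots but never back: an observation slot can read an image token, while the image token cannot read the slot.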
Original abstract

Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EvoScene-VLA, a chunked VLA policy that maintains a recurrent scene prefix carrying geometry-aware state across control calls. At each VLM invocation the model fuses the current observation with the prior from the previous chunk; the action decoder then emits both the next action chunk and a compact scene update that becomes the subsequent prior. Scene Predictor and Geometric Anchor provide training supervision only and are dropped at inference. Empirical results on 31 RoboTwin tasks show average success rising from 87.2 % to 89.1 % (fixed) and from 86.1 % to 88.5 % (randomized), with additional outperformance on a real Galaxea R1-Lite robot.

Significance. If the recurrent scene-update mechanism reliably encodes action-induced geometric changes, the approach would supply a lightweight way to maintain temporal scene consistency inside existing chunked VLA pipelines without requiring full 3-D reconstruction at deployment. The reported gains on a sizable task suite and real-robot transfer constitute a concrete empirical contribution, though the lack of error bars, statistical tests, and targeted ablations on the scene-update component leaves the magnitude and source of improvement open to further verification.

major comments (2)
  1. Abstract and method description of the recurrent scene prefix: the central claim that the compact scene update produced by the action decoder supplies a prior that meaningfully reflects contact, occlusion, and motion effects rests on an unverified assumption. No explicit 3-D consistency loss or geometric regularizer is enforced at inference, and the paper provides no ablation that isolates the contribution of this update versus other architectural choices. If the learned update is lossy, error accumulation across chunks would undermine the reported gains over standard chunked VLA baselines.
  2. Results section reporting RoboTwin success rates: the improvements (87.2 % → 89.1 % fixed; 86.1 % → 88.5 % randomized) are presented without error bars, confidence intervals, or statistical significance tests. In the absence of these, it is impossible to determine whether the observed deltas exceed run-to-run variance or are attributable to the scene-evolution mechanism rather than hyper-parameter or implementation differences.
minor comments (2)
  1. The abstract and method sections should clarify the dimensionality and representation of the compact scene update (e.g., token count, embedding size) so readers can assess its information capacity relative to the VLM context length.
  2. Figure captions and experimental protocol should explicitly state the number of evaluation seeds and whether the same random seeds were used across all compared methods.
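To make the error-bar concern concrete, a binomial confidence interval shows how wide the uncertainty on a single success rate can be. The sketch below computes a Wilson score interval using only the standard library. The function name is hypothetical, and any trial counts passed to it are the reader's assumption, since the number of evaluation episodes is not given in this summary.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to roughly 95 percent coverage.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half
```

Comparing the interval width at a plausible episode count with the reported gains of about two percentage points indicates whether a single evaluation run could resolve a difference that small.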

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and will incorporate revisions to improve clarity and rigor.

point-by-point responses
  1. Referee: Abstract and method description of the recurrent scene prefix: the central claim that the compact scene update produced by the action decoder supplies a prior that meaningfully reflects contact, occlusion, and motion effects rests on an unverified assumption. No explicit 3-D consistency loss or geometric regularizer is enforced at inference, and the paper provides no ablation that isolates the contribution of this update versus other architectural choices. If the learned update is lossy, error accumulation across chunks would undermine the reported gains over standard chunked VLA baselines.

    Authors: We acknowledge the absence of an explicit ablation isolating the recurrent scene update. The training objective does include the Scene Predictor (future scene-token targets) and Geometric Anchor (alignment with frozen depth and 3D teachers), which provide geometric supervision to the scene update produced by the action decoder. At inference the VLM fuses the current observation with this prior at every chunk, supplying a corrective signal that limits unchecked accumulation. Nevertheless, we agree that a targeted ablation is needed to quantify the update's isolated contribution. In the revised manuscript we will add an ablation that disables the recurrent prior (replacing it with a zero or static initialization) while keeping all other components fixed, and report the resulting success rates on the same 31 RoboTwin tasks. revision: yes

  2. Referee: Results section reporting RoboTwin success rates: the improvements (87.2 % → 89.1 % fixed; 86.1 % → 88.5 % randomized) are presented without error bars, confidence intervals, or statistical significance tests. In the absence of these, it is impossible to determine whether the observed deltas exceed run-to-run variance or are attributable to the scene-evolution mechanism rather than hyper-parameter or implementation differences.

    Authors: We agree that variability measures and statistical tests are required to substantiate the reported gains. In the revised manuscript we will rerun the full evaluation suite across five random seeds, report mean success rates together with standard deviations, and include paired t-test p-values comparing EvoScene-VLA against each baseline. These additions will allow readers to assess whether the 1.9 pp and 2.4 pp improvements exceed typical run-to-run variance. revision: yes
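The paired comparison promised in the second response can be sketched with the standard library. The function name and argument layout below are assumptions for illustration: each list is expected to hold one success rate per seed, with the same seeds in the same order for both methods. No per-seed numbers are shown because none are reported.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(method: Sequence[float], baseline: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for a paired t-test.

    method[i] and baseline[i] must come from the same seed.
    """
    if len(method) != len(baseline):
        raise ValueError("both methods need one value per shared seed")
    if len(method) < 2:
        raise ValueError("at least two seeds are required")
    diffs = [m - b for m, b in zip(method, baseline)]
    spread = stdev(diffs)  # sample standard deviation of per-seed differences
    if spread == 0.0:
        raise ValueError("per-seed differences have zero variance")
    t = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

Turning the statistic into a p-value requires the Student t distribution; if SciPy is available, `scipy.stats.ttest_rel` performs the whole test in one call. With five seeds the test has only four degrees of freedom, so fairly consistent per-seed differences are needed before a gain of about two percentage points reaches significance.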

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains rest on independent evaluation, not self-referential definitions or fits

full rationale

The paper introduces EvoScene-VLA as an architectural extension to chunked VLAs, using a recurrent scene prefix updated inside the action decoder and corrected by the VLM against new observations. All reported results consist of direct empirical success-rate comparisons on 31 RoboTwin tasks (fixed and randomized) plus real-robot trials against baselines. No equations, uniqueness theorems, or parameter-fitting steps are described that would make any performance claim reduce by construction to the method's own inputs or to a self-citation chain. Auxiliary training modules (Scene Predictor, Geometric Anchor) are explicitly discarded at deployment, so the inference-time behavior and measured gains remain externally falsifiable on the benchmark tasks. This is a standard empirical ML paper whose derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the recurrent scene prefix and the transfer from training-only Scene Predictor and Geometric Anchor modules to deployment without them.

axioms (1)
  • domain assumption: The VLM can usefully combine current visual observation with the action-updated scene prior from the previous chunk.
    Invoked in the description of each VLM call that merges observation and prior.
invented entities (1)
  • Recurrent scene prefix (no independent evidence)
    purpose: Carries geometry-aware scene state across control chunks
    Introduced as the core architectural addition; no independent evidence supplied beyond the reported performance lift.

pith-pipeline@v0.9.0 · 5830 in / 1361 out tokens · 44653 ms · 2026-05-22T06:16:16.244342+00:00 · methodology

