EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

Chushan Zhang; Hongdong Li; Jinguang Tong; Ruihan Lu; Xuesong Li; Yikai Wang

arxiv: 2605.21862 · v1 · pith:SCYU6FMRnew · submitted 2026-05-21 · 💻 cs.RO · cs.AI

Pith reviewed 2026-05-22 06:16 UTC · model grok-4.3

keywords: vision-language-action · robot control · scene belief · recurrent prefix · chunked policies · geometric alignment · action decoder

The pith

Maintaining an action-updated scene state across control chunks improves performance in vision-language-action robot policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that chunked vision-language-action policies suffer when actions alter geometry before the next visual observation arrives. It introduces a recurrent scene prefix that carries a geometry-aware state forward, which the action decoder refreshes with a compact update after each chunk. At the next call the VLM merges this prior with fresh visual input to generate both the next actions and a corrected scene state. Auxiliary scene prediction and geometric alignment modules guide training but are removed at deployment. If the mechanism works, each decision begins from a scene belief that already reflects recent actions plus the latest evidence.

Core claim

EvoScene-VLA carries a geometry-aware scene state across chunks through a recurrent prefix. The action decoder outputs both the next action chunk and a compact scene update; that update becomes the prior for the following call, where the VLM corrects it against the new observation. On 31 RoboTwin tasks this yields average success rates of 89.1 percent in fixed evaluation and 88.5 percent in randomized evaluation, and the paper also reports stronger real-robot results.

What carries the argument

The recurrent scene prefix that carries a geometry-aware scene state across chunks, refreshed by the action decoder's compact scene output and corrected by the VLM against each new observation.
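
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `observe`, `vlm_fuse`, `decode`, and `execute` are hypothetical callables, the scene state is reduced to a plain list of floats, and the toy stubs at the bottom exist only to make the control flow runnable.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class StepOutput:
    actions: List[Vector]  # one chunk of low-level actions
    scene_update: Vector   # compact scene state emitted alongside the actions


def run_chunked_control(
    observe: Callable[[], Vector],
    vlm_fuse: Callable[[Vector, Vector], Vector],
    decode: Callable[[Vector], StepOutput],
    execute: Callable[[List[Vector]], None],
    initial_prior: Vector,
    num_chunks: int,
) -> List[Vector]:
    """Run `num_chunks` control calls and return the prior produced by each call."""
    prior = initial_prior
    priors: List[Vector] = []
    for _ in range(num_chunks):
        obs = observe()                # fresh visual evidence
        belief = vlm_fuse(obs, prior)  # correct the carried prior against it
        out = decode(belief)           # actions and scene update from one pass
        execute(out.actions)           # the scene changes while the chunk runs
        prior = out.scene_update       # action-updated prior for the next call
        priors.append(prior)
    return priors


# --- Toy stubs (purely illustrative; every rule below is an assumption) ---
def toy_rollout() -> List[Vector]:
    executed: List[List[Vector]] = []
    return run_chunked_control(
        observe=lambda: [1.0],
        vlm_fuse=lambda obs, prior: [0.5 * (o + p) for o, p in zip(obs, prior)],
        decode=lambda b: StepOutput(actions=[[b[0]]], scene_update=[b[0] + 1.0]),
        execute=executed.append,
        initial_prior=[0.0],
        num_chunks=2,
    )
```

The point of the sketch is the data dependency: the only state that survives between calls is `prior`, and it is overwritten by the decoder's output rather than by the raw observation.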

If this is right

Each control call begins with a scene prior that already incorporates action-induced changes since the previous observation.
The policy produces both actions and an updated scene state in a single decoder pass.
Training with future scene targets and 3D geometric anchors improves the internal state without changing the deployed model.
Gains hold in both fixed and randomized evaluation settings as well as on physical robot hardware.
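
One way to picture the training-only supervision is as auxiliary terms added to the action loss. The expression below is an assumed form, not taken from the paper: the loss names $\mathcal{L}_{\text{scene}}$ (future scene-token targets from the Scene Predictor) and $\mathcal{L}_{\text{geo}}$ (alignment with the frozen depth and 3D teachers), and the weights $\lambda_s$ and $\lambda_g$, are hypothetical labels chosen for illustration.

```latex
\mathcal{L}_{\text{train}}
  = \mathcal{L}_{\text{action}}
  + \lambda_{s}\,\mathcal{L}_{\text{scene}}
  + \lambda_{g}\,\mathcal{L}_{\text{geo}},
\qquad \lambda_{s},\ \lambda_{g} \ge 0 .
```

Under this reading, the auxiliary terms only shape the shared parameters during training, so removing the modules that compute them leaves the deployed forward pass unchanged.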

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same recurrent update pattern could support longer action horizons where visual updates become even less frequent.
Comparable belief evolution inside the decoder might benefit other partially observable control problems such as navigation under occlusion.
The separation of training-only geometric teachers suggests the method can be adapted to new vision-language models by swapping only the VLM backbone.

Load-bearing premise

The compact scene update produced by the action decoder captures the essential geometry changes induced by actions so that the VLM can reliably correct it against the next observation.

What would settle it

Ablating the scene update output and measuring whether success rates on the 31 RoboTwin tasks return to the 87.2 percent fixed and 86.1 percent randomized baselines of standard chunked VLA policies.

Figures

Figures reproduced from arXiv: 2605.21862 by Chushan Zhang, Hongdong Li, Jinguang Tong, Ruihan Lu, Xuesong Li, Yikai Wang.

**Figure 1.** Qualitative analysis of chunked-control failures. During one action chunk, the scene can drift from the planning observation, so an action-only baseline commits to a stale target. In three RoboTwin tasks (microwave handle, block stacking, cabinet placement), the baseline follows this stale target and stalls mid-chunk (×). EvoScene-VLA carries an action-updated scene prior into the chunk, so each predicted …

**Figure 2.** Pipeline Overview. The policy receives multi-view images, an instruction, and the robot state. The VLM prefix contains image and language tokens, per-view observation slots, and recurrent prior slots. An asymmetric attention mask lets observation slots read the current views and lets prior slots absorb this evidence while preserving the pretrained image and language pathways. During training, Geometric Anc…

**Figure 3.** Trajectory plots in 3D space.

**Figure 4.** Real-robot platform and experimental setup.

**Figure 5.** Additional 3D end-effector trajectory comparisons on four RoboTwin tasks, complementing …
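
The asymmetric attention mask mentioned in the Figure 2 caption can be illustrated with a small boolean matrix. The caption is truncated, so the visibility rules below are one plausible reading and should be treated as assumptions: the token ordering, the group sizes, and the choice to let prior slots see everything are all hypothetical, and the per-view structure of the observation slots is collapsed into a single group for brevity.

```python
from typing import List


def build_asymmetric_mask(n_ctx: int, n_obs: int, n_prior: int) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed token order: [image+language context | observation slots | prior slots].
    """
    n = n_ctx + n_obs + n_prior
    obs_start = n_ctx
    prior_start = n_ctx + n_obs
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < obs_start:
                # Context tokens never see the new slots, so the pretrained
                # image and language pathway is left as it was.
                mask[q][k] = k < obs_start
            elif q < prior_start:
                # Observation slots read the current views (the context) and
                # each other, but not the carried prior.
                mask[q][k] = k < prior_start
            else:
                # Prior slots absorb evidence from everything before them
                # and from each other (assumed).
                mask[q][k] = True
    return mask
```

The asymmetry is visible in the first block of rows: information flows from the pretrained tokens into the slots, never the other way round.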

Original abstract

Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

EvoScene-VLA adds a recurrent action-updated scene prefix to chunked VLAs, with small, consistent gains on RoboTwin and on a real robot, but thin evidence on whether the compact update actually carries usable geometry.

The main takeaway is that this paper tries to fix a practical gap in chunked VLA policies: actions change the scene through contact and motion before the next visual observation arrives, yet most methods start each chunk from scratch or just stack past frames. EvoScene-VLA keeps a recurrent scene prefix that the action decoder updates with a compact output, then feeds that prior back so the VLM can correct it against the fresh image at the next call. That design is not in the spatial or temporal baselines they cite, and the training uses a Scene Predictor plus Geometric Anchor for supervision before dropping them at deployment.

On 31 RoboTwin tasks the success rates move from 87.2% to 89.1% fixed and from 86.1% to 88.5% randomized, with better real-robot results on the Galaxea R1-Lite robot. Those numbers are consistent across settings, which is worth noting for anyone running chunked policies in manipulation. The gains are modest, though, and the abstract gives no error bars, no statistical tests, and no ablations that isolate the scene-update component. The central assumption, that the decoder's compact output reliably encodes the geometry changes the VLM needs to correct, remains plausible but untested in the summary. If the update is lossy on occlusion or contact details, error would accumulate exactly where the method claims to help.

This is incremental work aimed at people already building or extending chunked VLAs for real robots. It is not reorganizing the field, but the mechanism is concrete enough that a referee could check the representation details, the loss terms, and whether the gains survive proper ablations. I would send it to review rather than desk-reject, mainly to see the full implementation and the missing controls.

Referee Report

2 major / 2 minor

Summary. The paper introduces EvoScene-VLA, a chunked VLA policy that maintains a recurrent scene prefix carrying geometry-aware state across control calls. At each VLM invocation the model fuses the current observation with the prior from the previous chunk; the action decoder then emits both the next action chunk and a compact scene update that becomes the subsequent prior. Scene Predictor and Geometric Anchor provide training supervision only and are dropped at inference. Empirical results on 31 RoboTwin tasks show average success rising from 87.2 % to 89.1 % (fixed) and from 86.1 % to 88.5 % (randomized), with additional outperformance on a real Galaxea R1-Lite robot.

Significance. If the recurrent scene-update mechanism reliably encodes action-induced geometric changes, the approach would supply a lightweight way to maintain temporal scene consistency inside existing chunked VLA pipelines without requiring full 3-D reconstruction at deployment. The reported gains on a sizable task suite and real-robot transfer constitute a concrete empirical contribution, though the lack of error bars, statistical tests, and targeted ablations on the scene-update component leaves the magnitude and source of improvement open to further verification.

major comments (2)

Abstract and method description of the recurrent scene prefix: the central claim that the compact scene update produced by the action decoder supplies a prior that meaningfully reflects contact, occlusion, and motion effects rests on an unverified assumption. No explicit 3-D consistency loss or geometric regularizer is enforced at inference, and the paper provides no ablation that isolates the contribution of this update versus other architectural choices. If the learned update is lossy, error accumulation across chunks would undermine the reported gains over standard chunked VLA baselines.
Results section reporting RoboTwin success rates: the improvements (87.2 % → 89.1 % fixed; 86.1 % → 88.5 % randomized) are presented without error bars, confidence intervals, or statistical significance tests. In the absence of these, it is impossible to determine whether the observed deltas exceed run-to-run variance or are attributable to the scene-evolution mechanism rather than hyper-parameter or implementation differences.
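
A back-of-envelope check makes the second major comment concrete. The sketch below computes the standard error of a difference between two success rates, treating every episode as an independent Bernoulli trial. The episode count is a hypothetical placeholder, since the abstract does not state how many episodes were run per task, and the independence assumption ignores both task-level heterogeneity and any pairing of seeds across methods.

```python
import math


def diff_standard_error(p_a: float, p_b: float, n_a: int, n_b: int) -> float:
    """Standard error of (p_b - p_a) for two independent binomial proportions."""
    return math.sqrt(p_a * (1.0 - p_a) / n_a + p_b * (1.0 - p_b) / n_b)


def z_score(p_a: float, p_b: float, n_a: int, n_b: int) -> float:
    """Observed difference expressed in units of its standard error."""
    return (p_b - p_a) / diff_standard_error(p_a, p_b, n_a, n_b)


# Reported fixed-evaluation averages from the abstract.
baseline_rate, method_rate = 0.872, 0.891

# HYPOTHETICAL: 31 tasks x 100 episodes each. The real count is not given.
episodes = 31 * 100

z_fixed = z_score(baseline_rate, method_rate, episodes, episodes)
print(f"z = {z_fixed:.2f} with {episodes} assumed episodes per method")
```

The same 1.9-point gap shrinks to well under one standard error if the assumed episode count is ten times smaller, which is why the protocol details matter as much as the headline numbers.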

minor comments (2)

The abstract and method sections should clarify the dimensionality and representation of the compact scene update (e.g., token count, embedding size) so readers can assess its information capacity relative to the VLM context length.
Figure captions and experimental protocol should explicitly state the number of evaluation seeds and whether the same random seeds were used across all compared methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and will incorporate revisions to improve clarity and rigor.

Referee: Abstract and method description of the recurrent scene prefix: the central claim that the compact scene update produced by the action decoder supplies a prior that meaningfully reflects contact, occlusion, and motion effects rests on an unverified assumption. No explicit 3-D consistency loss or geometric regularizer is enforced at inference, and the paper provides no ablation that isolates the contribution of this update versus other architectural choices. If the learned update is lossy, error accumulation across chunks would undermine the reported gains over standard chunked VLA baselines.

Authors: We acknowledge the absence of an explicit ablation isolating the recurrent scene update. The training objective does include the Scene Predictor (future scene-token targets) and Geometric Anchor (alignment with frozen depth and 3D teachers), which provide geometric supervision to the scene update produced by the action decoder. At inference the VLM fuses the current observation with this prior at every chunk, supplying a corrective signal that limits unchecked accumulation. Nevertheless, we agree that a targeted ablation is needed to quantify the update's isolated contribution. In the revised manuscript we will add an ablation that disables the recurrent prior (replacing it with a zero or static initialization) while keeping all other components fixed, and report the resulting success rates on the same 31 RoboTwin tasks.

revision: yes

Referee: Results section reporting RoboTwin success rates: the improvements (87.2 % → 89.1 % fixed; 86.1 % → 88.5 % randomized) are presented without error bars, confidence intervals, or statistical significance tests. In the absence of these, it is impossible to determine whether the observed deltas exceed run-to-run variance or are attributable to the scene-evolution mechanism rather than hyper-parameter or implementation differences.

Authors: We agree that variability measures and statistical tests are required to substantiate the reported gains. In the revised manuscript we will rerun the full evaluation suite across five random seeds, report mean success rates together with standard deviations, and include paired t-test p-values comparing EvoScene-VLA against each baseline. These additions will allow readers to assess whether the 1.9 pp and 2.4 pp improvements exceed typical run-to-run variance.

revision: yes
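
The seed-level analysis promised in this response could look roughly like the sketch below, which computes per-method means, standard deviations, and a paired t statistic using only the standard library. The per-seed numbers are placeholders invented to exercise the code and are not results from the paper; the function name `paired_t_statistic` is likewise hypothetical.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(baseline: Sequence[float], method: Sequence[float]) -> float:
    """Paired t statistic for per-seed scores; degrees of freedom = n - 1."""
    if len(baseline) != len(method) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 seeds")
    diffs = [m - b for b, m in zip(baseline, method)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 divisor)
    if sd == 0.0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# PLACEHOLDER per-seed success rates (percent), five seeds per method.
baseline_runs = [70.0, 72.0, 71.0, 73.0, 74.0]
method_runs = [72.0, 73.0, 73.0, 76.0, 76.0]

t_stat = paired_t_statistic(baseline_runs, method_runs)
mean_gain = statistics.mean(method_runs) - statistics.mean(baseline_runs)

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776
print(f"mean gain = {mean_gain:.1f} pp, t = {t_stat:.2f}, "
      f"significant at 5%: {abs(t_stat) > T_CRIT_DF4}")
```

Pairing by seed only makes sense if both methods were evaluated on the same seeds, which is the point raised in the second minor comment. With SciPy available, `scipy.stats.ttest_rel` returns the same statistic along with a p-value.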

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains rest on independent evaluation, not self-referential definitions or fits

The paper introduces EvoScene-VLA as an architectural extension to chunked VLAs, using a recurrent scene prefix updated inside the action decoder and corrected by the VLM against new observations. All reported results consist of direct empirical success-rate comparisons on 31 RoboTwin tasks (fixed and randomized) plus real-robot trials against baselines. No equations, uniqueness theorems, or parameter-fitting steps are described that would make any performance claim reduce by construction to the method's own inputs or to a self-citation chain. Auxiliary training modules (Scene Predictor, Geometric Anchor) are explicitly discarded at deployment, so the inference-time behavior and measured gains remain externally falsifiable on the benchmark tasks. This is a standard empirical ML paper whose derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the recurrent scene prefix and the transfer from training-only Scene Predictor and Geometric Anchor modules to deployment without them.

axioms (1)

Domain assumption: The VLM can usefully combine the current visual observation with the action-updated scene prior from the previous chunk.
Invoked in the description of each VLM call that merges observation and prior.

invented entities (1)

Recurrent scene prefix · no independent evidence
purpose: Carries geometry-aware scene state across control chunks
Introduced as the core architectural addition; no independent evidence supplied beyond the reported performance lift.

pith-pipeline@v0.9.0 · 5830 in / 1361 out tokens · 44653 ms · 2026-05-22T06:16:16.244342+00:00 · methodology

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 18 internal anchors

[1]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Zero-shot robotic manipulation with pretrained image-editing diffusion models

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[3]

RT-2: Vision-language-action models transfer web knowledge to robotic control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), 2023

work page 2023
[4]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Chilam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. GR-2: A generative video- language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems (RSS), 2023

work page 2023
[6]

Embodied-SlotSSM

Nhat Chung, Taisei Hanyu, Toan Nguyen, Huy Le, Frederick Bumgarner, Duy Minh Ho Nguyen, Khoa V o, Kashu Yamazaki, Chase Rainwater, Tung Kieu, Anh Nguyen, and Ngan Le. Rethinking progression of memory state in robotic manipulation: An object-centric perspective. arXiv preprint arXiv:2511.11478, 2025. Method named “Embodied-SlotSSM” inside the paper

work page arXiv 2025
[7]

Tenenbaum, Dale Schuurmans, and Pieter Abbeel

Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InInternational Conference on Machine Learning (ICML), 2023

work page 2023
[8]

RVT: Robotic view transformer for 3d object manipulation

Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. RVT: Robotic view transformer for 3d object manipulation. InConference on Robot Learning (CoRL), 2023

work page 2023
[9]

RVT-2: Learning precise manipulation from few demonstrations.arXiv preprint arXiv:2406.08545, 2024

Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. RVT-2: Learning precise manipulation from few demonstrations.arXiv preprint arXiv:2406.08545, 2024

work page arXiv 2024
[10]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Galaxea open-world dataset and G0 dual-system VLA model.arXiv preprint arXiv:2509.00576, 2025

Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and G0 dual-system VLA model.arXiv preprint arXiv:2509.00576, 2025. URL https://arxiv.org/abs/2509. 00576.https://github.com/OpenGalaxea/GalaxeaVLA

work page arXiv 2025
[13]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, et al. OpenVLA: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Tenenbaum

Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B. Tenenbaum. Learning to act from actionless videos through dense correspondences. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[15]

HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, and Jinwoo Shin. HAMLET: Switch your vision-language-action model into a history-aware policy.arXiv preprint arXiv:2510.00695, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

MASt3R: Grounding image matching in 3d.arXiv preprint arXiv:2406.09756, 2024

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with MASt3R.arXiv preprint arXiv:2406.09756, 2024. 10

work page arXiv 2024
[17]

Spatial forcing: Implicit spatial representation alignment for vision- language-action model.arXiv preprint arXiv:2510.12276, 2025

Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision- language-action model.arXiv preprint arXiv:2510.12276, 2025

work page arXiv 2025
[18]

3DS-VLA: A 3d spatial-aware vision language action model for robust multi-task manipulation

Xiaoqi Li, Liang Heng, Jiaming Liu, Yan Shen, Chenyang Gu, Zhuoyang Liu, Hao Chen, Nuowei Han, Renrui Zhang, Hao Tang, Shanghang Zhang, and Hao Dong. 3DS-VLA: A 3d spatial-aware vision language action model for robust multi-task manipulation. InConference on Robot Learning (CoRL), 2025

work page 2025
[19]

QDepth-VLA: Quantized depth prediction as auxiliary supervision for vision-language-action models.arXiv preprint arXiv:2510.14836, 2025

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, and Dongbin Zhao. QDepth-VLA: Quantized depth prediction as auxiliary supervision for vision-language-action models.arXiv preprint arXiv:2510.14836, 2025

work page arXiv 2025
[20]

HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

Minghui Lin, Pengxiang Ding, Shu Wang, Zifeng Zhuang, Yang Liu, Xinyang Tong, Wenxuan Song, Shangke Lyu, Siteng Huang, and Donglin Wang. HiF-VLA: Hindsight, insight and foresight through motion representation for vision-language-action models.arXiv preprint arXiv:2512.09928, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[22]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[24]

Mani- Gaussian: Dynamic gaussian splatting for multi-task robotic manipulation.arXiv preprint arXiv:2403.08321, 2024

Guanxing Lu, Shiyi Zhang, Ziwei Wang, Changliu Liu, Jiwen Lu, and Yansong Tang. Mani- Gaussian: Dynamic gaussian splatting for multi-task robotic manipulation.arXiv preprint arXiv:2403.08321, 2024

work page arXiv 2024
[25]

RoboTwin: Dual-arm robot benchmark with generative digital twins.arXiv preprint arXiv:2409.02920, 2024

Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. RoboTwin: Dual-arm robot benchmark with generative digital twins.arXiv preprint arXiv:2409.02920, 2024

work page arXiv 2024
[26]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation, June 2024

Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, and Jeannette Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation.arXiv preprint arXiv:2405.07503, 2024

work page arXiv 2024
[28]

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Delin Qu et al. SpatialVLA: Exploring spatial representations for visual-language-action models. arXiv preprint arXiv:2501.15830, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

LingBot-Depth: Masked depth modeling for spatial perception

Robbyant Team. LingBot-Depth: Masked depth modeling for spatial perception. https: //huggingface.co/robbyant/lingbot-depth, 2026. Open-source RGB-D representation model used as the frozen depth teacher

work page 2026
[30]

MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. MemoryVLA: Perceptual-cognitive memory in vision- language-action models for robotic manipulation.arXiv preprint arXiv:2508.19236, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Perceiver-actor: A multi-task transformer for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InConference on Robot Learning (CoRL), 2022

work page 2022
[32]

Masked depth modeling for spatial perception.arXiv preprint arXiv:2601.17895, 2026

Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, et al. Masked depth modeling for spatial perception.arXiv preprint arXiv:2601.17895, 2026. 11

work page arXiv 2026
[33]

HPT: Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers.arXiv preprint arXiv:2409.20537, 2024

Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. HPT: Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers.arXiv preprint arXiv:2409.20537, 2024

work page arXiv 2024
[34]

Efros, and Angjoo Kanazawa

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025

work page arXiv 2025
[35]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[36]

π3: Permutation-equivariant visual geometry learning

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-equivariant visual geometry learning. InInternational Conference on Learning Representations (ICLR), 2026

work page 2026
[37]

Any-point Trajectory Modeling for Policy Learning

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning.arXiv preprint arXiv:2401.00025, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation.arXiv preprint arXiv:2312.13139, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Day- Dreamer: World models for physical robot learning

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Day- Dreamer: World models for physical robot learning. InConference on Robot Learning (CoRL), 2022

work page 2022
[40]

A Pragmatic VLA Foundation Model

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, and Kecheng Zheng. LingBot-VLA: A pragmatic vla foundation model.arXiv preprint ar...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[41]

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin, Jingjing Qian, Jing Zhang, Yong Wu, and Xiaoyuan Yu. A V A-VLA: Improving vision-language-action models with active visual attention. arXiv preprint arXiv:2511.18960, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations.arXiv preprint arXiv:2403.03954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

UP- VLA: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. UP- VLA: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

work page arXiv 2025
[44]

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. TraceVLA: Visual trace prompting for generalist robot policies.arXiv preprint arXiv:2412.10345, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

FLARE: Robot Learning with Implicit World Modeling

Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, and Linxi Fan. FLARE: Robot learning with implicit world modeling.arXiv preprint arXiv:2505.15659, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

VLA-4D: Embedding 4d awareness into vision- language-action models for spatiotemporally coherent robotic manipulation.arXiv preprint arXiv:2511.17199, 2025

Hanyu Zhou, Chuanhao Ma, and Gim Hee Lee. VLA-4D: Embedding 4d awareness into vision- language-action models for spatiotemporally coherent robotic manipulation.arXiv preprint arXiv:2511.17199, 2025. 12 Overview This appendix supplements the main paper with implementation, evaluation, and discussion details. Appendix A reports the optimization recipe, hype...

work page arXiv 2025
