EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
Pith reviewed 2026-05-22 06:16 UTC · model grok-4.3
The pith
Maintaining an action-updated scene state across control chunks improves performance in vision-language-action robot policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoScene-VLA carries a geometry-aware scene state across chunks through a recurrent prefix. The action decoder outputs both the next action chunk and a compact scene update; that update becomes the prior for the following step, and the VLM corrects it against the new observation. The paper reports average success rates of 89.1 percent (fixed) and 88.5 percent (randomized) on 31 RoboTwin tasks, and states that the method outperforms all baselines on a real robot.
What carries the argument
The recurrent scene prefix that carries a geometry-aware scene state across chunks, refreshed by the action decoder's compact scene output and corrected by the VLM against each new observation.
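The loop this describes can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `vlm_correct`, `decode`, the blending weight `trust_obs`, and the three-number scene vector are hypothetical stand-ins for learned components.

```python
from typing import List, Tuple

Scene = List[float]          # hypothetical compact scene state
ActionChunk = List[float]    # hypothetical chunk of low-level actions


def vlm_correct(prior: Scene, observation: Scene, trust_obs: float = 0.7) -> Scene:
    """Stand-in for the VLM call: blend the action-updated prior with the new observation."""
    return [trust_obs * o + (1.0 - trust_obs) * p for p, o in zip(prior, observation)]


def decode(scene: Scene, chunk_len: int = 4) -> Tuple[ActionChunk, Scene]:
    """Stand-in for the action decoder: one pass returns an action chunk and a scene update."""
    # Toy policy (assumption): move the first scene coordinate toward zero in equal steps.
    step = -scene[0] / chunk_len
    actions = [step] * chunk_len
    # Toy scene update: where the scene is expected to be after the chunk executes.
    predicted = [scene[0] + sum(actions)] + scene[1:]
    return actions, predicted


def run_episode(observations: List[Scene], initial_prior: Scene) -> List[ActionChunk]:
    """One VLM call per observation; the scene update is carried to the next call."""
    prior = initial_prior
    chunks = []
    for obs in observations:
        scene = vlm_correct(prior, obs)   # correct the prior against fresh evidence
        actions, prior = decode(scene)    # the scene update becomes the next prior
        chunks.append(actions)
    return chunks


chunks = run_episode([[1.0, 0.0, 0.0], [0.5, 0.0, 0.0]], initial_prior=[0.0, 0.0, 0.0])
```

The point of the sketch is the data flow: `prior` is never rebuilt from scratch, it is overwritten by the decoder's own output and then corrected at the next call.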
If this is right
- Each control call begins with a scene prior that already incorporates action-induced changes since the previous observation.
- The policy produces both actions and an updated scene state in a single decoder pass.
- Training with future scene targets and 3D geometric anchors improves the internal state without changing the deployed model.
- Gains hold in both fixed and randomized evaluation settings as well as on physical robot hardware.
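One way to read the training-only supervision is as a composite objective in which only the action term has a counterpart at deployment. The additive form and the weights $\lambda_{\text{pred}}$ and $\lambda_{\text{geo}}$ below are assumptions for illustration; the abstract names the two auxiliary modules but does not state the loss.

```latex
\mathcal{L}_{\text{train}}
  = \mathcal{L}_{\text{action}}
  + \lambda_{\text{pred}}\,\mathcal{L}_{\text{pred}}
  + \lambda_{\text{geo}}\,\mathcal{L}_{\text{geo}}
```

Here $\mathcal{L}_{\text{pred}}$ would compare the decoder's scene update with the future scene-token targets from Scene Predictor, and $\mathcal{L}_{\text{geo}}$ would align scene slots with the frozen depth and 3D teachers from Geometric Anchor. Dropping both modules at deployment removes the two auxiliary terms without touching the policy's forward pass.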
Where Pith is reading between the lines
- The same recurrent update pattern could support longer action horizons where visual updates become even less frequent.
- Comparable belief evolution inside the decoder might benefit other partially observable control problems such as navigation under occlusion.
- The separation of training-only geometric teachers suggests the method can be adapted to new vision-language models by swapping only the VLM backbone.
Load-bearing premise
The compact scene update produced by the action decoder captures the essential geometry changes induced by actions so that the VLM can reliably correct it against the next observation.
What would settle it
Ablating the scene update output and measuring whether success rates on the 31 RoboTwin tasks return to the 87.2 percent fixed and 86.1 percent randomized baselines of standard chunked VLA policies.
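A minimal sketch of how such an ablation could be wired, assuming the prior is a plain vector and using the two replacements named in the simulated rebuttal (zero or static initialization). The function name `select_prior` and the mode strings are hypothetical.

```python
from typing import List, Optional

Scene = List[float]  # hypothetical compact scene state


def select_prior(mode: str, prev_update: Optional[Scene], static_init: Scene) -> Scene:
    """Choose the scene prior fed to the next VLM call under each ablation arm."""
    if mode == "recurrent":
        # Full method: reuse the decoder's last scene update when one exists.
        return list(prev_update) if prev_update is not None else list(static_init)
    if mode == "static":
        # Ablation: always restart from the same fixed initialization.
        return list(static_init)
    if mode == "zero":
        # Ablation: feed an all-zero prior of the same size.
        return [0.0] * len(static_init)
    raise ValueError(f"unknown ablation mode: {mode}")
```

Keeping every other component fixed and switching only `mode` is what would let a change in success rate be attributed to the recurrent update itself.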
Original abstract
Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EvoScene-VLA, a chunked VLA policy that maintains a recurrent scene prefix carrying geometry-aware state across control calls. At each VLM invocation the model fuses the current observation with the prior from the previous chunk; the action decoder then emits both the next action chunk and a compact scene update that becomes the subsequent prior. Scene Predictor and Geometric Anchor provide training supervision only and are dropped at inference. Empirical results on 31 RoboTwin tasks show average success rising from 87.2 % to 89.1 % (fixed) and from 86.1 % to 88.5 % (randomized), with additional outperformance on a real Galaxea R1-Lite robot.
Significance. If the recurrent scene-update mechanism reliably encodes action-induced geometric changes, the approach would supply a lightweight way to maintain temporal scene consistency inside existing chunked VLA pipelines without requiring full 3-D reconstruction at deployment. The reported gains on a sizable task suite and real-robot transfer constitute a concrete empirical contribution, though the lack of error bars, statistical tests, and targeted ablations on the scene-update component leaves the magnitude and source of improvement open to further verification.
major comments (2)
- Abstract and method description of the recurrent scene prefix: the central claim that the compact scene update produced by the action decoder supplies a prior that meaningfully reflects contact, occlusion, and motion effects rests on an unverified assumption. No explicit 3-D consistency loss or geometric regularizer is enforced at inference, and the paper provides no ablation that isolates the contribution of this update versus other architectural choices. If the learned update is lossy, error accumulation across chunks would undermine the reported gains over standard chunked VLA baselines.
- Results section reporting RoboTwin success rates: the improvements (87.2 % → 89.1 % fixed; 86.1 % → 88.5 % randomized) are presented without error bars, confidence intervals, or statistical significance tests. In the absence of these, it is impossible to determine whether the observed deltas exceed run-to-run variance or are attributable to the scene-evolution mechanism rather than hyper-parameter or implementation differences.
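To make the concern about missing confidence intervals concrete, the sketch below computes a Wilson score interval for each reported success rate. The trial count is an assumption: the source does not state how many episodes were run per task, so `TRIALS` is a placeholder chosen only for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return center - half, center + half


# Hypothetical trial count: e.g. 31 tasks x 100 episodes. This is an assumption.
TRIALS = 3100
baseline = wilson_interval(round(0.872 * TRIALS), TRIALS)  # reported 87.2% (fixed)
method = wilson_interval(round(0.891 * TRIALS), TRIALS)    # reported 89.1% (fixed)
overlap = baseline[1] >= method[0]
```

Under that assumed trial count the two intervals overlap. Overlap alone does not settle significance, and pooling episodes across tasks ignores per-task structure, but it illustrates why a 1.9-point gap is hard to interpret without variability measures.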
minor comments (2)
- The abstract and method sections should clarify the dimensionality and representation of the compact scene update (e.g., token count, embedding size) so readers can assess its information capacity relative to the VLM context length.
- Figure captions and experimental protocol should explicitly state the number of evaluation seeds and whether the same random seeds were used across all compared methods.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and will incorporate revisions to improve clarity and rigor.
Point-by-point responses
- Referee: Abstract and method description of the recurrent scene prefix: the central claim that the compact scene update produced by the action decoder supplies a prior that meaningfully reflects contact, occlusion, and motion effects rests on an unverified assumption. No explicit 3-D consistency loss or geometric regularizer is enforced at inference, and the paper provides no ablation that isolates the contribution of this update versus other architectural choices. If the learned update is lossy, error accumulation across chunks would undermine the reported gains over standard chunked VLA baselines.
  Authors: We acknowledge the absence of an explicit ablation isolating the recurrent scene update. The training objective does include the Scene Predictor (future scene-token targets) and Geometric Anchor (alignment with frozen depth and 3D teachers), which provide geometric supervision to the scene update produced by the action decoder. At inference the VLM fuses the current observation with this prior at every chunk, supplying a corrective signal that limits unchecked accumulation. Nevertheless, we agree that a targeted ablation is needed to quantify the update's isolated contribution. In the revised manuscript we will add an ablation that disables the recurrent prior (replacing it with a zero or static initialization) while keeping all other components fixed, and report the resulting success rates on the same 31 RoboTwin tasks.
  Revision: yes
- Referee: Results section reporting RoboTwin success rates: the improvements (87.2 % → 89.1 % fixed; 86.1 % → 88.5 % randomized) are presented without error bars, confidence intervals, or statistical significance tests. In the absence of these, it is impossible to determine whether the observed deltas exceed run-to-run variance or are attributable to the scene-evolution mechanism rather than hyper-parameter or implementation differences.
  Authors: We agree that variability measures and statistical tests are required to substantiate the reported gains. In the revised manuscript we will rerun the full evaluation suite across five random seeds, report mean success rates together with standard deviations, and include paired t-test p-values comparing EvoScene-VLA against each baseline. These additions will allow readers to assess whether the 1.9 pp and 2.4 pp improvements exceed typical run-to-run variance.
  Revision: yes
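The five-seed paired comparison promised in the rebuttal can be sketched with the standard library alone. The per-seed numbers below are toy placeholders, not results from the paper, and the names `paired_t_statistic`, `method_runs`, and `baseline_runs` are hypothetical. The sketch stops at the t statistic and compares it with the tabulated two-sided 5% critical value for 4 degrees of freedom rather than computing a p-value.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference divided by its standard error."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Toy per-seed success rates in percent for five seeds. NOT results from the paper.
method_runs = [81.0, 83.0, 82.0, 84.0, 80.0]
baseline_runs = [79.0, 80.0, 81.0, 81.0, 79.0]

t = paired_t_statistic(method_runs, baseline_runs)
T_CRIT_DF4 = 2.776  # two-sided 5% critical value of Student's t with 4 degrees of freedom
significant = abs(t) > T_CRIT_DF4
```

Pairing by seed matters only if both methods are evaluated under the same seeds, which is one of the protocol details the referee's minor comments ask the authors to state.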
Circularity Check
No circularity: empirical benchmark gains rest on independent evaluation, not self-referential definitions or fits
full rationale
The paper introduces EvoScene-VLA as an architectural extension to chunked VLAs, using a recurrent scene prefix updated inside the action decoder and corrected by the VLM against new observations. All reported results consist of direct empirical success-rate comparisons on 31 RoboTwin tasks (fixed and randomized) plus real-robot trials against baselines. No equations, uniqueness theorems, or parameter-fitting steps are described that would make any performance claim reduce by construction to the method's own inputs or to a self-citation chain. Auxiliary training modules (Scene Predictor, Geometric Anchor) are explicitly discarded at deployment, so the inference-time behavior and measured gains remain externally falsifiable on the benchmark tasks. This is a standard empirical ML paper whose derivation chain is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The VLM can usefully combine current visual observation with the action-updated scene prior from the previous chunk.
invented entities (1)
- Recurrent scene prefix: no independent evidence