Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models
Pith reviewed 2026-05-10 05:51 UTC · model grok-4.3
The pith
PDF improves Vision-Language-Action models at test time by using delayed feedback to adjust action predictions and reduce overfitting to spurious correlations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PDF is a verifier-free test-time adaptation framework that mitigates trajectory overfitting in frozen Vision-Language-Action models through uncertainty-based data augmentation combined with action voting and an adaptive budget scheduler, while a lightweight perturbation module retrospectively corrects action logits using delayed feedback signals to improve decision stability.
What carries the argument
The lightweight perturbation module that learns to adjust the base model's action logits retrospectively from delayed feedback.
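The retrospective adjustment can be pictured as a small additive correction applied to the frozen model's action logits before the final decision. A minimal sketch, with names (`perturbed_action`, `delta`) that are ours rather than the paper's:

```python
import numpy as np

def perturbed_action(base_logits: np.ndarray, delta: np.ndarray) -> int:
    """Pick the argmax action after applying a learned logit correction.

    `base_logits` comes from the frozen VLA; `delta` is the (hypothetical)
    output of the lightweight perturbation module. The base model's weights
    are never touched -- only the correction is learned.
    """
    assert base_logits.shape == delta.shape
    return int(np.argmax(base_logits + delta))

logits = np.array([2.0, 1.5, 0.1])   # illustrative frozen VLA output
delta = np.array([-1.5, 0.2, 0.0])   # illustrative retrospective correction
print(perturbed_action(logits, delta))  # flips the choice from action 0 to action 1
```

The point is that decision behavior changes while the base model stays frozen, which is what distinguishes this from fine-tuning.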
If this is right
- Vision-Language-Action models reach higher task success without retraining the original weights.
- The same gains appear across both robotic manipulation and game-playing domains.
- An adaptive scheduler keeps the added computation from growing unbounded during long episodes.
- The approach works without a separate verifier or ground-truth labels at test time.
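The augmentation-plus-voting step with a bounded budget can be sketched as follows. The early-exit rule here stands in for the paper's adaptive scheduler and is an illustrative assumption, not the authors' exact criterion:

```python
import random
from collections import Counter

def vote_action(policy, obs, augment, budget):
    """Query the frozen policy on augmented views and majority-vote the action.

    Exits early once one action holds an unbeatable majority, so easy
    decisions consume less of the augmentation budget (a stand-in for the
    adaptive scheduler described in the text).
    """
    votes = Counter()
    for k in range(budget):
        votes[policy(augment(obs))] += 1
        top, count = votes.most_common(1)[0]
        if count > budget / 2:  # majority already decided; stop sampling
            return top, k + 1
    return votes.most_common(1)[0][0], budget

# toy setup: action depends only on the sign of a (noisy) scalar observation
policy = lambda x: int(x > 0)
augment = lambda x: x + random.gauss(0.0, 0.1)
action, used = vote_action(policy, 1.0, augment, budget=8)
```

Under this toy setup the vote is nearly always unanimous, so the loop stops well before exhausting the budget, which is the efficiency argument in miniature.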
Where Pith is reading between the lines
- The delayed-feedback correction could be useful in real-world settings where action outcomes are observed only after a delay.
- Similar retrospective adjustment might help other sequence models that suffer from training-trajectory bias.
- Accurate uncertainty estimates are required for the augmentation step to target the right predictions.
Load-bearing premise
Trajectory overfitting to spurious action-entity correlations is the dominant source of fragility to environmental shifts, and uncertainty augmentation plus delayed-feedback correction can fix it without introducing new instabilities.
What would settle it
Running PDF on the LIBERO or Atari benchmarks and observing no gain or a drop in success rate relative to the vanilla Vision-Language-Action model on tasks that include small object-pose changes would falsify the claimed benefit.
read the original abstract
Vision-Language-Action models (VLAs) achieve remarkable performance in sequential decision-making but remain fragile to subtle environmental shifts, such as small changes in object pose. We attribute this brittleness to trajectory overfitting, where VLAs over-attend to the spurious correlation between actions and entities, then reproduce memorized action patterns. We propose Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves decision performance without fine-tuning the base model. PDF mitigates the spurious correlation through uncertainty-based data augmentation and action voting, while an adaptive scheduler allocates augmentation budgets to balance performance and efficiency. To further improve stability, PDF learns a lightweight perturbation module that retrospectively adjusts action logits guided by delayed feedback, correcting overconfidence issue. Experiments on LIBERO (+7.4% success rate) and Atari (+10.3 human normalized score) demonstrate consistent gains of PDF in task success over vanilla VLA and VLA with test-time adaptation, establishing a practical path toward reliable test-time adaptation in multimodal decision-making agents. The code is available at https://github.com/zhoujiahuan1991/CVPR2026-PDF.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework for Vision-Language-Action (VLA) models. It attributes model brittleness to trajectory overfitting that produces spurious correlations and overconfident actions. PDF combines uncertainty-based data augmentation with action voting, an adaptive scheduler to allocate augmentation budgets, and a lightweight perturbation module trained retrospectively on delayed feedback to adjust action logits. Experiments report gains of +7.4% success rate on LIBERO and +10.3 human-normalized score on Atari over vanilla VLA and other test-time adaptation baselines, with public code released.
Significance. If the empirical results hold under rigorous validation, the work provides a practical, training-free route to improving robustness of multimodal decision-making agents. The public code release is a clear strength that supports reproducibility and follow-on research in test-time adaptation.
major comments (3)
- [Experiments] Experiments section: the central performance claims (+7.4% success rate on LIBERO, +10.3 HNS on Atari) are reported without error bars, number of evaluation seeds, ablation tables, or statistical significance tests. This directly undermines assessment of whether the gains are reliable and load-bearing for the paper's empirical contribution.
- [Introduction and §3] Introduction and §3 (Method): the attribution of brittleness primarily to trajectory overfitting is presented as the motivating assumption, yet no direct evidence, diagnostic experiments, or comparison against alternative failure modes (e.g., visual encoder limitations or policy architecture) is provided to establish that this is the dominant factor.
- [§3.3] §3.3 (Perturbation module): the description of how delayed feedback is used to train the lightweight module and whether it requires ground-truth signals at test time is insufficient to verify the verifier-free claim and to confirm that the module does not introduce new instabilities.
minor comments (1)
- [Abstract] Abstract: the phrase 'correcting overconfidence issue' should read 'correcting the overconfidence issue' for grammatical consistency.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the central performance claims (+7.4% success rate on LIBERO, +10.3 HNS on Atari) are reported without error bars, number of evaluation seeds, ablation tables, or statistical significance tests. This directly undermines assessment of whether the gains are reliable and load-bearing for the paper's empirical contribution.
Authors: We agree that the absence of error bars, details on the number of evaluation seeds, ablation tables, and statistical significance tests weakens the empirical claims. In the revised manuscript, we will include results averaged over multiple random seeds (at least 5), report standard deviations as error bars, expand the ablation studies, and include statistical significance tests (e.g., paired t-tests) to demonstrate that the reported improvements are reliable. We will also make the evaluation protocol clearer. revision: yes
-
Referee: [Introduction and §3] Introduction and §3 (Method): the attribution of brittleness primarily to trajectory overfitting is presented as the motivating assumption, yet no direct evidence, diagnostic experiments, or comparison against alternative failure modes (e.g., visual encoder limitations or policy architecture) is provided to establish that this is the dominant factor.
Authors: The attribution to trajectory overfitting stems from our empirical observations that VLAs often replicate training trajectories under minor environmental changes, leading to spurious correlations. However, we acknowledge the lack of direct diagnostic evidence in the current manuscript. We will add diagnostic experiments in the revised version, including attention visualization, controlled perturbation tests, and comparisons to isolate trajectory overfitting from other potential issues such as visual encoder limitations. This will provide stronger support for our motivating assumption. revision: yes
-
Referee: [§3.3] §3.3 (Perturbation module): the description of how delayed feedback is used to train the lightweight module and whether it requires ground-truth signals at test time is insufficient to verify the verifier-free claim and to confirm that the module does not introduce new instabilities.
Authors: We apologize for the lack of clarity in §3.3. The lightweight perturbation module is trained retrospectively using delayed feedback obtained from the environment after action execution (e.g., task completion signals or reward signals), without any ground-truth action labels or external verifiers. This preserves the verifier-free property as it relies solely on standard environment interactions. We will revise the section to include a detailed explanation, pseudocode for the training process, and an analysis of potential instabilities with corresponding mitigation strategies to ensure stability. revision: yes
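The mechanism described in this response, updating the perturbation head only once a delayed scalar signal arrives and never using action labels, might look like the following toy sketch. The class name, update rule, and learning rate are illustrative assumptions, not the paper's objective:

```python
import numpy as np

class PerturbationHead:
    """Toy stand-in for the lightweight perturbation module.

    The head holds a per-action logit correction. It is updated only when a
    delayed scalar feedback signal (e.g., a task-completion reward) arrives,
    so no ground-truth action labels or external verifier are required.
    """

    def __init__(self, n_actions: int, lr: float = 0.5):
        self.delta = np.zeros(n_actions)  # learned logit correction
        self.lr = lr

    def correct(self, base_logits: np.ndarray) -> np.ndarray:
        return base_logits + self.delta

    def delayed_update(self, action: int, feedback: float):
        """Reinforce or suppress the taken action once feedback arrives."""
        self.delta[action] += self.lr * feedback

head = PerturbationHead(n_actions=3)
head.delayed_update(action=2, feedback=-1.0)  # episode failed: suppress action 2
```

This illustrates the verifier-free claim: the only supervision is the environment's own delayed signal.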
Circularity Check
No significant circularity
full rationale
The paper describes an empirical test-time adaptation method (PDF) for VLAs that combines uncertainty-based augmentation, action voting, an adaptive scheduler, and a lightweight perturbation module trained on delayed feedback. No equations, derivations, or parameter-fitting steps are presented that reduce the reported performance gains to quantities defined by construction from the method's own inputs. The central claims rest on experimental results from LIBERO and Atari rather than self-referential definitions, fitted-input predictions, or load-bearing self-citations. The approach is self-contained as an engineering contribution with public code.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VLAs achieve strong performance but remain fragile to subtle environmental shifts due to trajectory overfitting and spurious correlations between actions and entities.
Reference graph
Works this paper leans on
-
[1]
The arcade learning environment: An evaluation platform for general agents
Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res., 47:253–279, 2013
2013
-
[2]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π0: A visio...
2024
-
[3]
On tiny episodic memories in continual learning
Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, and Marc'Aurelio Ranzato. On tiny episodic memories in continual learning, 2019
2019
-
[4]
TGRPO: fine-tuning vision-language-action model via trajectory-wise group relative policy optimization
Zengjue Chen, Runliang Niu, He Kong, and Qi Wang. TGRPO: fine-tuning vision-language-action model via trajectory-wise group relative policy optimization. CoRR, abs/2506.08440, 2025
2025
-
[5]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. Int. J. Robotics Res., 44(10-11):1684–1704, 2025
2025
-
[6]
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. LIBERO-Plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626, 2025
2025
-
[7]
Jack of all trades, master of some, a multi-purpose transformer agent
Quentin Gallouédec, Edward Beeching, Clément Romac, and Emmanuel Dellandréa. Jack of all trades, master of some, a multi-purpose transformer agent. CoRR, abs/2402.09844, 2024
-
[8]
Octo: An open-source generalist robot policy
Dibya Ghosh, Homer Rich Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Quan Vuong, Ted Xiao, Pannag R. Sanketi, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Robotics: Science and Systems XX, Delft, The ...
2024
-
[9]
An embodied generalist agent in 3d world
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024
2024
-
[10]
OpenReview.net, 2024
2024
-
[11]
Suhyeok Jang, Dongyoung Kim, Changyeon Kim, Youngsuk Kim, and Jinwoo Shin. Verifier-free test-time sampling for vision language action models. CoRR, abs/2510.05681, 2025
-
[12]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Paul Foster, Pannag R. Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In Conference on Robot Learning, 6-9 ...
2024
-
[13]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. CoRR, abs/2502.19645, 2025
2025
-
[14]
Test-time adaptation for online vision-language navigation with feedback-based reinforcement learning
Sungjune Kim, Gyeongrok Oh, Heeju Ko, Daehyun Ji, Dongwook Lee, Byung-Jun Lee, Sujin Jang, and Sangpil Kim. Test-time adaptation for online vision-language navigation with feedback-based reinforcement learning. In Forty-second International Conference on Machine Learning, 2025
2025
-
[15]
Jacky Kwok, Christopher Agia, Rohan Sinha, Matthew Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, and Marco Pavone. Robomonkey: Scaling test-time sampling and verification for vision-language-action models. CoRR, abs/2506.17811, 2025
-
[16]
Test-time adaptation with binary feedback
Taeckyung Lee, Sorn Chottananurak, Junsu Kim, Jinwoo Shin, Taesik Gong, and Sung-Ju Lee. Test-time adaptation with binary feedback. CoRR, abs/2505.18514, 2025
-
[17]
Metavla: Unified meta co-training for efficient embodied adaption
Chen Li, Zhantao Yang, Han Zhang, Fangyi Chen, Chenchen Zhu, Anudeepsekhar Bolimera, and Marios Savvides. Metavla: Unified meta co-training for efficient embodied adaption. CoRR, abs/2510.05580, 2025
-
[18]
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, and Ning Ding. SimpleVLA-RL: Scaling VLA training via reinforcement learning. CoRR, abs/2509.09674, 2025
2025
-
[19]
JARVIS-VLA: post-training large-scale vision language models to play visual games with keyboards and mouse
Muyao Li, Zihao Wang, Kaichen He, Xiaojian Ma, and Yitao Liang. JARVIS-VLA: post-training large-scale vision language models to play visual games with keyboards and mouse. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 17878–17899. Association for Computational Linguistics, 2025
2025
-
[20]
Vision-language foundation models as effective robot imitators
Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, and Tao Kong. Vision-language foundation models as effective robot imitators. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024
2024
-
[21]
LIBERO: benchmarking knowledge transfer for lifelong robot learning
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: benchmarking knowledge transfer for lifelong robot learning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023
2023
-
[22]
Packnet: Adding multiple tasks to a single network by iterative pruning
Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 7765–7773. Computer Vision Foundation / IEEE Computer Society, 2018
2018
-
[23]
Steering your generalists: Improving robotic foundation models via value guidance
Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. In Conference on Robot Learning, 6-9 November 2024, Munich, Germany, pages 4996–5013. PMLR, 2024
2024
-
[24]
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action model. CoRR, abs/2501.15830, 2025
2025
-
[25]
Scott E. Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. Trans. M...
2022
-
[26]
Introducing rfm-1: Giving robots human-like reasoning capabilities, 2024
A Sohn, A Nagabandi, C Florensa, D Adelberg, D Wu, H Farooq, I Clavera, J Welborn, J Chen, N Mishra, et al. Introducing rfm-1: Giving robots human-like reasoning capabilities, 2024
2024
-
[27]
Lingo-2: Driving with natural language, 2024
Wayve Research Team et al. Lingo-2: Driving with natural language, 2024
2024
-
[28]
Tent: Fully test-time adaptation by entropy minimization
Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno A. Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021
2021
-
[29]
Any-point trajectory modeling for policy learning
Chuan Wen, Xingyu Lin, John Ian Reyes So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning. In Robotics: Science and Systems XX, Delft, The Netherlands, July 15-19, 2024, 2024
2024
-
[30]
SCAP: transductive test-time adaptation via supportive clique-based attribute prompting
Chenyu Zhang, Kunlun Xu, Zichen Liu, Yuxin Peng, and Jiahuan Zhou. SCAP: transductive test-time adaptation via supportive clique-based attribute prompting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 30032–30041. Computer Vision Foundation / IEEE, 2025
2025
-
[31]
Cot-vla: Visual chain-of-thought reasoning for vision-language-action models
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Tsung-Yi Lin, Gordon Wetzstein, Ming-Yu Liu, and Donglai Xiang. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 20...
2025
-
[32]
3d-vla: A 3d vision-language-action generative world model
Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024
2024
-
[33]
Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025
2025
-
[34]
Jiahuan Zhou, Chao Zhu, Zhenyu Cui, Zichen Liu, Xu Zou, and Gang Hua. Class-aware domain knowledge fusion and fission for continual test-time adaptation. CoRR, abs/2510.12150, 2025
-
[35]
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong T. Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalew...
2023
-
[36]
Pseudo-Code
We provide a brief overview of the training pipeline outlined in Algorithm 1, which implements PDF.
Algorithm 1: Perturbation Learning with Delayed Feedback (PDF)
Require: Pretrained VLA parameters ϕ (frozen); perturbation head parameters θ (trainable); maximum augmentation budget N_max; buffer D.
1: for each episode do
2:   for each timestep t do
3:     Observ...
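A minimal runnable rendering of that skeleton is sketched below. The algorithm text is truncated after the observation step, so everything past that point is our hedged reconstruction with placeholder names (`vla`, `perturb_head`, `augment`), not the authors' exact procedure:

```python
import random

def pdf_episode(env_steps, vla, perturb_head, augment, n_max):
    """Loop over timesteps, vote over augmented views, apply the correction.

    `env_steps` stands in for the episode's observations; `buffer` plays the
    role of D, holding decisions until delayed feedback would arrive.
    """
    buffer = []                           # D: pending (timestep, action) pairs
    for t, obs in enumerate(env_steps):   # for each timestep t
        votes = {}
        for _ in range(n_max):            # augmentation budget (fixed here)
            a = vla(augment(obs))
            votes[a] = votes.get(a, 0) + 1
        action = perturb_head(max(votes, key=votes.get))
        buffer.append((t, action))
    return buffer

# toy run: a 2-step "episode", identity perturbation, sign-based policy
steps = [1.0, -1.0]
out = pdf_episode(steps, vla=lambda x: int(x > 0),
                  perturb_head=lambda a: a,
                  augment=lambda x: x + random.gauss(0.0, 0.1), n_max=5)
```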
-
[37]
Additional Experiment Results on Atari 57
Table 4 presents the detailed results of PDF and JAT on the full Atari 57 benchmark.
Game | ID | JAT (Raw Score) | JAT (Human Normalized Score) | PDF (Raw Score) | PDF (Human Normalized Score)
ALIEN | 22 | 1427.9 ± 540.28 | 0.17 | 2034.4 ± 560.47 | 0.26
AMIDAR | 34 | 105 ± 76.93 | 0.06 | 150.2 ± 57.05 | 0.08
ASSAULT | 4 | 1627.57 ± 799.09 | 2.7 | 186...
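For reference, human-normalized scores in the Atari literature follow the standard convention HNS = (agent − random) / (human − random). The baseline values in the sketch below are placeholders, not the benchmark's actual per-game constants:

```python
def human_normalized_score(agent: float, random_score: float, human: float) -> float:
    """Standard Atari HNS: 0.0 matches a random policy, 1.0 matches a human."""
    return (agent - random_score) / (human - random_score)

# illustrative baselines only -- real per-game constants come from the benchmark
print(human_normalized_score(agent=150.0, random_score=50.0, human=250.0))  # 0.5
```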