VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation
Pith reviewed 2026-05-25 06:57 UTC · model grok-4.3
The pith
VGAS resolves geometric ambiguities in few-shot VLA adaptation by selecting precise action chunks with a value-guided critic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that VGAS performs inference-time best-of-N selection, using a fine-tuned VLA as the proposal generator and the Q-Chunk-Former, trained with Explicit Geometric Regularization, as a geometrically grounded Transformer critic. According to the paper, this resolves fine-grained geometric ambiguities and thereby consistently improves success rates and robustness under limited demonstrations and distribution shifts.
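The best-of-N step in this claim can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_best_chunk`, `toy_propose`, and `toy_critic` are hypothetical names, and the 1-D "reach the target" task stands in for a real VLA proposer and a learned critic.

```python
import random
from typing import Callable, List, Tuple

# An action chunk: h steps, each step a vector of action dimensions.
ActionChunk = List[List[float]]

_RNG = random.Random(0)  # fixed seed so the toy example is repeatable


def select_best_chunk(
    observation: float,
    propose: Callable[[float], ActionChunk],
    critic: Callable[[float, ActionChunk], float],
    num_candidates: int = 8,
) -> Tuple[ActionChunk, List[float]]:
    """Sample N chunks from the proposer and return the one the critic scores highest."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [propose(observation) for _ in range(num_candidates)]
    scores = [critic(observation, chunk) for chunk in candidates]
    best_index = max(range(num_candidates), key=scores.__getitem__)
    return candidates[best_index], scores


def toy_propose(target: float) -> ActionChunk:
    """Hypothetical proposer: a 4-step chunk scattered around the target."""
    return [[target + _RNG.gauss(0.0, 0.05)] for _ in range(4)]


def toy_critic(target: float, chunk: ActionChunk) -> float:
    """Hypothetical critic: negative mean absolute error (higher is better)."""
    return -sum(abs(step[0] - target) for step in chunk) / len(chunk)


best, scores = select_best_chunk(0.3, toy_propose, toy_critic, num_candidates=8)
```

The split mirrors the generation-selection framing: the proposer only needs to include a good chunk among its N samples, and the critic only needs to rank those samples correctly.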
What carries the argument
The Q-Chunk-Former, a geometrically grounded Transformer critic that evaluates action chunks to resolve fine-grained geometric ambiguities among near-miss candidates.
If this is right
- Success rates rise in new tasks when only limited demonstrations are available.
- Robustness increases when test conditions differ from training data.
- Failures from near-miss actions decline because the critic preserves ranking resolution among similar candidates.
- The generation-selection split allows the VLA to focus on recall while the critic handles geometric precision.
Where Pith is reading between the lines
- The separation of proposal generation from geometric evaluation could apply to other control settings where semantic plausibility and physical precision must both be satisfied.
- Inference-time selection might reduce the amount of fine-tuning needed by shifting some resolution burden to a lightweight critic.
- Integrating geometric regularization signals earlier in training could further stabilize value estimates when data remains scarce.
Load-bearing premise
A separate Transformer critic trained with explicit geometric regularization can reliably distinguish fine geometric differences among near-miss action chunks when demonstrations are scarce.
What would settle it
An experiment in which replacing the Q-Chunk-Former critic with random selection or semantic-only ranking among action chunks leaves success rates and robustness unchanged. Such a result would show that the critic adds nothing beyond the proposal generator.
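One offline proxy for this comparison is a top-1 hit rate: how often the selector's first choice is a successful candidate, set against the expected hit rate of uniform random selection. The sketch below assumes per-candidate success labels are available, which the source does not state. The function names are hypothetical and the toy scores and labels are made up for illustration.

```python
from typing import Sequence


def top1_hit_rate(
    scores: Sequence[Sequence[float]], labels: Sequence[Sequence[bool]]
) -> float:
    """Fraction of states where the highest-scored candidate is a successful one."""
    hits = 0
    for state_scores, state_labels in zip(scores, labels):
        best = max(range(len(state_scores)), key=state_scores.__getitem__)
        hits += bool(state_labels[best])
    return hits / len(scores)


def random_hit_rate(labels: Sequence[Sequence[bool]]) -> float:
    """Expected hit rate of uniform random selection over each candidate set."""
    return sum(sum(state) / len(state) for state in labels) / len(labels)


# Made-up toy data: two states, three candidates each.
toy_scores = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.1]]
toy_labels = [[False, True, False], [True, False, False]]

critic_rate = top1_hit_rate(toy_scores, toy_labels)
baseline_rate = random_hit_rate(toy_labels)
```

If the critic's rate does not exceed the random baseline on held-out candidates, the selection stage is not doing the work the paper attributes to it.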
Original abstract
Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near-miss actions lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-N selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a finetuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic to resolve fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VGAS, a generation-selection framework for few-shot Vision-Language-Action (VLA) adaptation. A fine-tuned VLA serves as a high-recall proposal generator producing action chunks; a separate geometrically grounded Transformer critic (Q-Chunk-Former) trained with Explicit Geometric Regularization (EGR) performs inference-time best-of-N selection to resolve fine-grained geometric ambiguities among near-miss chunks. The authors claim that experiments and theoretical analysis show consistent gains in success rate and robustness under scarce demonstrations and distribution shifts.
Significance. If the central claim holds, the separation of proposal generation from geometrically discriminative selection could offer a practical route to more reliable few-shot VLA adaptation without requiring large additional datasets. The approach is modular and could be combined with existing VLA backbones.
major comments (2)
- [Abstract, §3] Abstract and §3 (framework description): the claim that EGR 'shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates' is load-bearing for the central claim, yet no derivation or bound is supplied showing that the regularization term guarantees correct ranking when the number of geometrically successful positive examples is much smaller than the number of near-miss candidates. Without such a guarantee, it is unclear why the critic will not collapse to semantic rather than metric cues under scarce supervision.
- [§5] §5 (experiments): the abstract asserts that 'experiments and theoretical analysis demonstrate' consistent improvements, but the provided text contains no tables, error bars, baseline comparisons, or ablation results quantifying the contribution of Q-Chunk-Former + EGR versus the base VLA policy. This prevents verification that the selection step adds robustness beyond the proposal generator.
minor comments (2)
- [§3] Notation for the critic (Q-Chunk-Former) and the EGR loss term should be introduced with explicit equations rather than descriptive prose only.
- [Abstract] The GitHub link is provided but no statement is made about whether the released code includes the exact training and inference scripts used for the reported results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, indicating planned revisions where the manuscript requires strengthening.
- Referee: [Abstract, §3] Abstract and §3 (framework description): the claim that EGR 'shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates' is load-bearing for the central claim, yet no derivation or bound is supplied showing that the regularization term guarantees correct ranking when the number of geometrically successful positive examples is much smaller than the number of near-miss candidates. Without such a guarantee, it is unclear why the critic will not collapse to semantic rather than metric cues under scarce supervision.
  Authors: We acknowledge that the manuscript does not supply a formal derivation or bound guaranteeing ranking preservation when positive geometric examples are scarce relative to near-miss candidates. The EGR term is introduced as an explicit penalty on value instability derived from the critic's geometric input features, with the intent of encouraging metric sensitivity; however, no proof is given that this prevents collapse to semantic cues. In revision we will add a short subsection in §3 providing a simplified analysis of the regularization's effect on the value landscape (under the assumption of a sufficiently expressive critic) and a qualitative argument why semantic collapse is mitigated, though we stop short of claiming a general guarantee.
  Revision: partial
- Referee: [§5] §5 (experiments): the abstract asserts that 'experiments and theoretical analysis demonstrate' consistent improvements, but the provided text contains no tables, error bars, baseline comparisons, or ablation results quantifying the contribution of Q-Chunk-Former + EGR versus the base VLA policy. This prevents verification that the selection step adds robustness beyond the proposal generator.
  Authors: The full manuscript contains §5 with the requested results: tables reporting success rates (with standard deviations over 5 random seeds), direct comparisons against the base VLA policy, and ablations isolating Q-Chunk-Former and EGR. These quantify the incremental robustness gained by the selection stage. We will ensure that all tables, error bars, and ablation figures are explicitly referenced and rendered in the revised submission so that the contribution of the critic can be verified.
  Revision: yes
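The reporting convention mentioned in the rebuttal (success rates with standard deviations over 5 random seeds) can be sketched as below. The helper name `summarize_seeds` and the method labels are hypothetical, and the per-seed values are placeholders rather than results from the paper. The sketch uses the sample standard deviation; whether the paper reports sample or population deviation is not stated here.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize_seeds(success_rates: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed success rates."""
    if len(success_rates) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(success_rates), stdev(success_rates)


# Placeholder per-seed success rates, NOT results from the paper.
runs: Dict[str, List[float]] = {
    "method_a": [0.50, 0.55, 0.45, 0.50, 0.50],
    "method_b": [0.60, 0.65, 0.55, 0.60, 0.60],
}

for name, rates in runs.items():
    m, s = summarize_seeds(rates)
    print(f"{name}: {m:.3f} +/- {s:.3f} over {len(rates)} seeds")
```

Reporting the spread alongside the mean is what lets a reader judge whether a gap between two methods exceeds seed-to-seed variation.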
Circularity Check
No significant circularity; derivation relies on external experiments rather than self-referential definitions or fits
full rationale
The provided abstract and framework description introduce VGAS as a generation-selection approach using a finetuned VLA proposer and a separate Q-Chunk-Former critic with EGR, but contain no equations, fitted parameters, or self-citations that reduce the claimed success-rate improvements or geometric ranking to inputs by construction. The theoretical analysis is asserted without visible reduction steps, and the central mechanism (best-of-N selection via value guidance) is presented as an independent architectural choice validated by experiments. This matches the default case of a self-contained proposal whose validity rests on empirical results outside any definitional loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A fine-tuned VLA model can serve as a high-recall proposal generator for action chunks.
invented entities (2)
- Q-Chunk-Former: no independent evidence
- Explicit Geometric Regularization (EGR): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Explicit Geometric Regularization (EGR) ... shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Autonomous Drift Learning in Data Streams: A Unified Perspective. A survey proposes a novel 3D taxonomy classifying drifts into time stream, data stream, and model stream categories to unify research on non-stationary autonomous learning.
[48]
A Related Work A.1 Vision-Language-Action Models. The intersection of computer vision and robotic control has been advanced by Vision-Language-Action (VLA) models, which endow high-capacity Vision-Language Models (VLMs) with actuation capabilities to map multimodal inputs (visual observa- tions and natural language instructions) to executable robot action...
work page 2023
-
[49]
show significant structural differences. This indicates that the VGAS does not merely memorize static geometric relations but adaptively adjusts its estimation according to evolving real-world dynamics, providing state-aware guidance throughout the entire task horizon. 0.14 0.16 0.18 0.20 0.22 0.24 0.26T op1 Hit Rate (candidates only) Libero_Goal 0.100 0....
work page 2000
-
[50]
Additionally, we use shifted rewards{−1,1}instead of{0,1}, which we found to yield more stable learning in practice. (ii) Training protocol.We first perform supervised fine-tuning (SFT) of the VLA model using 5-shot expert demonstrations per task, randomly sampled from the LIBERO dataset. We then train a critic using different variants of offline RL (ORL)...
work page 2025
-
[51]
Our Q-Chunk- Former is initialized from the first two layers of the SmolVLM backbone. We directly reuse the multimodal features extracted by the frozen SmolVLM (i.e., the output of the SmolVLA encoder) as the vision–language input to Q-Chunk-Former. In our notation, theQ-chunk lengthhdenotes the length of an action chunk, whileN-action-stepindicates that ...
work page 2014
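The paper excerpt on training reports using shifted rewards {−1, 1} instead of {0, 1}. A minimal sketch of one such shift is the affine map r' = 2r − 1, which sends 0 to −1 and 1 to 1. The excerpt does not say which map the authors use, or whether −1 is assigned at every non-success step or only at failure terminals, so both are assumptions here. The function names are hypothetical.

```python
from typing import List


def shift_reward(reward: float) -> float:
    """Map a binary reward in {0, 1} to {-1, 1} via r' = 2r - 1 (assumed form)."""
    if reward not in (0.0, 1.0):
        raise ValueError("expected a binary reward of 0 or 1")
    return 2.0 * reward - 1.0


def shift_episode(rewards: List[float]) -> List[float]:
    """Apply the shift to every step of an episode's reward sequence."""
    return [shift_reward(r) for r in rewards]


# A three-step episode that succeeds on the last step.
shifted = shift_episode([0.0, 0.0, 1.0])
```

Under this map, failure and success targets are symmetric around zero, which is one plausible reason a shift of this kind might stabilize value learning. The excerpt itself only reports the effect empirically.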