VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

Changhua Xu; En Yu; Jie Lu; Junyu Xuan

arxiv: 2602.07399 · v2 · pith:36ODUBJHnew · submitted 2026-02-07 · 💻 cs.AI · cs.CV

Pith reviewed 2026-05-25 06:57 UTC · model grok-4.3

classification 💻 cs.AI cs.CV

keywords: few-shot VLA adaptation · action chunk selection · geometric regularization · vision-language-action · Q-Chunk-Former · value-guided selection

The pith

VGAS resolves geometric ambiguities in few-shot VLA adaptation by selecting precise action chunks with a value-guided critic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that failures in adapting vision-language-action models to new tasks with scarce demonstrations often arise from unresolved geometric ambiguities among near-miss actions. It proposes separating high-recall proposal generation from selection via a dedicated critic to identify chunks that are both semantically faithful and geometrically precise. VGAS introduces the Q-Chunk-Former critic trained with explicit geometric regularization to shape a discriminative value landscape. If the approach holds, few-shot adaptation becomes more reliable and robust to distribution shifts without requiring large datasets.

Core claim

The paper claims that VGAS performs inference-time best-of-N selection, using a fine-tuned VLA as the proposal generator and the Q-Chunk-Former, trained with Explicit Geometric Regularization, as a geometrically grounded Transformer critic. The claimed effect is that this resolves fine-grained geometric ambiguities and thereby consistently improves success rates and robustness under limited demonstrations and distribution shifts.
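The generation-selection split described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_best_chunk`, `toy_propose`, and `toy_critic` are hypothetical names, the proposal generator is a noisy stand-in for a fine-tuned VLA, and the critic is a hand-written distance score rather than a trained Q-Chunk-Former.

```python
import numpy as np


def select_best_chunk(propose, critic, observation, n_candidates=8, rng=None):
    """Best-of-N selection: sample N action chunks, score each, keep the argmax."""
    rng = np.random.default_rng() if rng is None else rng
    # Each candidate is an action chunk of shape (horizon, action_dim).
    candidates = np.stack([propose(observation, rng) for _ in range(n_candidates)])
    # The critic assigns one scalar value per (observation, chunk) pair.
    scores = np.array([critic(observation, chunk) for chunk in candidates])
    best_index = int(np.argmax(scores))
    return candidates[best_index], scores


def toy_propose(observation, rng, horizon=4):
    """Hypothetical proposal generator: noisy chunks around a target action."""
    target = np.tile(observation, (horizon, 1))
    return target + rng.normal(scale=0.1, size=target.shape)


def toy_critic(observation, chunk):
    """Hypothetical critic: higher value for chunks closer to the target."""
    return -float(np.mean((chunk - observation) ** 2))


if __name__ == "__main__":
    obs = np.array([0.5, -0.2])  # assumed 2-D target action, for illustration only
    chunk, scores = select_best_chunk(toy_propose, toy_critic, obs,
                                      rng=np.random.default_rng(0))
    print(chunk.shape, scores.shape)
```

With `n_candidates=1` the function simply returns the single proposal, which corresponds to running the proposal policy without any selection.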

What carries the argument

The Q-Chunk-Former, a geometrically grounded Transformer critic that evaluates action chunks to resolve fine-grained geometric ambiguities among near-miss candidates.

If this is right

Success rates rise in new tasks when only limited demonstrations are available.
Robustness increases when test conditions differ from training data.
Failures from near-miss actions decline because the critic preserves ranking resolution among similar candidates.
The generation-selection split allows the VLA to focus on recall while the critic handles geometric precision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of proposal generation from geometric evaluation could apply to other control settings where semantic plausibility and physical precision must both be satisfied.
Inference-time selection might reduce the amount of fine-tuning needed by shifting some resolution burden to a lightweight critic.
Integrating geometric regularization signals earlier in training could further stabilize value estimates when data remains scarce.

Load-bearing premise

A separate Transformer critic trained with explicit geometric regularization can reliably distinguish fine geometric differences among near-miss action chunks when demonstrations are scarce.

What would settle it

An experiment in which the Q-Chunk-Former critic yields no improvement in success rates or robustness over random selection or semantic-only ranking among the same candidate action chunks.
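The shape of such a control experiment can be sketched with a toy simulation. Everything here is an assumption made for illustration: each candidate chunk is reduced to a single signed geometric error, success means the selected error falls within a tolerance, and the "oracle" selector stands in for an idealized geometry-aware critic. The function names are hypothetical and no numbers from the paper are reproduced.

```python
import numpy as np


def simulate_success(selector, n_trials=2000, n_candidates=8, tolerance=0.3, seed=0):
    """Fraction of trials in which the selected candidate is within tolerance."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_trials):
        # Toy stand-in: one signed geometric error per candidate chunk.
        errors = rng.normal(size=n_candidates)
        chosen = selector(errors, rng)
        successes += int(abs(errors[chosen]) < tolerance)
    return successes / n_trials


def random_selector(errors, rng):
    """Control condition: pick a candidate uniformly at random."""
    return int(rng.integers(len(errors)))


def oracle_geometric_selector(errors, rng):
    """Idealized critic: always picks the candidate with the smallest error."""
    return int(np.argmin(np.abs(errors)))


if __name__ == "__main__":
    print("random :", simulate_success(random_selector))
    print("oracle :", simulate_success(oracle_geometric_selector))
```

In a real evaluation the oracle would be replaced by the trained critic and the toy errors by environment rollouts. If the critic's success rate were statistically indistinguishable from the random control, the selection stage would not be adding anything.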

Figures

Figures reproduced from arXiv: 2602.07399 by Changhua Xu, En Yu, Jie Lu, Junyu Xuan.

**Figure 1.** Illustration of near-miss actions distribution under 5-shot VLA finetuning. To concretely examine this data-scarce regime, we simulate a few-shot adaptation setting that reflects realistic deployment constraints. Specifically, we fine-tune a pretrained VLA policy with five task demonstrations and evaluate its execution behavior on novel task instances. As shown in …

**Figure 2.** The overall framework of VGAS. Generation: A fine-tuned VLA policy proposes N candidate action chunks from multimodal inputs. Selection: Q-Chunk-Former learns a scoring function Q via the EGR+TD objective. Best-of-N selection defines the induced policy $\pi^{(N)}_{\mu,Q}$ by maximizing over a discriminative value landscape shaped by EGR, prioritizing expert-aligned candidates and thereby mitigating geometric drift. …

**Figure 3.** Visualization of the Proposal-Candidate Value Landscape: CQL vs. EGR (Ours)

**Figure 4.** Multi-view Spatial Rollouts of Action Chunks and …

**Figure 5.** Offline Ranking Evaluation on Held-out Data.

**Figure 6.** Impact of Inference Budget N. Evaluation on LIBERO-Goal showing monotonic improvement from the baseline (N = 1) to saturation around N = 8. Crucially, the offline dataset D used to train the critic (for both VGAS and baselines) is constructed exclusively from these same 5-shot demonstrations, ensuring that value learning operates under the same strict data-scarce constraints. Baseline Configurations. Unles…
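The saturation with inference budget N noted in the Figure 6 caption has a simple idealized analogue. The sketch below assumes, purely for illustration, that each proposal is independently "good" with probability `p_good` and that the critic always picks a good proposal when one exists; neither assumption comes from the paper, and `best_of_n_success` is a hypothetical helper.

```python
def best_of_n_success(p_good, n):
    """P(at least one of n independent proposals is good) = 1 - (1 - p_good)**n."""
    if not 0.0 <= p_good <= 1.0:
        raise ValueError("p_good must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p_good) ** n


if __name__ == "__main__":
    # Gains shrink geometrically as n grows, so the curve flattens quickly.
    for n in (1, 2, 4, 8, 16):
        print(n, round(best_of_n_success(0.3, n), 3))
```

Under these assumptions the marginal gain from each extra candidate decays geometrically, which is one reason a best-of-N curve can flatten at a modest N. A learned critic with ranking errors would sit below this idealized curve.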

Original abstract

Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near-miss actions lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-N selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a finetuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic to resolve fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.
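The best-of-N selection rule in the abstract can be written compactly. The notation follows the Figure 2 caption, with $\mu$ the fine-tuned VLA proposal policy and $Q$ the critic's scoring function. The state symbol $s$ and the candidate indexing are assumptions made here for readability, and this states only the selection step, not the EGR+TD training objective, which this page does not spell out.

```latex
a^{(1)}, \dots, a^{(N)} \sim \mu(\cdot \mid s), \qquad
\pi^{(N)}_{\mu,Q}(s) = a^{(i^\star)}, \qquad
i^\star = \arg\max_{i \in \{1, \dots, N\}} Q\bigl(s, a^{(i)}\bigr).
```

For $N = 1$ the rule reduces to sampling directly from $\mu$, which matches the baseline configuration named in the Figure 6 caption.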

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VGAS adds a separate geometric critic and EGR term for action chunk selection in few-shot VLA, but the abstract supplies no results or derivations to show the critic actually resolves the claimed ambiguities.

VGAS is a new inference-time selection method that treats a fine-tuned VLA as a proposal generator and adds a Q-Chunk-Former critic plus Explicit Geometric Regularization to pick better action chunks under scarce demonstrations. The core idea is to fix geometric near-misses that semantic fine-tuning misses. That framing is useful and the components are presented as a concrete combination not seen in the usual best-of-N baselines. The paper does a clear job naming the practical failure mode in current VLA adaptation. The soft spot is that the abstract claims experiments and theoretical analysis without showing any numbers, baselines, error bars, or a derivation that EGR preserves ranking when positive geometric examples are few. The stress-test concern about the critic collapsing to semantic cues rather than metric ones is not addressed by anything provided, so the central claim cannot be checked. This is aimed at robotics groups working on VLA deployment who need better few-shot robustness. A reader could extract the proposal-critic split and try it, but the write-up does not yet give enough to judge whether the added machinery delivers. I would send the full paper to peer review if the experiments include proper controls and the theory section actually derives the regularization effect; otherwise it stays too preliminary.

Referee Report

2 major / 2 minor

Summary. The paper proposes VGAS, a generation-selection framework for few-shot Vision-Language-Action (VLA) adaptation. A fine-tuned VLA serves as a high-recall proposal generator producing action chunks; a separate geometrically grounded Transformer critic (Q-Chunk-Former) trained with Explicit Geometric Regularization (EGR) performs inference-time best-of-N selection to resolve fine-grained geometric ambiguities among near-miss chunks. The authors claim that experiments and theoretical analysis show consistent gains in success rate and robustness under scarce demonstrations and distribution shifts.

Significance. If the central claim holds, the separation of proposal generation from geometrically discriminative selection could offer a practical route to more reliable few-shot VLA adaptation without requiring large additional datasets. The approach is modular and could be combined with existing VLA backbones.

major comments (2)

[Abstract, §3] Abstract and §3 (framework description): the claim that EGR 'shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates' is load-bearing for the central claim, yet no derivation or bound is supplied showing that the regularization term guarantees correct ranking when the number of geometrically successful positive examples is much smaller than the number of near-miss candidates. Without such a guarantee, it is unclear why the critic will not collapse to semantic rather than metric cues under scarce supervision.
[§5] §5 (experiments): the abstract asserts that 'experiments and theoretical analysis demonstrate' consistent improvements, but the provided text contains no tables, error bars, baseline comparisons, or ablation results quantifying the contribution of Q-Chunk-Former + EGR versus the base VLA policy. This prevents verification that the selection step adds robustness beyond the proposal generator.

minor comments (2)

[§3] Notation for the critic (Q-Chunk-Former) and the EGR loss term should be introduced with explicit equations rather than descriptive prose only.
[Abstract] The GitHub link is provided but no statement is made about whether the released code includes the exact training and inference scripts used for the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, indicating planned revisions where the manuscript requires strengthening.

Referee: [Abstract, §3] Abstract and §3 (framework description): the claim that EGR 'shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates' is load-bearing for the central claim, yet no derivation or bound is supplied showing that the regularization term guarantees correct ranking when the number of geometrically successful positive examples is much smaller than the number of near-miss candidates. Without such a guarantee, it is unclear why the critic will not collapse to semantic rather than metric cues under scarce supervision.

Authors: We acknowledge that the manuscript does not supply a formal derivation or bound guaranteeing ranking preservation when positive geometric examples are scarce relative to near-miss candidates. The EGR term is introduced as an explicit penalty on value instability derived from the critic's geometric input features, with the intent of encouraging metric sensitivity; however, no proof is given that this prevents collapse to semantic cues. In revision we will add a short subsection in §3 providing a simplified analysis of the regularization's effect on the value landscape (under the assumption of a sufficiently expressive critic) and a qualitative argument why semantic collapse is mitigated, though we stop short of claiming a general guarantee.
revision: partial

Referee: [§5] §5 (experiments): the abstract asserts that 'experiments and theoretical analysis demonstrate' consistent improvements, but the provided text contains no tables, error bars, baseline comparisons, or ablation results quantifying the contribution of Q-Chunk-Former + EGR versus the base VLA policy. This prevents verification that the selection step adds robustness beyond the proposal generator.

Authors: The full manuscript contains §5 with the requested results: tables reporting success rates (with standard deviations over 5 random seeds), direct comparisons against the base VLA policy, and ablations isolating Q-Chunk-Former and EGR. These quantify the incremental robustness gained by the selection stage. We will ensure that all tables, error bars, and ablation figures are explicitly referenced and rendered in the revised submission so that the contribution of the critic can be verified.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external experiments rather than self-referential definitions or fits

The provided abstract and framework description introduce VGAS as a generation-selection approach using a finetuned VLA proposer and a separate Q-Chunk-Former critic with EGR, but contain no equations, fitted parameters, or self-citations that reduce the claimed success-rate improvements or geometric ranking to inputs by construction. The theoretical analysis is asserted without visible reduction steps, and the central mechanism (best-of-N selection via value guidance) is presented as an independent architectural choice validated by experiments. This matches the default case of a self-contained proposal whose validity rests on empirical results outside any definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Only the abstract is available, so the ledger reflects components explicitly named there; full paper may contain additional fitted parameters or assumptions not visible here.

axioms (1)

domain assumption: A fine-tuned VLA model can serve as a high-recall proposal generator for action chunks.
Stated directly in the abstract as the first stage of the framework.

invented entities (2)

Q-Chunk-Former · no independent evidence
purpose: Geometrically grounded Transformer critic to resolve fine-grained geometric ambiguities among action chunks.
Introduced as a new component in the abstract.
Explicit Geometric Regularization (EGR) · no independent evidence
purpose: Shapes a discriminative value landscape to preserve action ranking resolution and mitigate value instability under scarce supervision.
Proposed as an additional technique in the abstract.

pith-pipeline@v0.9.0 · 5778 in / 1365 out tokens · 39878 ms · 2026-05-25T06:57:33.677594+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Explicit Geometric Regularization (EGR) ... shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Autonomous Drift Learning in Data Streams: A Unified Perspective
cs.LG · 2026-05 · unverdicted · novelty 7.0

A survey proposes a novel 3D taxonomy classifying drifts into time stream, data stream, and model stream categories to unify research on non-stationary autonomous learning.

[48]

near- miss

A Related Work A.1 Vision-Language-Action Models. The intersection of computer vision and robotic control has been advanced by Vision-Language-Action (VLA) models, which endow high-capacity Vision-Language Models (VLMs) with actuation capabilities to map multimodal inputs (visual observa- tions and natural language instructions) to executable robot action...

work page 2023
[49]

show significant structural differences. This indicates that the VGAS does not merely memorize static geometric relations but adaptively adjusts its estimation according to evolving real-world dynamics, providing state-aware guidance throughout the entire task horizon. 0.14 0.16 0.18 0.20 0.22 0.24 0.26T op1 Hit Rate (candidates only) Libero_Goal 0.100 0....

work page 2000
[50]

(ii) Training protocol.We first perform supervised fine-tuning (SFT) of the VLA model using 5-shot expert demonstrations per task, randomly sampled from the LIBERO dataset

Additionally, we use shifted rewards{−1,1}instead of{0,1}, which we found to yield more stable learning in practice. (ii) Training protocol.We first perform supervised fine-tuning (SFT) of the VLA model using 5-shot expert demonstrations per task, randomly sampled from the LIBERO dataset. We then train a critic using different variants of offline RL (ORL)...

work page 2025
[51]

near-miss

Our Q-Chunk- Former is initialized from the first two layers of the SmolVLM backbone. We directly reuse the multimodal features extracted by the frozen SmolVLM (i.e., the output of the SmolVLA encoder) as the vision–language input to Q-Chunk-Former. In our notation, theQ-chunk lengthhdenotes the length of an action chunk, whileN-action-stepindicates that ...

work page 2014

[1] [Bacchiocchi et al., 2024] Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Papini, Alberto Maria Metelli, Nicola Gatti, et al. Online learning with off-policy feedback in adversarial MDPs. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), pages 3697–3705, 2024.

[2] [Black et al., 2024] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

[3] [Brohan et al., 2022] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

[4] [Chebotar et al., 2023] Yevgen Chebotar, Quan Vuong, Karol Hausman, Fei Xia, Yao Lu, Alex Irpan, Aviral Kumar, Tianhe Yu, Alexander Herzog, Karl Pertsch, et al. Q-Transformer: Scalable offline reinforcement learning via autoregressive Q-functions. In Conference on Robot Learning, pages 3909–3928. PMLR, 2023.

[5] [Chen et al., 2025] Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. ConRFT: A reinforced fine-tuning method for VLA models via consistency policy. arXiv preprint arXiv:2502.05450, 2025.

[6] [Dass et al., 2022] Shivin Dass, Karl Pertsch, Hejia Zhang, Youngwoon Lee, Joseph J Lim, and Stefanos Nikolaidis. PATO: Policy assisted teleoperation for scalable robot data collection. arXiv preprint arXiv:2212.04708, 2022.

[7] [Duan et al., 2025] Wei Duan, Jie Lu, En Yu, and Junyu Xuan. Bandwidth-constrained variational message encoding for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:2512.11179, 2025.

[8] [Fu et al., 2020] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

[9] [Fujimoto et al., 2019] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062. PMLR, 2019.

[10] [Ghasemipour et al., 2021] Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. EMaQ: Expected-max Q-learning operator for simple yet effective offline and online RL. In International Conference on Machine Learning, pages 3682–3691. PMLR, 2021.

[11] [Guo et al., 2025] Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025.

[12] [Hendrycks, 2016] D Hendrycks. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

[13] [Huang et al., 2025] Dongchi Huang, Zhirui Fang, Tianle Zhang, Yihang Li, Lin Zhao, and Chunhe Xia. CO-RFT: Efficient fine-tuning of vision-language-action models through chunked offline reinforcement learning. arXiv preprint arXiv:2508.02219, 2025.

[14] [Intelligence et al., 2025] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.

[15] [Janner et al., 2022] Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.

[16] [Kim et al., 2024] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

[17] [Kim et al., 2025] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.

[18] [Kingma, 2014] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[19] [Kong et al., 2024] Rui Kong, Chenyang Wu, Chen-Xiao Gao, Zongzhang Zhang, and Ming Li. Efficient and stable offline-to-online reinforcement learning via continual policy revitalization. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, pages 4317–4325, 2024.

[20] [Kostrikov et al., 2021] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021.

[21] [Kumar et al., 2020] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

[22] [Kumar et al., 2022] Aviral Kumar, Anikait Singh, Frederik Ebert, Mitsuhiko Nakamoto, Yanlai Yang, Chelsea Finn, and Sergey Levine. Pre-training for robots: Offline RL enables learning new tasks from a handful of trials. arXiv preprint arXiv:2210.05178, 2022.

[23] [Levine et al., 2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

[24] [Li et al., 2025a] Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, et al. SimpleVLA-RL: Scaling VLA training via reinforcement learning. arXiv preprint arXiv:2509.09674, 2025.

[25] [Li et al., 2025b] Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement learning with action chunking. arXiv preprint arXiv:2507.07969, 2025.

[26] [Liu et al., 2023] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.

[27] [Liu et al., 2025] Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, and Yu Wang. What can RL bring to VLA generalization? An empirical study. arXiv preprint arXiv:2505.19789, 2025.

[28] [Lu et al., 2022] Cong Lu, Philip J Ball, Tim GJ Rudner, Jack Parker-Holder, Michael A Osborne, and Yee Whye Teh. Challenges and opportunities in offline reinforcement learning from visual observations. arXiv preprint arXiv:2206.04779, 2022.

[29] [Luo et al., ] Kairong Luo, Caiwei Xiao, Zhiao Huang, Zhan Ling, Yunhao Fang, and Hao Su. Dreamfuser: Value-guided diffusion policy for offline reinforcement learning.

[Lyu et al., 2022] Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu. Mildly conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:...

[30] [Marafioti et al., 2025] Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299, 2025.

[31] [Mark et al., 2024] Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar. Policy agnostic RL: Offline RL and online RL fine-tuning of any class and backbone. arXiv preprint arXiv:2412.06685, 2024.

[32] [Nakamoto et al., 2025] Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. In Conference on Robot Learning, pages 4996–5013. PMLR, 2025.

[33] [Sapkota et al., 2025] Ranjan Sapkota, Yang Cao, Konstantinos I Roumeliotis, and Manoj Karkee. Vision-language-action models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769, 2025.

[34] [Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[35] [Shao et al., 2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[36] [Shin and Kim, 2023] Wonchul Shin and Yusung Kim. Guide to control: Offline hierarchical reinforcement learning using subgoal generation for long-horizon and sparse-reward tasks. In IJCAI, pages 4217–4225, 2023.

[37] [Shukor et al., 2025] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. SmolVLA: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025.

[38] [Song et al., 2025] Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, et al. Hume: Introducing System-2 thinking in visual-language-action model. arXiv preprint arXiv:2505.21432, 2025.

[39] [Sutton et al., 1998] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. 1998.

[40] [Sutton et al., 1999] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

[41] [Tan et al., ] Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training for vision-language-action models. arXiv preprint arXiv:2505.17016, 2025.

[Team et al., 2024] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-sourc...

[42] [Xin et al., 2024] Jimmy Xin, Linus Zheng, Kia Rahmani, Jiayi Wei, Jarrett Holtz, Isil Dillig, and Joydeep Biswas. Programmatic imitation learning from unlabeled and noisy demonstrations. IEEE Robotics and Automation Letters, 9(6):4894–4901, 2024.

[43] [Yu et al., 2025] En Yu, Jie Lu, Xiaoyu Yang, Guangquan Zhang, and Zhen Fang. Learning robust spectral dynamics for temporal domain generalization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[44] [Zhang and Tan, 2023] Zhe Zhang and Xiaoyang Tan. Adaptive reward shifting based on behavior proximity for offline reinforcement learning. In IJCAI, pages 4620–4628, 2023.

[45] [Zhang et al., 2025] Dapeng Zhang, Jing Sun, Chenghui Hu, Xiaoyan Wu, Zhenlong Yuan, Rui Zhou, Fei Shen, and Qingguo Zhou. Pure vision language action (VLA) models: A comprehensive survey. arXiv preprint arXiv:2509.19012, 2025.

[46] [Zhao et al., 2023] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost arms. In Robotics: Science and Systems (RSS), 2023.

[47] [Zitkovich et al., 2023] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

A Related Work

A.1 Vision-Language-Action Models. The intersection of computer vision and robotic control has been advanced by Vision-Language-Action (VLA) models, which endow high-capacity Vision-Language Models (VLMs) with actuation capabilities to map multimodal inputs (visual observations and natural language instructions) to executable robot action...

show significant structural differences. This indicates that VGAS does not merely memorize static geometric relations but adaptively adjusts its estimation according to evolving real-world dynamics, providing state-aware guidance throughout the entire task horizon.

[Figure: Top-1 hit rate (candidates only); panel: Libero_Goal. Axis tick values omitted.]

Additionally, we use shifted rewards {−1, 1} instead of {0, 1}, which we found to yield more stable learning in practice.

(ii) Training protocol. We first perform supervised fine-tuning (SFT) of the VLA model using 5-shot expert demonstrations per task, randomly sampled from the LIBERO dataset. We then train a critic using different variants of offline RL (ORL)...

Our Q-Chunk-Former is initialized from the first two layers of the SmolVLM backbone. We directly reuse the multimodal features extracted by the frozen SmolVLM (i.e., the output of the SmolVLA encoder) as the vision–language input to Q-Chunk-Former. In our notation, the Q-chunk length h denotes the length of an action chunk, while N-action-step indicates that ...