Flatness Preserves Instruction Following in Vision-Language-Action Models

Haochen Zhang; Yonatan Bisk

arxiv: 2606.23641 · v1 · pith:JUPQSK2Fnew · submitted 2026-06-22 · 💻 cs.RO

Flatness Preserves Instruction Following in Vision-Language-Action Models

Haochen Zhang , Yonatan Bisk This is my paper

Pith reviewed 2026-06-26 08:04 UTC · model grok-4.3

classification 💻 cs.RO

keywords vision-language-action modelsinstruction followingsharpness-aware minimizationflat minimarobot learningfinetuninginstruction blindnessloss landscape

0 comments

The pith

Sharpness-aware minimization during VLA finetuning improves instruction following by over 60% on the same data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models lose their ability to follow language instructions after standard finetuning on limited robot data, defaulting to visual shortcuts instead. The paper claims this occurs because gradient updates on sparse points produce sharp, high-curvature minima in the loss landscape. Applying sharpness-aware minimization seeks flatter minima during the identical finetuning process, making the model robust to weight perturbations and preserving pretrained language grounding. This yields large gains across simulation and real-world benchmarks without new data, model changes, or retraining, and it works alongside other techniques.

Core claim

The authors claim that instruction blindness arises from high-curvature minima induced by standard gradient descent on sparse robot data, and that flatness-preserving optimization via sharpness-aware minimization during finetuning on the exact same data restores robust instruction following, yielding over 60% improvement without additional resources.

What carries the argument

sharpness-aware minimization applied during VLA finetuning, which perturbs parameters to locate flatter regions of the loss landscape and thereby maintains language grounding.

If this is right

The identical limited robot data produces policies that respect language instructions rather than visual shortcuts.
Gains appear consistently across multiple simulation environments and real-world robot tasks.
The flatness method combines with existing guidance techniques for additive benefits.
Selective application of sharpness during training allows quantification of its contribution to language preservation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same optimization change might reduce shortcut learning when finetuning other multimodal models on scarce downstream data.
Measuring loss curvature before and after finetuning could become a practical check for whether language grounding survived the update.
If flat regions better retain pretrained features, applying the method earlier in pretraining stages could be tested as a preventive step.

Load-bearing premise

That instruction blindness is caused specifically by sharp minima from standard optimization on sparse data rather than by data quality, model capacity, or other factors.

What would settle it

Running the same finetuning experiments and finding that sharpness-aware minimization produces no gain in instruction following, or that models with measurably flatter minima still ignore language instructions at the same rate.

Figures

Figures reproduced from arXiv: 2606.23641 by Haochen Zhang, Yonatan Bisk.

**Figure 2.** Figure 2: t-SNE plot of object-token representation probe for the [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Real-world experiments on five object layouts and a counterfactual instruction at inference [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Left: heatmaps of the grasp position for different pick-object instructions between the [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Hardware set up of the robot manipulator for real-world experiments [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

read the original abstract

Vision-language-action (VLA) models have the potential for open-world generalization by leveraging pretrained vision-language representations, yet downstream finetuning on limited robot data often degrades these representations, leading to brittle policies that ignore language instructions in favor of visual shortcuts, a failure mode we term instruction blindness. We hypothesize that standard finetuning with limited data applies gradients to a sparse set of points, which manifests as a sharp loss landscape with high-curvature minima. We propose to address this directly through flatness-preserving optimization while finetuning on the exact same data, where learning a flatter landscape results in a model more robust to perturbations in the weight space. Specifically, we demonstrate that simply applying sharpness-aware minimization during VLA finetuning significantly improves instruction following by over 60% across multiple simulation and real-world benchmarks without additional data, architectural modification, or retraining. We further analyze the effect of selective sharpness, quantify its effects, and show that our approach is complementary to existing guidance techniques. Project page can be found at https://haochenz11.github.io/papers/flatness-vla/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SAM during VLA finetuning delivers large reported gains on instruction following, but the link to flatter minima rests on indirect evidence.

read the letter

The key takeaway is that this paper reports applying sharpness-aware minimization during finetuning of vision-language-action models improves instruction following by over 60% on simulation and real-world benchmarks, using the same data and no architecture changes. The hypothesis is that standard gradient steps on sparse robot data create sharp minima that cause the model to drop language grounding in favor of visual shortcuts.

What stands out as new is the targeted use of SAM on VLA finetuning plus the follow-up analysis of selective sharpness and its complementarity with existing guidance methods. The empirical result, if the controls and numbers hold, is the main contribution and could be practically relevant for deployment.

The soft spot is exactly the one in the stress-test note. The abstract describes the performance lift and some selective sharpness quantification, but it gives no indication that curvature was measured directly on the trained checkpoints, such as Hessian trace or maximum eigenvalue for the standard versus SAM models. Without that, the gain could trace to other properties of the two-step optimizer rather than flatness per se. The abstract also omits experimental details, baselines, and statistical tests, which makes it impossible to judge how solid the comparison is.

This work is aimed at researchers who finetune VLAs for robotics and care about robustness on limited data. A reader in that area could extract value from the optimizer trick and the selective sharpness observations. It deserves a serious referee because the claim is testable and the potential upside is clear, even though the mechanistic part needs tighter evidence.

Referee Report

2 major / 1 minor

Summary. The paper claims that finetuning vision-language-action (VLA) models on limited robot data produces sharp, high-curvature minima that cause 'instruction blindness' (ignoring language in favor of visual shortcuts); it proposes that applying sharpness-aware minimization (SAM) during the same finetuning finds flatter minima that preserve language grounding, yielding >60% gains in instruction following on simulation and real-world benchmarks without extra data, architecture changes, or retraining. It further analyzes selective sharpness and complementarity with guidance methods.

Significance. If the reported gains are robust and the flatness mechanism is substantiated, the result would offer a lightweight, optimizer-only intervention that improves VLA robustness on sparse data, with potential impact on reliable language-conditioned robot policies.

major comments (2)

[Abstract / analysis of selective sharpness] Abstract and analysis sections: the central mechanistic claim—that SAM improves performance specifically by producing flatter minima that preserve instruction following—is not supported by any reported curvature measurements (Hessian trace, maximum eigenvalue, or SAM-style sharpness) on the final standard-gradient vs. SAM checkpoints. Without these, the observed accuracy gains could stem from SAM's two-step dynamics, implicit regularization, or other optimizer effects rather than flatness.
[Experimental evaluation] Experimental sections: the abstract states a >60% improvement 'across multiple simulation and real-world benchmarks' but supplies no details on baselines, number of runs, statistical tests, ablation controls (e.g., other regularizers), or exact task definitions, preventing verification that the data support the stated claim.

minor comments (1)

[Abstract] The term 'selective sharpness' is introduced in the abstract but its precise definition and quantification method are not stated in the provided summary, which could be clarified for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: Abstract / analysis of selective sharpness] Abstract and analysis sections: the central mechanistic claim—that SAM improves performance specifically by producing flatter minima that preserve instruction following—is not supported by any reported curvature measurements (Hessian trace, maximum eigenvalue, or SAM-style sharpness) on the final standard-gradient vs. SAM checkpoints. Without these, the observed accuracy gains could stem from SAM's two-step dynamics, implicit regularization, or other optimizer effects rather than flatness.

Authors: We agree that direct curvature measurements would provide stronger substantiation for the flatness mechanism. The manuscript currently supports the claim via performance improvements and the established properties of SAM, but does not report Hessian trace, maximum eigenvalue, or equivalent sharpness metrics on the final checkpoints. In the revision we will add these measurements comparing standard-gradient and SAM checkpoints to directly test the flatness hypothesis. revision: yes
Referee: Experimental sections: the abstract states a >60% improvement 'across multiple simulation and real-world benchmarks' but supplies no details on baselines, number of runs, statistical tests, ablation controls (e.g., other regularizers), or exact task definitions, preventing verification that the data support the stated claim.

Authors: The experimental section reports comparisons against standard finetuning on defined simulation (LIBERO, MetaWorld) and real-world tasks, includes selective-sharpness ablations, and states the >60% aggregate gain. To improve verifiability we will expand the text with explicit reporting of random seeds, statistical tests, and additional ablation controls against other regularizers. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claim with independent validation

full rationale

The paper advances a hypothesis that standard finetuning produces sharp minima leading to instruction blindness, then reports an empirical result from applying SAM during finetuning on the same data, with >60% gains on benchmarks. No derivation chain, fitted parameter renamed as prediction, or self-citation load-bearing step is present in the provided text. The central claim rests on direct experimental comparison rather than any reduction to inputs by construction or imported uniqueness. This is the normal case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the domain assumption from optimization literature that flatter minima generalize better; no new free parameters, axioms, or invented entities are introduced in the abstract.

axioms (1)

domain assumption Flatter minima in the loss landscape improve robustness to perturbations and preserve pretrained capabilities during fine-tuning.
Invoked to motivate the use of sharpness-aware minimization.

pith-pipeline@v0.9.1-grok · 5719 in / 1115 out tokens · 25179 ms · 2026-06-26T08:04:51.905399+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 15 linked inside Pith

[1]

Barreiros, A

J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.Science Robotics, 11(113):eaea6201, 2026

2026
[2]

Goyal, H

A. Goyal, H. Hadfield, X. Yang, V . Blukis, and F. Ramos. Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

arXiv 2025
[3]

Zhang, X

J. Zhang, X. Chen, Q. Wang, M. Li, Y . Guo, Y . Hu, J. Zhang, S. Bai, J. Lin, and J. Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models.arXiv preprint arXiv:2601.03309, 2026

Pith/arXiv arXiv 2026
[4]

Grover, A

S. Grover, A. Gopalkrishnan, B. Ai, H. I. Christensen, H. Su, and X. Li. Enhancing generaliza- tion in vision-language-action models by preserving pretrained representations.arXiv preprint arXiv:2509.11417, 2025

arXiv 2025
[5]

J. Guo, Z. Wu, C. Tu, Y . Ma, X. Kong, Z. Liu, J. Ji, S. Zhang, Y . Chen, K. Chen, et al. On robustness of vision-language-action model against multi-modal perturbations.arXiv preprint arXiv:2510.00037, 2025

arXiv 2025
[6]

S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025

Pith/arXiv arXiv 2025
[7]

G. Wang, C. Zhang, Q. Liu, J. Zhang, J. Cai, J. Liu, and X. Liu. Libero-x: Robustness litmus for vision-language-action models.arXiv preprint arXiv:2602.06556, 2026

arXiv 2026
[8]

X. Zhou, Y . Xu, G. Tie, Y . Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. Libero-pro: To- wards robust and fair evaluation of vision-language-action models beyond memorization.arXiv preprint arXiv:2510.03827, 2025

Pith/arXiv arXiv 2025
[9]

Y . Fang, Y . Feng, D. Jing, J. Liu, Y . Yang, Z. Wei, D. Szafir, and M. Ding. When vision overrides language: Evaluating and mitigating counterfactual failures in vlas.arXiv preprint arXiv:2602.17659, 2026

arXiv 2026
[10]

K. Xu, Z. Zhu, A. Chen, S. Zhao, Q. Huang, Y . Yang, H. Lu, R. Xiong, M. Tomizuka, and Y . Wang. Seeing to act, prompting to specify: A bayesian factorization of vision language action policy.arXiv preprint arXiv:2512.11218, 2025

arXiv 2025
[11]

Huang, J

S. Huang, J. Shao, K. Wang, Q. Chen, J. Sun, Y . Guo, M. Schwager, and J. Bohg. Breaking lock-in: Preserving steerability under low-data vla post-training.arXiv preprint arXiv:2604.23121, 2026

Pith/arXiv arXiv 2026
[12]

Z. Zhan, Y . Chen, J. Zhou, Q. Lv, H. Liu, K. Wang, L. Lin, and G. Wang. Stable language guidance for vision-language-action models.arXiv preprint arXiv:2601.04052, 2026

Pith/arXiv arXiv 2026
[13]

S. Lian, B. Yu, X. Lin, L. T. Yang, Z. Shen, C. Wu, Y . Miao, C. Huang, and K. Chen. Bayesian- vla: Bayesian decomposition of vision language action models via latent action queries.arXiv preprint arXiv:2601.15197, 2026. 9

Pith/arXiv arXiv 2026
[14]

S. Yang, H. Li, B. Wang, Y . Chen, Y . Tian, T. Wang, H. Wang, F. Zhao, Y . Liao, and J. Pang. Instructvla: Vision-language-action instruction tuning from understanding to manipulation. arXiv preprint arXiv:2507.17520, 2025

arXiv 2025
[15]

A. J. Hancock, X. Wu, L. Zha, O. Russakovsky, and A. Majumdar. Actions as language: Fine-tuning vlms into vlas without catastrophic forgetting.arXiv preprint arXiv:2509.22195, 2025

arXiv 2025
[16]

Zhang, B

N. Zhang, B. Zhu, S. Zhou, and J. Chen. Restoring linguistic grounding in vla models via train-free attention recalibration.arXiv preprint arXiv:2603.06001, 2026

arXiv 2026
[17]

Foret, A

P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur. Sharpness-aware minimization for effi- ciently improving generalization.arXiv preprint arXiv:2010.01412, 2020

Pith/arXiv arXiv 2010
[18]

J. Kwon, J. Kim, H. Park, and I. K. Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. InInternational conference on machine learning, pages 5905–5914. PMLR, 2021

2021
[19]

Sherborne, N

T. Sherborne, N. Saphra, P. Dasigi, and H. Peng. Tram: Bridging trust regions and sharpness aware minimization.arXiv preprint arXiv:2310.03646, 2023

arXiv 2023
[20]

Y . Liu, S. Mai, X. Chen, C.-J. Hsieh, and Y . You. Towards efficient and scalable sharpness- aware minimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12360–12370, 2022

2022
[21]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

2022
[22]

Driess, J

D. Driess, J. Springenberg, B. Ichter, L. Yu, A. Li-Bell, K. Pertsch, A. Ren, H. Walke, Q. Vuong, L. X. Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867– 102888, 2026

2026
[23]

Yadav, Z

Y . Yadav, Z. Zhou, A. Wagenmaker, K. Pertsch, and S. Levine. Robust finetuning of vision- language-action robot policies via parameter merging.arXiv preprint arXiv:2512.08333, 2025

arXiv 2025
[24]

Glossop, W

C. Glossop, W. Chen, A. Bhorkar, D. Shah, and S. Levine. Cast: Counterfactual labels improve instruction following in vision-language-action models.arXiv preprint arXiv:2508.13446, 2025

Pith/arXiv arXiv 2025
[25]

Hou and L

Y . Hou and L. Zhao. Langgap: Diagnosing and closing the language gap in vision-language- action models.arXiv preprint arXiv:2603.00592, 2026

arXiv 2026
[26]

Hochreiter and J

S. Hochreiter and J. Schmidhuber. Flat minima.Neural computation, 9(1):1–42, 1997

1997
[27]

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch train- ing for deep learning: Generalization gap and sharp minima.arXiv preprint arXiv:1609.04836, 2016

Pith/arXiv arXiv 2016
[28]

L. Dinh, R. Pascanu, S. Bengio, and Y . Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028. PMLR, 2017

2017
[29]

Jiang, B

Y . Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio. Fantastic generalization measures and where to find them.arXiv preprint arXiv:1912.02178, 2019

arXiv 1912
[30]

Bahri, H

D. Bahri, H. Mobahi, and Y . Tay. Sharpness-aware minimization improves language model generalization. 2022

2022
[31]

Y . Yang, Z. Zhang, R. V . Swaminathan, J. Liu, N. Susanj, and Z. Zhang. SharpZO: Hybrid sharpness-aware vision language model prompt tuning via forward-only passes. Oct. 2025. 10

2025
[32]

T. Li, P. Zhou, Z. He, X. Cheng, and X. Huang. Friendly sharpness-aware minimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5631–5640, 2024

2024
[33]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

Pith/arXiv arXiv 2025
[34]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, et al.π0.5: a vision-language-action model with open-world general- ization. In9th Annual Conference on Robot Learning, 2025

2025
[35]

Karamcheti, S

S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. InForty-first Interna- tional Conference on Machine Learning, 2024

2024
[36]

Steiner, A

A. Steiner, A. S. Pinto, M. Tschannen, D. Keysers, X. Wang, Y . Bitton, A. Gritsenko, M. Min- derer, A. Sherbondy, S. Long, et al. Paligemma 2: A family of versatile vlms for transfer.arXiv preprint arXiv:2412.03555, 2024

Pith/arXiv arXiv 2024
[37]

Ho and T

J. Ho and T. Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

Pith/arXiv arXiv 2022
[38]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017
[39]

Kachaev, M

N. Kachaev, M. Kolosov, D. Zelezetsky, A. K. Kovalev, and A. I. Panov. Don’t blind your vla: Aligning visual representations for ood generalization.arXiv preprint arXiv:2510.25616, 2025

arXiv 2025
[40]

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar, and C. L. Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

2014
[41]

Izmailov, D

P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson. Averaging weights leads to wider optima and better generalization.arXiv preprint arXiv:1803.05407, 2018

Pith/arXiv arXiv 2018
[42]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024. 11 A Benchmark Details We utilize three counterfactual benchmarks to evaluate instruction following, all based on the LIBE...

Pith/arXiv arXiv 2024
[43]

pick up the and put it in the basket

The model was finetuned on 4 x NVIDIA A100 Tensor Core GPUs with 80GB memory. Table 4: Hyperparameters for finetuningπ 0.5 with SAM Hyperparameter Value Learning Rate5×10 −5 Batch Size 16 ρ(Neighborhood Size) 0.075 Weight Decay 0.1 Base Optimizer AdamW Action Horizon 10 EMA Decay 0.999 Train Steps 30,000 13 π0.5 SAM+CFG.To test the combination of using an...

[1] [1]

Barreiros, A

J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.Science Robotics, 11(113):eaea6201, 2026

2026

[2] [2]

Goyal, H

A. Goyal, H. Hadfield, X. Yang, V . Blukis, and F. Ramos. Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

arXiv 2025

[3] [3]

Zhang, X

J. Zhang, X. Chen, Q. Wang, M. Li, Y . Guo, Y . Hu, J. Zhang, S. Bai, J. Lin, and J. Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models.arXiv preprint arXiv:2601.03309, 2026

Pith/arXiv arXiv 2026

[4] [4]

Grover, A

S. Grover, A. Gopalkrishnan, B. Ai, H. I. Christensen, H. Su, and X. Li. Enhancing generaliza- tion in vision-language-action models by preserving pretrained representations.arXiv preprint arXiv:2509.11417, 2025

arXiv 2025

[5] [5]

J. Guo, Z. Wu, C. Tu, Y . Ma, X. Kong, Z. Liu, J. Ji, S. Zhang, Y . Chen, K. Chen, et al. On robustness of vision-language-action model against multi-modal perturbations.arXiv preprint arXiv:2510.00037, 2025

arXiv 2025

[6] [6]

S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025

Pith/arXiv arXiv 2025

[7] [7]

G. Wang, C. Zhang, Q. Liu, J. Zhang, J. Cai, J. Liu, and X. Liu. Libero-x: Robustness litmus for vision-language-action models.arXiv preprint arXiv:2602.06556, 2026

arXiv 2026

[8] [8]

X. Zhou, Y . Xu, G. Tie, Y . Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. Libero-pro: To- wards robust and fair evaluation of vision-language-action models beyond memorization.arXiv preprint arXiv:2510.03827, 2025

Pith/arXiv arXiv 2025

[9] [9]

Y . Fang, Y . Feng, D. Jing, J. Liu, Y . Yang, Z. Wei, D. Szafir, and M. Ding. When vision overrides language: Evaluating and mitigating counterfactual failures in vlas.arXiv preprint arXiv:2602.17659, 2026

arXiv 2026

[10] [10]

K. Xu, Z. Zhu, A. Chen, S. Zhao, Q. Huang, Y . Yang, H. Lu, R. Xiong, M. Tomizuka, and Y . Wang. Seeing to act, prompting to specify: A bayesian factorization of vision language action policy.arXiv preprint arXiv:2512.11218, 2025

arXiv 2025

[11] [11]

Huang, J

S. Huang, J. Shao, K. Wang, Q. Chen, J. Sun, Y . Guo, M. Schwager, and J. Bohg. Breaking lock-in: Preserving steerability under low-data vla post-training.arXiv preprint arXiv:2604.23121, 2026

Pith/arXiv arXiv 2026

[12] [12]

Z. Zhan, Y . Chen, J. Zhou, Q. Lv, H. Liu, K. Wang, L. Lin, and G. Wang. Stable language guidance for vision-language-action models.arXiv preprint arXiv:2601.04052, 2026

Pith/arXiv arXiv 2026

[13] [13]

S. Lian, B. Yu, X. Lin, L. T. Yang, Z. Shen, C. Wu, Y . Miao, C. Huang, and K. Chen. Bayesian- vla: Bayesian decomposition of vision language action models via latent action queries.arXiv preprint arXiv:2601.15197, 2026. 9

Pith/arXiv arXiv 2026

[14] [14]

S. Yang, H. Li, B. Wang, Y . Chen, Y . Tian, T. Wang, H. Wang, F. Zhao, Y . Liao, and J. Pang. Instructvla: Vision-language-action instruction tuning from understanding to manipulation. arXiv preprint arXiv:2507.17520, 2025

arXiv 2025

[15] [15]

A. J. Hancock, X. Wu, L. Zha, O. Russakovsky, and A. Majumdar. Actions as language: Fine-tuning vlms into vlas without catastrophic forgetting.arXiv preprint arXiv:2509.22195, 2025

arXiv 2025

[16] [16]

Zhang, B

N. Zhang, B. Zhu, S. Zhou, and J. Chen. Restoring linguistic grounding in vla models via train-free attention recalibration.arXiv preprint arXiv:2603.06001, 2026

arXiv 2026

[17] [17]

Foret, A

P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur. Sharpness-aware minimization for effi- ciently improving generalization.arXiv preprint arXiv:2010.01412, 2020

Pith/arXiv arXiv 2010

[18] [18]

J. Kwon, J. Kim, H. Park, and I. K. Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. InInternational conference on machine learning, pages 5905–5914. PMLR, 2021

2021

[19] [19]

Sherborne, N

T. Sherborne, N. Saphra, P. Dasigi, and H. Peng. Tram: Bridging trust regions and sharpness aware minimization.arXiv preprint arXiv:2310.03646, 2023

arXiv 2023

[20] [20]

Y . Liu, S. Mai, X. Chen, C.-J. Hsieh, and Y . You. Towards efficient and scalable sharpness- aware minimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12360–12370, 2022

2022

[21] [21]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

2022

[22] [22]

Driess, J

D. Driess, J. Springenberg, B. Ichter, L. Yu, A. Li-Bell, K. Pertsch, A. Ren, H. Walke, Q. Vuong, L. X. Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867– 102888, 2026

2026

[23] [23]

Yadav, Z

Y . Yadav, Z. Zhou, A. Wagenmaker, K. Pertsch, and S. Levine. Robust finetuning of vision- language-action robot policies via parameter merging.arXiv preprint arXiv:2512.08333, 2025

arXiv 2025

[24] [24]

Glossop, W

C. Glossop, W. Chen, A. Bhorkar, D. Shah, and S. Levine. Cast: Counterfactual labels improve instruction following in vision-language-action models.arXiv preprint arXiv:2508.13446, 2025

Pith/arXiv arXiv 2025

[25] [25]

Hou and L

Y . Hou and L. Zhao. Langgap: Diagnosing and closing the language gap in vision-language- action models.arXiv preprint arXiv:2603.00592, 2026

arXiv 2026

[26] [26]

Hochreiter and J

S. Hochreiter and J. Schmidhuber. Flat minima.Neural computation, 9(1):1–42, 1997

1997

[27] [27]

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch train- ing for deep learning: Generalization gap and sharp minima.arXiv preprint arXiv:1609.04836, 2016

Pith/arXiv arXiv 2016

[28] [28]

L. Dinh, R. Pascanu, S. Bengio, and Y . Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028. PMLR, 2017

2017

[29] [29]

Jiang, B

Y . Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio. Fantastic generalization measures and where to find them.arXiv preprint arXiv:1912.02178, 2019

arXiv 1912

[30] [30]

Bahri, H

D. Bahri, H. Mobahi, and Y . Tay. Sharpness-aware minimization improves language model generalization. 2022

2022

[31] [31]

Y . Yang, Z. Zhang, R. V . Swaminathan, J. Liu, N. Susanj, and Z. Zhang. SharpZO: Hybrid sharpness-aware vision language model prompt tuning via forward-only passes. Oct. 2025. 10

2025

[32] [32]

T. Li, P. Zhou, Z. He, X. Cheng, and X. Huang. Friendly sharpness-aware minimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5631–5640, 2024

2024

[33] [33]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

Pith/arXiv arXiv 2025

[34] [34]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, et al.π0.5: a vision-language-action model with open-world general- ization. In9th Annual Conference on Robot Learning, 2025

2025

[35] [35]

Karamcheti, S

S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. InForty-first Interna- tional Conference on Machine Learning, 2024

2024

[36] [36]

Steiner, A

A. Steiner, A. S. Pinto, M. Tschannen, D. Keysers, X. Wang, Y . Bitton, A. Gritsenko, M. Min- derer, A. Sherbondy, S. Long, et al. Paligemma 2: A family of versatile vlms for transfer.arXiv preprint arXiv:2412.03555, 2024

Pith/arXiv arXiv 2024

[37] [37]

Ho and T

J. Ho and T. Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

Pith/arXiv arXiv 2022

[38] [38]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017

[39] [39]

Kachaev, M

N. Kachaev, M. Kolosov, D. Zelezetsky, A. K. Kovalev, and A. I. Panov. Don’t blind your vla: Aligning visual representations for ood generalization.arXiv preprint arXiv:2510.25616, 2025

arXiv 2025

[40] [40]

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar, and C. L. Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

2014

[41] [41]

Izmailov, D

P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson. Averaging weights leads to wider optima and better generalization.arXiv preprint arXiv:1803.05407, 2018

Pith/arXiv arXiv 2018

[42] [42]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024. 11 A Benchmark Details We utilize three counterfactual benchmarks to evaluate instruction following, all based on the LIBE...

Pith/arXiv arXiv 2024

[43] [43]

pick up the and put it in the basket

The model was finetuned on 4 x NVIDIA A100 Tensor Core GPUs with 80GB memory. Table 4: Hyperparameters for finetuningπ 0.5 with SAM Hyperparameter Value Learning Rate5×10 −5 Batch Size 16 ρ(Neighborhood Size) 0.075 Weight Decay 0.1 Base Optimizer AdamW Action Horizon 10 EMA Decay 0.999 Train Steps 30,000 13 π0.5 SAM+CFG.To test the combination of using an...