Same Weights, Different Robot: A Deployment Safety View of VLA Policies
Pith reviewed 2026-06-28 09:28 UTC · model grok-4.3
The pith
Identical VLA checkpoints can be executable-inequivalent due to action metadata differences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalize the gap as an executable policy specification problem: a VLA policy includes the learned model, action representation, metadata-selected unnormalizer, and controller-facing conventions. Under this view, identical checkpoints can be executable-inequivalent. For quantile-style action normalization, we derive a closed-form metadata mismatch transform and an ExecSpec certificate that measures action-space semantic drift without model inference or rollout.
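As a sketch of what such a closed-form transform can look like, assume the common quantile-style convention (the one used by OpenVLA-style unnormalizers) in which a normalized output in [-1, 1] is mapped onto the interval between a low and a high quantile stored under a metadata key. The symbols below ($\hat a_i$, $q^{\mathrm{lo}}$, $q^{\mathrm{hi}}$, $k$, $k'$, $T$) are assumptions made for illustration; the paper's own notation and exact statement are not reproduced in this summary.

```latex
% Assumed unnormalizer for metadata key k, action dimension i:
u_k(\hat a)_i = q^{\mathrm{lo}}_{k,i} + \frac{\hat a_i + 1}{2}\,\bigl(q^{\mathrm{hi}}_{k,i} - q^{\mathrm{lo}}_{k,i}\bigr),
\qquad \hat a_i \in [-1, 1].

% Intended action a = u_k(\hat a); executed action a' = u_{k'}(\hat a).
% Solving the first relation for \hat a_i:
\hat a_i = \frac{2\,\bigl(a_i - q^{\mathrm{lo}}_{k,i}\bigr)}{q^{\mathrm{hi}}_{k,i} - q^{\mathrm{lo}}_{k,i}} - 1 .

% Substituting into u_{k'} gives a per-dimension affine mismatch transform:
a'_i = T_{k \to k'}(a)_i
     = q^{\mathrm{lo}}_{k',i}
     + \frac{q^{\mathrm{hi}}_{k',i} - q^{\mathrm{lo}}_{k',i}}{q^{\mathrm{hi}}_{k,i} - q^{\mathrm{lo}}_{k,i}}\,
       \bigl(a_i - q^{\mathrm{lo}}_{k,i}\bigr).
```

Under this assumed convention, $T_{k \to k'}$ is the identity in dimension $i$ exactly when both quantiles agree across the two keys. It depends only on the stored statistics, which is why a drift measure built on it needs neither model inference nor rollout.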
What carries the argument
The ExecSpec certificate, which measures the action-space semantic drift caused by metadata mismatches in quantile-style action normalization and requires neither model inference nor rollout.
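The following is a minimal sketch of a metadata-only drift check in that spirit. It is not the paper's ExecSpec definition, which this summary does not give. It assumes the quantile-style convention in which a normalized value in [-1, 1] maps linearly onto [q_lo, q_hi], and it uses a hypothetical drift score: the mean absolute difference between intended and deployed actions over a uniform grid of normalized outputs, divided by the intended range width. The `Stats` layout and all function names are illustrative assumptions.

```python
from typing import Dict, List, Sequence

# Hypothetical metadata layout: {"q_lo": [...], "q_hi": [...]}, one entry per action dim.
Stats = Dict[str, Sequence[float]]


def unnormalize(a_hat: float, q_lo: float, q_hi: float) -> float:
    """Assumed quantile-style convention: map a_hat in [-1, 1] onto [q_lo, q_hi]."""
    return q_lo + 0.5 * (a_hat + 1.0) * (q_hi - q_lo)


def drift_per_dim(intended: Stats, deployed: Stats, grid_points: int = 201) -> List[float]:
    """Illustrative drift score per action dimension (not the paper's certificate).

    Averages |deployed action - intended action| over a uniform grid of
    normalized outputs and divides by the intended range width. Only the
    stored statistics are used; no model is queried.
    """
    grid = [-1.0 + 2.0 * i / (grid_points - 1) for i in range(grid_points)]
    scores = []
    for lo, hi, lo2, hi2 in zip(
        intended["q_lo"], intended["q_hi"], deployed["q_lo"], deployed["q_hi"]
    ):
        width = hi - lo
        if width <= 0:
            raise ValueError("intended quantile range must have positive width")
        total = sum(
            abs(unnormalize(x, lo2, hi2) - unnormalize(x, lo, hi)) for x in grid
        )
        scores.append(total / (grid_points * width))
    return scores
```

A score of zero in every dimension means the two metadata entries describe the same unnormalizer under this convention; any nonzero entry flags a dimension whose physical meaning would change if the keys were swapped.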
Load-bearing premise
The replay-based substitution experiments on LIBERO benchmarks accurately indicate a general deployment safety issue.
What would settle it
Finding that different metadata keys produce identical unnormalized action sequences and success rates on the same checkpoint would falsify the claim of executable inequivalence.
Original abstract
Vision-language-action (VLA) policies are often treated as checkpoint-defined objects: if the weights, prompt, and benchmark suite match, the deployment is assumed to be the same policy. Robot execution breaks this assumption because the same normalized model output can become a different physical action after action unnormalization and controller conventions are applied. This creates a deployment-safety gap: safety review can certify the checkpoint while missing the executable robot policy that reaches the controller. We formalize this gap as an executable policy specification problem: a VLA policy includes the learned model, action representation, metadata-selected unnormalizer, and controller-facing conventions. Under this view, identical checkpoints can be executable-inequivalent. For quantile-style action normalization, we derive a closed-form metadata mismatch transform and an ExecSpec certificate that measures action-space semantic drift without model inference or rollout. On LIBERO-Goal replay, substituting a plausible sibling metadata key yields mean drift 0.199 over six non-gripper action dimensions and reduces success from 28/28 to 2/28 under full substitution. On LIBERO-Spatial replay, the same substituted key reduces success from 26/26 to 0/26. The same full-substitution protocol gives 0/28 success for all four Object substitutions and 0/23 or 1/23 success on Long. Identity-key, replay-validity, no-op filtering, raw-vs-correct replay, mask/gripper, synthetic upper-bound, and OpenVLA-style unnormalizer interface checks rule out several simpler explanations. These results do not certify closed-loop or hardware safety. They support a narrower deployment-safety view: action-space metadata is part of the executable policy and should be checked before rollout.
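One way to act on the recommendation that action-space metadata be checked before rollout is to fingerprint the whole executable specification rather than the checkpoint alone. The sketch below is an assumption-laden illustration, not a procedure taken from the paper: the field names (`checkpoint_sha256`, `unnorm_key`, `action_stats`) and both function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List


def spec_fingerprint(spec: Dict[str, Any]) -> str:
    """Hash a canonical JSON encoding of the spec so key order does not matter."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def differing_fields(reviewed: Dict[str, Any], deployed: Dict[str, Any]) -> List[str]:
    """Return the top-level fields whose values differ between two specs."""
    keys = sorted(set(reviewed) | set(deployed))
    return [k for k in keys if reviewed.get(k) != deployed.get(k)]


# Hypothetical specs: same weights, different metadata key.
reviewed_spec = {
    "checkpoint_sha256": "abc",          # placeholder, not a real digest
    "unnorm_key": "suite_a",             # hypothetical key name
    "action_stats": {"q_lo": [0.0], "q_hi": [1.0]},
}
deployed_spec = dict(reviewed_spec, unnorm_key="suite_b")

if spec_fingerprint(reviewed_spec) != spec_fingerprint(deployed_spec):
    print("spec mismatch in:", differing_fields(reviewed_spec, deployed_spec))
```

The design choice is that two deployments count as the same policy only when every controller-facing field matches, so a checkpoint hash that agrees while `unnorm_key` differs is reported as a mismatch.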
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that VLA policies are not fully specified by model weights alone, since action unnormalization metadata and controller conventions determine the executable robot policy. Identical checkpoints can therefore be executable-inequivalent. For quantile-style normalization the authors derive a closed-form metadata mismatch transform and introduce an ExecSpec certificate that quantifies action-space semantic drift without model inference or rollout. Replay substitution experiments on LIBERO-Goal and LIBERO-Spatial report mean drift of 0.199 across six non-gripper dimensions and success-rate drops (28/28 to 2/28; 26/26 to 0/26) under full substitution of a plausible sibling metadata key. Multiple controls (identity-key, replay-validity, no-op filtering, raw-vs-correct replay, mask/gripper, synthetic upper-bound, OpenVLA-style interface) rule out several simpler explanations. The results support treating action metadata as part of the executable specification that should be checked before rollout, and the authors state explicitly that the replay protocol does not certify closed-loop or hardware safety.
Significance. If the central claim holds, the work highlights an under-appreciated deployment-safety consideration for VLA policies: metadata must be included in the policy specification. The closed-form derivation and the model-free ExecSpec certificate are concrete strengths that allow drift detection without inference or rollouts. The manuscript carefully scopes its conclusions to the replay setting, which prevents overgeneralization and directly addresses the bridging-assumption concern raised in the stress-test note. The empirical measurements on standard benchmarks provide direct, falsifiable evidence of the phenomenon.
minor comments (3)
- The abstract lists the controls that rule out simpler explanations but does not indicate where in the manuscript the detailed results of each control appear; a short summary table or dedicated paragraph would improve traceability.
- The ExecSpec certificate is introduced as a model-free measure, yet the abstract provides no equation or pseudocode; including the precise definition (even if only referenced) would aid reproducibility.
- Success rates are reported as raw fractions (28/28, 2/28) without accompanying variance, confidence intervals, or statistical tests; adding these details would strengthen the presentation of the quantitative results.
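To illustrate the kind of uncertainty reporting the last comment asks for, the sketch below computes a Wilson score interval for a success fraction. The choice of the Wilson interval and the 95% level (z = 1.96) are assumptions made here for illustration; neither the manuscript nor the report prescribes a particular method.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z2 / (4.0 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Fractions quoted in the abstract for LIBERO-Goal replay.
for label, k, n in [("correct key", 28, 28), ("substituted key", 2, 28)]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} -> [{lo:.3f}, {hi:.3f}]")
```

With these inputs the intervals come out at roughly 0.88 to 1.00 for 28/28 and 0.02 to 0.23 for 2/28, so they do not overlap; this says nothing about how the 28 episodes were selected, which would still need to be reported.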
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our manuscript, including the recognition of the closed-form metadata mismatch transform, the ExecSpec certificate, and the careful scoping to replay-based evidence. The recommendation of minor revision is noted; we will incorporate any editorial or minor clarifications in the revised version.
Circularity Check
No circularity: closed-form transform follows directly from quantile definition; results are independent empirical measurements
full rationale
The paper's central derivation is an algebraic closed-form transform obtained directly from the standard definition of quantile-style action normalization; this is ordinary mathematical expansion rather than any self-referential loop or fitted input renamed as prediction. The LIBERO replay substitution results are direct empirical observations of success-rate changes under metadata substitution and are not quantities generated by the paper's own equations. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear as load-bearing steps. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Action normalization follows a quantile-style process
invented entities (1)
- ExecSpec certificate: no independent evidence
Reference graph
Works this paper leans on
- [1] Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A Survey of Robot Learning from Demonstration. Robotics and Autonomous Systems, 57(5): 469-483.
- [2] Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Chen, X.; Choromanski, K.; Ding, T.; Driess, D.; Dubey, A.; Finn, C.; Florence, P.; Fu, C.; Gonzalez Arenas, M.; Gopalakrishnan, K.; Han, K.; Hausman, K.; Herzog, A.; Hsu, J.; Ichter, B.; Irpan, A.; Joshi, N.; Julian, R.; Kalashnikov, D.; Kuang, Y.; Leal, I.; Lee, L.; Lee, T.-W. E.; Levine, S.; Lu, Y.; Mi... (arXiv, 2023)
- [3] Brunke, L.; Greeff, M.; Hall, A. W.; Yuan, Z.; Zhou, S.; Panerati, J.; and Schoellig, A. P. 2022. Safe Learning in Robotics: From Learning-Based Control to Safe Reinforcement Learning. Annual Review of Control, Robotics, and Autonomous Systems, 5: 411-444.
- [4] Cadene, R.; Aliberts, S.; Capuano, F.; Aractingi, M.; Zouitine, A.; Kooijmans, P.; Choghari, J.; Russi, M.; Pascal, C.; Palma, S.; Shukor, M.; Moss, J.; Soare, A.; Aubakirova, D.; Lhoest, Q.; Gallouedec, Q.; and Wolf, T. 2026. LeRobot: An Open-Source Library for End-to-End Robot Learning. arXiv:2602.22818.
- [5] Chi, C.; Xu, Z.; Feng, S.; Cousineau, E.; Du, Y.; Burchfiel, B.; Tedrake, R.; and Song, S. 2023. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In Robotics: Science and Systems.
- [6] Chi, C.; Xu, Z.; Pan, C.; Cousineau, E.; Burchfiel, B.; Feng, S.; Tedrake, R.; and Song, S. 2024. Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. arXiv:2402.10329.
- [7] Choi, S.; Lee, Y.; Park, Y.; Kim, C. D.; Krishna, R.; Fox, D.; and Yu, Y. 2026. vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models. arXiv:2603.13966.
- [8] Dasari, S.; Ebert, F.; Tian, S.; Nair, S.; Bucher, B.; Schmeckpeper, K.; Singh, S.; Levine, S.; and Finn, C. 2020. RoboNet: Large-Scale Multi-Robot Learning. In Proceedings of the Conference on Robot Learning, volume 100 of Proceedings of Machine Learning Research, 885-897. PMLR.
- [9] Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J. W.; Wallach, H.; Daumé III, H.; and Crawford, K. 2021. Datasheets for Datasets. Communications of the ACM, 64(12): 86-92.
- [10] Huang, A. S.; Zhang, J.; Tang, S.; and Xiang, Y. 2026. VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models. arXiv:2605.20774.
- [11] Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; and Abbeel, P. 2017. Adversarial Attacks on Neural Network Policies. arXiv:1702.02284.
- [12] Khazatsky, A.; Pertsch, K.; Nair, S.; Balakrishna, A.; Dasari, S.; Karamcheti, S.; Nasiriany, S.; Sreekanth, K.; Fang, K.; Schaal, S.; Finn, C.; and Levine, S. 2024. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv:2403.12945.
- [13] Kim, M. J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Nair, S.; Rafailov, R.; Foster, E.; Lam, G.; Sanketi, P.; Vuong, Q.; Kollar, T.; Burchfiel, B.; Tedrake, R.; Sadigh, D.; Levine, S.; Liang, P.; and Finn, C. 2024. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.
- [14] Li, Q.; Liang, Y.; Wang, Z.; Luo, L.; Chen, X.; et al. 2024. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation. arXiv:2411.19650.
- [15] Liu, B.; Zhu, Y.; Gao, C.; Feng, Y.; Liu, Q.; Zhu, Y.; and Stone, P. 2023. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. In Advances in Neural Information Processing Systems, volume 36.
- [16] Mandlekar, A.; Xu, D.; Wong, J.; Nasiriany, S.; Wang, C.; Kulkarni, R.; Fei-Fei, L.; Savarese, S.; Zhu, Y.; and Martín-Martín, R. 2022. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. In Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, 1678-1690. PMLR.
- [17] Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I. D.; and Gebru, T. 2019. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 220-229.
- [18] Octo Model Team; Ghosh, D.; Walke, H.; Pertsch, K.; Black, K.; Mees, O.; Dasari, S.; Hejna, J.; Kreiman, T.; Xu, C.; Luo, J.; Tan, Y. L.; Chen, L. Y.; Sanketi, P.; Vuong, Q.; Xiao, T.; Sadigh, D.; Finn, C.; and Levine, S. 2024. Octo: An Open-Source Generalist Robot Policy. arXiv:2405.12213.
- [19] Open X-Embodiment Collaboration; O'Neill, A.; Rehman, A.; Gupta, A.; Maddukuri, A.; Gupta, A.; Padalkar, A.; Lee, A.; Pooley, A.; Gupta, A.; Mandlekar, A.; Jain, A.; Tung, A.; Bewley, A.; Herzog, A.; Irpan, A.; Khazatsky, A.; Rai, A.; Gupta, A.; Wang, A.; Kolobov, A.; Singh, A.; Garg, A.; Kembhavi, A.; Xie, A.; Brohan, A.; Finn, C.; Ichter, B.; Levine, S... (arXiv, 2023)
- [20] Physical Intelligence. 2026. OpenPI Normalization Statistics Documentation. https://github.com/Physical-Intelligence/openpi. Docs/norm_stats.md
- [21] Pineau, J.; Vincent-Lamarre, P.; Sinha, K.; Larivière, V.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Larochelle, H. 2021. Improving Reproducibility in Machine Learning Research: A Report from the NeurIPS 2019 Reproducibility Program. Journal of Machine Learning Research, 22(164): 1-20.
- [22] StarVLA Community. 2026. StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing. arXiv:2604.05014.
- [23] Zhang, J.; and Cho, K. 2017. Query-Efficient Imitation Learning for End-to-End Simulated Driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
- [24] Zhao, T. Z.; Kumar, V.; Levine, S.; and Finn, C. 2023. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. arXiv:2304.13705.
- [25] Zhu, Y.; Wong, J.; Mandlekar, A.; Martín-Martín, R.; Joshi, A.; Lin, K.; Maddukuri, A.; Nasiriany, S.; and Zhu, Y. 2020. robosuite: A Modular Simulation Framework and Benchmark for Robot Learning. arXiv:2009.12293.