pith. machine review for the scientific record. sign in

arxiv: 2604.23121 · v1 · submitted 2026-04-25 · 💻 cs.RO · cs.CV

Recognition: unknown

Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

Authors on Pith no claims yet

Pith reviewed 2026-05-08 08:04 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords VLAlock-inpost-trainingsteerabilityvision-language-actionfine-tuningcontrastive guidancedenoising
0
0 comments X

The pith

DeLock breaks lock-in in low-data VLA post-training by preserving visual grounding and applying contrastive test-time prompt guidance to steer toward novel instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action policies lose the ability to follow new instructions after low-data supervised fine-tuning, a problem the paper calls lock-in that shows up as fixation on training objects or spatial targets. The authors argue that the model's pre-trained knowledge already contains what is needed for novel tasks, so no extra data or external supervision is required. DeLock keeps visual grounding intact throughout post-training and adds contrastive prompt guidance at test time to steer the denoising process according to the new instructions. In eight simulation and real-world tests, this method beats strong baselines and performs as well as or better than a state-of-the-art generalist policy that was post-trained on far more curated demonstrations.

Core claim

The pre-trained knowledge inside a VLA policy is already sufficient for novel instructions; lock-in after low-data post-training can be avoided by preserving visual grounding during supervised fine-tuning and steering the policy's denoising dynamics at test time with contrastive prompts that contrast the novel instruction against the locked-in behavior.

What carries the argument

DeLock, a two-part method that preserves visual grounding during low-data SFT and applies test-time contrastive prompt guidance to redirect the policy's denoising toward novel instructions.

If this is right

  • VLA policies can be adapted to new tasks using only small demonstration sets while retaining responsiveness to unseen instructions.
  • Performance on novel instructions can match or exceed that of policies post-trained with substantially larger curated datasets.
  • Lock-in appears in two forms: concept lock-in on training objects and attributes, and spatial lock-in on training targets.
  • Test-time guidance can steer denoising dynamics without requiring retraining or additional task-specific data collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may reduce the data and curation costs of deploying generalist VLAs in new environments.
  • Similar preservation of grounding combined with test-time steering could be tested on other generative control models that overfit during fine-tuning.
  • Future checks could measure whether the method still works when the novel instructions diverge further from pre-training distributions.

Load-bearing premise

The pre-trained VLA model already holds enough knowledge for novel instructions, and preserving grounding plus contrastive guidance will surface that knowledge without new failure modes.

What would settle it

A benchmark where DeLock is applied to instructions that clearly require knowledge absent from the original pre-training and it shows no gain over standard low-data fine-tuning.

Figures

Figures reproduced from arXiv: 2604.23121 by Jeannette Bohg, Jiankai Sun, Jiaqi Shao, Ke Wang, Mac Schwager, Qianzhong Chen, Suning Huang, Yanjiang Guo.

Figure 1
Figure 1. Figure 1: Lock-In Failure Mode. In low-data post￾training, VLA policies can over-specialize into the training-demo distribution, becoming difficult to steer under novel prompts. We highlight concept lock-in under novel object concepts and spatial lock-in under novel spatial relations. Blue and green arrows denote desired and executed trajectories, respectively. In this paper, we introduce DeLock, a simple yet effect… view at source ↗
Figure 2
Figure 2. Figure 2 view at source ↗
Figure 3
Figure 3. Figure 3: Lock-In Failure Evaluation Benchmark. Our 8-task suite spans four LIBERO sim￾ulation tasks and four real-world DROID tasks. Labels [C] and [S] denote concept- and spatial￾lock-in probes, respectively. Yellow arrows illustrate the manipulation pattern demonstrated during post-training. In tasks with shaded regions, we additionally evaluate OOD location shift: the green￾shaded region indicates the object pla… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative Evaluation of Lock-In Failure. (a) BLOCK-STACKING [C], from “stack blue block on green block” to “stack green block on blue block”. Standard SFT shows weak prompt￾conditioned attention shift, while DeLock exhibits clearer instruction-aligned attention reallocation. (b) CUP-TO-BOX [S], evaluated on the novel prompt “put left cup to box”. The red curve shows the observed rollout with DeLock (CPG … view at source ↗
Figure 5
Figure 5. Figure 5: Novel-Prompt Rollouts on Articulated Tasks. We compare DeLock and RETAIN on the two challenging tasks involving articulated objects: OPEN-MICROWAVE [S] and OPEN-LABELED￾DOOR [C+S]. Under novel prompts, RETAIN largely repeats the post-training trajectory and fails to follow the changed spatial/concept specification, whereas DeLock successfully re-steers the learned skill to follow the new instruction. fine-… view at source ↗
Figure 6
Figure 6. Figure 6: Real-World Experimental Setup. C.2 Task Design The benchmark suite is shown in view at source ↗
Figure 7
Figure 7. Figure 7: Full Qualitative Results with Novel Prompts. Additionally, in the MUG-ON-PLATE [S] task, we place the book at the center of the scene and directly prompt the policy with “put book on left plate” without any further fine-tuning. DeLock 18 view at source ↗
Figure 8
Figure 8. Figure 8: Prompted for a Novel Object. D.3 CPG with an Invalid Positive Prompt Finally, we test whether CPG succeeds specifically by leveraging the semantic content of the positive prompt. We again consider the MUG-ON-PLATE [S] task, but replace the intended positive prompt τ + with the malformed instruction “put mug on write plate”. Since “write” does not specify a meaningful spatial target, the resulting positive … view at source ↗
Figure 9
Figure 9. Figure 9: CPG with an Invalid Positive Prompt. 19 view at source ↗
read the original abstract

Have you ever post-trained a generalist vision-language-action (VLA) policy on a small demonstration dataset, only to find that it stops responding to new instructions and is limited to behaviors observed during post-training? We identify this phenomenon as lock-in: after low-data, supervised fine-tuning (SFT), the policy becomes overly specialized to the post-training data and fails to generalize to novel instructions, manifesting as concept lock-in (fixation on training objects/attributes) and spatial lock-in (fixation on training spatial targets). Many existing remedies introduce additional supervision signals, such as those derived from foundation models or auxiliary objectives, or rely on augmented datasets to recover generalization. In this paper, we show that the policy's internal pre-trained knowledge is sufficient: DeLock mitigates lock-in by preserving visual grounding during post-training and applying test-time contrastive prompt guidance to steer the policy's denoising dynamics according to novel instructions. Across eight simulation and real-world evaluations, DeLock consistently outperforms strong baselines and matches or exceeds the performance of a state-of-the-art generalist policy post-trained with substantially more curated demonstrations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript identifies 'lock-in' in vision-language-action (VLA) policies after low-data supervised fine-tuning (SFT), manifesting as concept lock-in (fixation on training objects/attributes) and spatial lock-in (fixation on training spatial targets). It proposes DeLock, which preserves visual grounding during post-training and applies test-time contrastive prompt guidance to steer the policy's denoising dynamics for novel instructions. Across eight simulation and real-world evaluations, DeLock outperforms strong baselines and matches or exceeds a state-of-the-art generalist policy post-trained with substantially more curated demonstrations, supporting the claim that the policy's internal pre-trained knowledge is sufficient.

Significance. If the results hold, this is significant for scalable robot learning: it shows that pre-trained VLA knowledge can be preserved and surfaced for novel tasks using minimal data and no auxiliary supervision or augmented datasets. The consistent gains across diverse evaluations, combined with the method's reliance on internal knowledge rather than external signals, offer an efficient path to maintaining steerability in generalist policies.

major comments (2)
  1. [§4 (Experiments)] §4 (Experiments) and Table 1: The evaluations compare DeLock against post-trained baselines but do not report results for the unmodified pre-trained VLA equipped solely with the test-time contrastive prompt guidance on the novel-instruction tasks. This ablation is load-bearing for the central claim that 'the policy's internal pre-trained knowledge is sufficient' and that DeLock merely preserves access to it; without it, the results remain compatible with the low-data SFT stage introducing or recovering capabilities.
  2. [§3.2 (Method)] §3.2 (Method): The description of how visual grounding preservation during SFT interacts with the contrastive guidance at test time lacks a formal derivation or pseudocode showing that the combined procedure does not alter the pre-trained denoising distribution in ways that could introduce new failure modes on out-of-distribution instructions.
minor comments (2)
  1. [Figure 3] Figure 3: Error bars or standard deviations across runs are not reported for the success rates; this makes it difficult to assess the statistical reliability of the claimed consistent outperformance.
  2. [§3.2] The notation for the contrastive prompt guidance (e.g., the weighting parameter between positive and negative prompts) is introduced without an explicit equation; adding Eq. (X) would improve clarity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that pre-trained VLA knowledge is already sufficient once lock-in is prevented; no free parameters or new entities are visible in the abstract.

axioms (1)
  • domain assumption Pre-trained VLA policies contain sufficient internal knowledge to generalize to novel instructions once lock-in is avoided.
    This premise justifies the claim that no additional supervision signals or larger datasets are required.

pith-pipeline@v0.9.0 · 5520 in / 1229 out tokens · 72952 ms · 2026-05-08T08:04:57.398146+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 47 canonical work pages · 17 internal anchors

  1. [1]

    Intelligence, K

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. pi 05: a vision-language-action model with open-world generalization. pi05: a vision-language-action model with open-world generalization, 2025

  2. [2]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

  3. [3]

    O’Neill, A

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  4. [4]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  6. [6]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  7. [7]

    G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakr- ishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

  8. [8]

    Robocat: A self-improving foundation agent for robotic manipulation.arXiv preprint arXiv:2306.11706, 1(8), 2023

    K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauz ´a, T. Davchev, Y . Zhou, A. Gupta, A. Raju, et al. Robocat: A self-improving generalist agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 2023

  9. [9]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. pi 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  10. [10]

    Y . Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025

  11. [11]

    Y . Guo, T. Lee, L. X. Shi, J. Chen, P. Liang, and C. Finn. Vlaw: Iterative co-improvement of vision-language-action policy and world model.arXiv preprint arXiv:2602.12063, 2026

  12. [12]

    H. Zang, M. Wei, S. Xu, Y . Wu, Z. Guo, Y . Wang, H. Lin, L. Shi, Y . Xie, Z. Xu, et al. Rlinf-vla: A unified and efficient framework for vla+ rl training.arXiv preprint arXiv:2510.06710, 2025

  13. [13]

    P. Li, Y . Wu, Z. Xi, W. Li, Y . Huang, Z. Zhang, Y . Chen, J. Wang, S.-C. Zhu, T. Liu, et al. Controlvla: Few-shot object-centric adaptation for pre-trained vision-language-action models. arXiv preprint arXiv:2506.16211, 2025

  14. [14]

    Moe-dp: An moe-enhanced diffusion policy for robust long-horizon robotic manipulation with skill decomposition and failure recovery.arXiv preprint arXiv:2511.05007, 2025

    B. Cheng, T. Liang, S. Huang, M. Shao, F. Zhang, B. Xu, Z. Xue, and H. Xu. Moe-dp: An moe-enhanced diffusion policy for robust long-horizon robotic manipulation with skill decom- position and failure recovery.arXiv preprint arXiv:2511.05007, 2025

  15. [15]

    Y . Guo, J. Zhang, X. Chen, X. Ji, Y .-J. Wang, Y . Hu, and J. Chen. Improving vision-language- action model with online reinforcement learning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15665–15672. IEEE, 2025. 9

  16. [16]

    X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kir- mani, et al. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941, 2024

  17. [17]

    Rl’s razor: Why online reinforcement learning forgets less, 2025

    I. Shenfeld, J. Pari, and P. Agrawal. Rl’s razor: Why online reinforcement learning forgets less.arXiv preprint arXiv:2509.04259, 2025

  18. [18]

    S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025

  19. [19]

    X. Zhou, Y . Xu, G. Tie, Y . Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. Libero-pro: To- wards robust and fair evaluation of vision-language-action models beyond memorization.arXiv preprint arXiv:2510.03827, 2025

  20. [20]

    Wortsman, G

    M. Wortsman, G. Ilharco, J. W. Kim, M. Li, S. Kornblith, R. Roelofs, R. G. Lopes, H. Ha- jishirzi, A. Farhadi, H. Namkoong, et al. Robust fine-tuning of zero-shot models. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7959– 7971, 2022

  21. [21]

    X. Jin, X. Ren, D. Preotiuc-Pietro, and P. Cheng. Dataless knowledge fusion by merging weights of language models.arXiv preprint arXiv:2212.09849, 2022

  22. [22]

    Robust finetuning of vision- language-action robot policies via parameter merging.arXiv preprint arXiv:2512.08333,

    Y . Yadav, Z. Zhou, A. Wagenmaker, K. Pertsch, and S. Levine. Robust finetuning of vision- language-action robot policies via parameter merging.arXiv preprint arXiv:2512.08333, 2025

  23. [23]

    S. Liu, I. S. Singh, Y . Xu, J. Duan, and R. Krishna. Vls: Steering pretrained robot policies via vision-language models.arXiv preprint arXiv:2602.03973, 2026

  24. [24]

    W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable vision-language-action policies for embodied reasoning and hierarchical control.arXiv preprint arXiv:2602.13193, 2026

  25. [25]

    F. Li, W. Song, H. Zhao, J. Wang, P. Ding, D. Wang, L. Zeng, and H. Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025

  26. [26]

    K., and Panov, A

    N. Kachaev, M. Kolosov, D. Zelezetsky, A. K. Kovalev, and A. I. Panov. Don’t blind your vla: Aligning visual representations for ood generalization, 2025.URL https://arxiv. org/abs/2510.25616, 2(4)

  27. [27]

    Grover, A

    S. Grover, A. Gopalkrishnan, B. Ai, H. I. Christensen, H. Su, and X. Li. Enhancing generaliza- tion in vision-language-action models by preserving pretrained representations.arXiv preprint arXiv:2509.11417, 2025

  28. [28]

    arXiv preprint arXiv:2508.09976 (2025)

    M. Lepert, J. Fang, and J. Bohg. Masquerade: Learning from in-the-wild human videos using data-editing.arXiv preprint arXiv:2508.09976, 2025

  29. [29]

    Punamiya, D

    R. Punamiya, D. Patel, P. Aphiwetsa, P. Kuppili, L. Y . Zhu, S. Kareer, J. Hoffman, and D. Xu. Egobridge: Domain adaptation for generalizable imitation from egocentric human data. In Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans, 2025

  30. [30]

    On the Opportunities and Risks of Foundation Models

    R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

  31. [31]

    Y . Ma, Z. Song, Y . Zhuang, J. Hao, and I. King. A survey on vision-language-action models for embodied ai.arXiv preprint arXiv:2405.14093, 2024. 10

  32. [32]

    Nguyen, M

    T. Nguyen, M. N. Vu, B. Huang, A. Vuong, Q. Vuong, N. Le, T. V o, and A. Nguyen. Language- driven 6-dof grasp detection using negative prompt guidance. InEuropean Conference on Computer Vision, pages 363–381. Springer, 2024

  33. [33]

    Y . Ban, R. Wang, T. Zhou, M. Cheng, B. Gong, and C.-J. Hsieh. Understanding the impact of negative prompts: When and how do they take effect? Ineuropean conference on computer vision, pages 190–206. Springer, 2024

  34. [34]

    J. Jang, S. Ye, and M. Seo. Can large language models truly understand prompts? a case study with negated prompts. InTransfer learning for natural language processing workshop, pages 52–62. PMLR, 2023

  35. [35]

    D. Wan, J. Cho, E. Stengel-Eskin, and M. Bansal. Contrastive region guidance: Improving grounding in vision-language models without training. InEuropean Conference on Computer Vision, pages 198–215. Springer, 2024

  36. [36]

    Jeong, J

    J. Jeong, J. Kim, G. Lee, Y . Choi, and Y . Uh. Stylekeeper: Prevent content leakage using negative visual query guidance. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15760–15769, 2025

  37. [37]

    B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  38. [38]

    Zitkovich, T

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  39. [39]

    R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al. Palm 2 technical report.arXiv preprint arXiv:2305.10403, 2023

  40. [40]

    Q. Chen, J. Yu, M. Schwager, P. Abbeel, Y . Shentu, and P. Wu. Sarm: Stage-aware reward modeling for long horizon robot manipulation.arXiv preprint arXiv:2509.25358, 2025

  41. [41]

    Zero-shot robotic manipu- lation with pretrained image-editing diffusion models,

    K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine. Zero- shot robotic manipulation with pretrained image-editing diffusion models.arXiv preprint arXiv:2310.10639, 2023

  42. [42]

    O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  43. [43]

    L. Fu, H. Huang, G. Datta, L. Y . Chen, W. C.-H. Panitch, F. Liu, H. Li, and K. Goldberg. In- context imitation learning via next-token prediction.arXiv preprint arXiv:2408.15980, 2024

  44. [44]

    Lossless adaptation of pre- trained vision models for robotic manipulation,

    M. Sharma, C. Fantacci, Y . Zhou, S. Koppula, N. Heess, J. Scholz, and Y . Aytar. Lossless adap- tation of pretrained vision models for robotic manipulation.arXiv preprint arXiv:2304.06600, 2023

  45. [45]

    M. Pan, S. Feng, Q. Zhang, X. Li, J. Song, C. Qu, Y . Wang, C. Li, Z. Xiong, Z. Chen, et al. Sop: A scalable online post-training system for vision-language-action models.arXiv preprint arXiv:2601.03044, 2026

  46. [46]

    Parallels between vla model post-training and human motor learning: Progress, challenges, and trends.arXiv preprint arXiv:2506.20966, 2025

    T.-Y . Xiang, A.-Q. Jin, X.-H. Zhou, M.-J. Gui, X.-L. Xie, S.-Q. Liu, S.-Y . Wang, S.-B. Duan, F.-C. Xie, W.-K. Wang, et al. Parallels between vla model post-training and human motor learning: Progress, challenges, and trends.arXiv preprint arXiv:2506.20966, 2025

  47. [47]

    Huang, Z

    S. Huang, Z. Zhang, T. Liang, Y . Xu, Z. Kou, C. Lu, G. Xu, Z. Xue, and H. Xu. Mentor: Mixture-of-experts network with task-oriented perturbation for visual reinforcement learning. arXiv preprint arXiv:2410.14972, 2024. 11

  48. [48]

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  49. [49]

    H. Li, Y . Zuo, J. Yu, Y . Zhang, Z. Yang, K. Zhang, X. Zhu, Y . Zhang, T. Chen, G. Cui, et al. Simplevla-rl: Scaling vla training via reinforcement learning.arXiv preprint arXiv:2509.09674, 2025

  50. [50]

    Y . Xing, X. Luo, J. Xie, L. Gao, H. Shen, and J. Song. Shortcut learning in generalist robot policies: The role of dataset diversity and fragmentation.arXiv preprint arXiv:2508.06426, 2025

  51. [51]

    L. Wang, J. Zhao, Y . Du, E. H. Adelson, and R. Tedrake. Poco: Policy composition from and for heterogeneous robot learning.arXiv preprint arXiv:2402.02511, 2024

  52. [52]

    J. Cao, Y . Huang, H. Guo, R. Zhang, M. Nan, W. Mai, J. Wang, H. Cheng, J. Sun, G. Han, et al. Compose your policies! improving diffusion-based or flow-based robot policies via test-time distribution-level composition.arXiv preprint arXiv:2510.01068, 2025

  53. [53]

    Dynaguide: Steering diffusion polices with active dynamic guidance.arXiv preprint arXiv:2506.13922, 2025

    M. Du and S. Song. Dynaguide: Steering diffusion polices with active dynamic guidance. arXiv preprint arXiv:2506.13922, 2025

  54. [54]

    Sun and S

    Z. Sun and S. Song. Latent policy barrier: Learning robust visuomotor policies by staying in-distribution.arXiv preprint arXiv:2508.05941, 2025

  55. [55]

    Huang, Q

    S. Huang, Q. Chen, X. Zhang, J. Sun, and M. Schwager. Particleformer: A 3d point cloud world model for multi-object, multi-material robotic manipulation.arXiv preprint arXiv:2506.23126, 2025

  56. [56]

    Koulischer, J

    F. Koulischer, J. Deleu, G. Raya, T. Demeester, and L. Ambrogioni. Dynamic negative guid- ance of diffusion models.arXiv preprint arXiv:2410.14398, 2024

  57. [57]

    Steering your generalists: Improving robotic foundation models via value guidance.arXiv preprint arXiv:2410.13816, 2024

    M. Nakamoto, O. Mees, A. Kumar, and S. Levine. Steering your generalists: Improving robotic foundation models via value guidance.arXiv preprint arXiv:2410.13816, 2024

  58. [58]

    Steering your diffusion policy with latent space reinforcement learning

    A. Wagenmaker, M. Nakamoto, Y . Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning.arXiv preprint arXiv:2506.15799, 2025

  59. [59]

    M. Xu, Z. Xu, C. Chi, M. Veloso, and S. Song. Xskill: Cross embodiment skill discovery. In Conference on robot learning, pages 3536–3555. PMLR, 2023

  60. [60]

    Z. Li, J. Liu, Z. Dong, T. Teng, Q. Rouxel, D. Caldwell, and F. Chen. Towards deploying vla without fine-tuning: Plug-and-play inference-time vla policy steering via embodied evolution- ary diffusion.arXiv preprint arXiv:2511.14178, 2025

  61. [61]

    PaliGemma: A versatile 3B VLM for transfer

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024. 12 Appendix A Pseudocode for Training and Inference For completeness, we provide pseudocode for the two core components ofDeLock. Algorith...