EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training
Pith reviewed 2026-05-10 02:12 UTC · model grok-4.3
The pith
Mid-training VLMs on curated, VLA-aligned data improves robot manipulation performance and yields a stronger initialization for VLA fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EmbodiedMidtrain first maps how VLA data occupies compact regions largely separated from broader VLM distributions, with alignment varying across and within sources. It then uses a lightweight learnable proximity estimator to curate the most VLA-aligned candidates from a VLM pool, mid-trains the VLM on this mixture, and shows that the resulting initialization produces stronger VLA fine-tuning on manipulation tasks.
What carries the argument
The lightweight learnable proximity estimator that selects VLA-aligned candidates from a broad VLM data pool to form the mid-training mixture.
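The review does not spell out the estimator's exact form, so the sketch below is only a minimal illustration of the general recipe: embed VLA samples and VLM-pool samples with a frozen encoder, fit a small learnable scorer to separate them, and keep the pool samples scored closest to the VLA distribution. The frozen-encoder assumption, the logistic-regression scorer, and the keep ratio are illustrative choices, not the authors' design.

```python
# Hypothetical sketch of a lightweight proximity estimator (not the authors' exact design).
# Assumption: each sample is already encoded into a fixed-size embedding by a frozen VLM encoder.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_proximity_estimator(vla_embs: np.ndarray, vlm_embs: np.ndarray) -> LogisticRegression:
    """Fit a binary scorer: label 1 = VLA sample, label 0 = generic VLM-pool sample."""
    X = np.concatenate([vla_embs, vlm_embs], axis=0)
    y = np.concatenate([np.ones(len(vla_embs)), np.zeros(len(vlm_embs))])
    return LogisticRegression(max_iter=1000).fit(X, y)

def select_vla_aligned(estimator: LogisticRegression, pool_embs: np.ndarray,
                       keep_ratio: float = 0.2) -> np.ndarray:
    """Return indices of the pool samples scored closest to the VLA distribution."""
    scores = estimator.predict_proba(pool_embs)[:, 1]   # P(sample is VLA-like)
    k = int(keep_ratio * len(pool_embs))
    return np.argsort(-scores)[:k]                      # top-k indices by proximity score

# Usage (shapes illustrative): curate a mid-training mixture from the VLM pool.
# est = train_proximity_estimator(encode(vla_data), encode(vlm_pool))
# midtrain_ids = select_vla_aligned(est, encode(vlm_pool), keep_ratio=0.2)
```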
If this is right
- Performance gains appear consistently across different VLM backbones on three robot manipulation benchmarks.
- Results become competitive with expert VLAs and with off-the-shelf VLMs trained using larger model scales and budgets.
- Mid-training supplies a stronger initialization for VLA fine-tuning, with advantages visible from the earliest steps and widening over training.
- The data engine captures both dataset-level and sample-level alignment signals and favors spatial reasoning over purely text-centric tasks.
Where Pith is reading between the lines
- The same selection engine could be reused to adapt general models to other embodied domains such as navigation or human-robot interaction with modest additional cost.
- Measuring proximity scores before and after mid-training might offer a diagnostic for how much domain shift remains between VLM and target VLA tasks.
- If the estimator generalizes, it could reduce the total data volume needed to reach a given VLA performance level.
Load-bearing premise
The lightweight learnable proximity estimator can accurately identify the most VLA-aligned candidates from a broad VLM data pool and mid-training on this curated mixture reliably provides a stronger initialization for subsequent VLA fine-tuning.
What would settle it
Running the same VLA fine-tuning after mid-training on randomly sampled VLM data (matched in volume) instead of proximity-selected data, and checking whether the improvement on the three robot manipulation benchmarks disappears.
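A minimal sketch of that control, with hypothetical stubs standing in for the mid-training, fine-tuning, and evaluation stack and placeholder benchmark names; the point is only that the comparison holds the data budget and the downstream recipe fixed while swapping curated data for a random sample.

```python
# Hypothetical control experiment: does proximity-based curation beat random sampling
# at matched data volume? The training/eval calls below are stubs standing in for the
# real stack; only the experimental design is the point.
import random

BENCHMARKS = ["benchmark_a", "benchmark_b", "benchmark_c"]  # stand-ins for the three suites

def midtrain(base_vlm, mixture):      # stub: mid-train the VLM on the chosen mixture
    raise NotImplementedError
def finetune_vla(vlm, vla_data):      # stub: identical downstream VLA fine-tuning
    raise NotImplementedError
def evaluate(vla_model, benchmark):   # stub: task success rate on one benchmark
    raise NotImplementedError

def run_condition(base_vlm, mixture, vla_data):
    vla_model = finetune_vla(midtrain(base_vlm, mixture), vla_data)
    return {b: evaluate(vla_model, b) for b in BENCHMARKS}

def settle(base_vlm, vlm_pool, proximity_ids, vla_data, seed=0):
    random.seed(seed)
    selected = [vlm_pool[i] for i in proximity_ids]
    matched_random = random.sample(list(vlm_pool), len(selected))  # same budget, no curation
    return {
        "proximity-selected": run_condition(base_vlm, selected, vla_data),
        "random": run_condition(base_vlm, matched_random, vla_data),
    }

# If the two conditions tie across all three benchmarks (ideally over several seeds),
# the curation step, rather than mid-training itself, is doing no work.
```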
Original abstract
Vision-Language-Action Models (VLAs) inherit their visual and linguistic capabilities from Vision-Language Models (VLMs), yet most VLAs are built from off-the-shelf VLMs that are not adapted to the embodied domain, limiting their downstream performance. In this work, we propose EmbodiedMidtrain to bridge the gap between VLMs and VLAs. We first characterize the data distribution gap between them, showing that VLA data occupy compact regions that are largely separated from the broader VLM distribution, while the degree of alignment varies substantially both across and within VLM data sources. Then, we build a mid-training data engine that leverages a lightweight learnable proximity estimator to select the most VLA-aligned candidates from a large VLM pool, and mid-trains the VLM on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show that mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and off-the-shelf VLMs trained with larger model scale and training budgets. Further analysis reveals that mid-training provides a stronger initialization for VLA fine-tuning, with gains emerging from the earliest steps and widening throughout training. Moreover, the data engine captures both dataset-level and sample-level alignment signals, favoring spatial reasoning over text-centric tasks while preserving the diversity of the VLM data. We will release all code, data and models for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EmbodiedMidtrain to bridge the gap between VLMs and VLAs. It first characterizes the data distribution gap, showing that VLA data occupy compact regions largely separated from the broader VLM distribution, with alignment varying across and within sources. A mid-training data engine then uses a lightweight learnable proximity estimator to curate the most VLA-aligned candidates from a large VLM pool. The VLM is mid-trained on this mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show consistent performance improvements across VLM backbones, achieving results competitive with expert VLAs and larger-scale off-the-shelf VLMs, with gains emerging early and widening during fine-tuning. The data engine captures both dataset- and sample-level signals, favoring spatial reasoning while preserving diversity.
Significance. If the empirical claims hold under rigorous controls, the work provides a practical and efficient pathway for adapting general-purpose VLMs to embodied tasks without requiring full-scale VLA pretraining from scratch. The early emergence of benefits and the dual dataset/sample-level analysis strengthen the case that mid-training supplies a stronger initialization. Explicit commitment to releasing code, data, and models is a clear strength for reproducibility and community follow-up in vision-language-action research.
Major comments (2)
- Abstract and Experiments section: the central claim of 'consistent improvements' and 'competitive with expert VLAs and off-the-shelf VLMs trained with larger model scale and training budgets' is load-bearing yet unsupported by any reported baseline details, model sizes, training budgets, statistical tests, or variance across runs. Without these, the magnitude and reliability of the gains cannot be assessed.
- Method section on the proximity estimator: the lightweight learnable proximity estimator is the key mechanism for data curation and is presented as accurately identifying VLA-aligned candidates, but no validation (e.g., correlation with downstream VLA performance, ablation replacing it with random or heuristic selection, or precision against held-out VLA data) is described. This assumption directly determines whether the reported gains are attributable to the proposed mid-training pipeline.
Minor comments (2)
- Abstract: the three robot manipulation benchmarks are not named. Adding their identities (and a brief reference to the specific metrics used) would improve immediate readability.
- The paper states it will release code, data, and models; confirming the exact repositories or DOIs in the camera-ready version would strengthen the reproducibility claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our manuscript. We address each major comment point by point below, providing clarifications from the current manuscript and indicating where revisions will be made to strengthen the presentation and evidence.
Point-by-point responses
Referee: Abstract and Experiments section: the central claim of 'consistent improvements' and 'competitive with expert VLAs and off-the-shelf VLMs trained with larger model scale and training budgets' is load-bearing yet unsupported by any reported baseline details, model sizes, training budgets, statistical tests, or variance across runs. Without these, the magnitude and reliability of the gains cannot be assessed.
Authors: We agree that the abstract's claims require clear supporting details for readers to evaluate the improvements. The manuscript reports results across three robot manipulation benchmarks and multiple VLM backbones, with comparisons to expert VLAs and off-the-shelf models. However, to directly address this concern, we will revise the Experiments section to explicitly tabulate model sizes, training budgets (including epochs and data scale), any statistical tests, and run-to-run variance. We will also ensure the abstract phrasing is precisely aligned with these details. This will allow proper assessment of the gains' magnitude and reliability. Revision: yes.
Referee: Method section on the proximity estimator: the lightweight learnable proximity estimator is the key mechanism for data curation and is presented as accurately identifying VLA-aligned candidates, but no validation (e.g., correlation with downstream VLA performance, ablation replacing it with random or heuristic selection, or precision against held-out VLA data) is described. This assumption directly determines whether the reported gains are attributable to the proposed mid-training pipeline.
Authors: We acknowledge that direct validation of the proximity estimator is necessary to attribute gains specifically to the curation pipeline. The manuscript includes analysis demonstrating that the data engine captures both dataset-level and sample-level signals while favoring spatial reasoning and preserving diversity. To strengthen this, we will add in revision: (1) ablations comparing the learnable estimator against random selection and heuristic baselines, and (2) any available correlations between proximity scores and downstream VLA performance, along with precision metrics on held-out VLA data. These additions will clarify the estimator's contribution. Revision: yes.
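A minimal sketch of the two checks promised above, assuming per-source proximity scores and per-source downstream gains are available as paired arrays, and that a held-out VLA split can be scored together with generic VLM negatives; names and signatures here are illustrative, not from the manuscript.

```python
# Hypothetical validation of the proximity estimator (not reported in the current manuscript).
# Assumes: `scores_by_source` and `gains_by_source` are paired per-data-source arrays, and
# `heldout_scores` / `heldout_labels` come from scoring a held-out VLA split mixed with
# generic VLM negatives (label 1 = true VLA sample).
import numpy as np
from scipy.stats import spearmanr

def score_gain_correlation(scores_by_source: np.ndarray, gains_by_source: np.ndarray) -> float:
    """Rank correlation between a source's mean proximity score and the downstream gain it yields."""
    rho, _ = spearmanr(scores_by_source, gains_by_source)
    return float(rho)

def precision_at_k(heldout_scores: np.ndarray, heldout_labels: np.ndarray, k: int) -> float:
    """Fraction of true VLA samples among the k highest proximity scores."""
    top_k = np.argsort(-heldout_scores)[:k]
    return float(heldout_labels[top_k].mean())
```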
Circularity Check
No significant circularity detected in the derivation or claims
Full rationale
The paper's pipeline—characterizing the VLM/VLA distribution gap via analysis, training a separate lightweight proximity estimator on available data to curate a mid-training mixture from the VLM pool, performing mid-training, and then evaluating VLA fine-tuning gains—is self-contained. All performance claims rest on downstream results from three independent robot manipulation benchmarks rather than any definitional equivalence, fitted parameter renamed as prediction, or self-citation chain. No equations or steps reduce the claimed improvements to the inputs by construction; the estimator's selection and the resulting initialization benefits are externally validated through benchmark metrics.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Proximity estimator parameters
Axioms (1)
- Domain assumption: VLA data occupy compact regions largely separated from the broader VLM distribution