PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
Pith reviewed 2026-05-10 00:00 UTC · model grok-4.3
The pith
A compact two-stage model pre-trains broad vision-language skills on 2.4 million samples, then aligns them to robot actions for stronger manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PokeVLA shows that a pocket-sized VLA model reaches strong performance on manipulation tasks by pre-training a compact vision-language model on 2.4M multimodal samples covering spatial grounding, affordance, and embodied reasoning, then injecting the resulting representations into the action space via multi-view goal-aware semantics learning, geometry alignment, and a novel action expert.
What carries the argument
A two-stage training paradigm that first pre-trains a vision-language model on spatial and reasoning tasks, then injects its outputs into action generation through multi-view semantics, geometry alignment, and an action expert.
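As a loose illustration only (not the paper's implementation), the two-stage idea — fit a semantic encoder first, then train only an action head on its frozen features — can be sketched with toy linear models. All names and dimensions below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (stand-ins for the paper's modalities):
# obs  = flattened multi-view observations
# sem  = semantic targets used in stage-1 pre-training (grounding/affordance)
# act  = robot actions, assumed here to depend only on the semantics
obs = rng.normal(size=(500, 32))
true_enc = rng.normal(size=(32, 8))
sem = obs @ true_enc
act = sem @ rng.normal(size=(8, 4))

# Stage 1: "pre-train" the encoder on vision-language targets
# (closed-form least squares stands in for gradient training).
W_enc, *_ = np.linalg.lstsq(obs, sem, rcond=None)

# Stage 2: freeze the encoder; train only the action head on its features.
feats = obs @ W_enc
W_act, *_ = np.linalg.lstsq(feats, act, rcond=None)

pred = feats @ W_act
err = np.abs(pred - act).max()
print(f"max action error after two-stage fit: {err:.2e}")
```

The point of the toy: when actions really are a function of the pre-trained representation, stage 2 needs to learn only a small head, which is the efficiency argument behind compact VLA designs.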
If this is right
- Higher success rates and robustness appear on the LIBERO-Plus benchmark and in physical robot deployments compared with comparable models.
- The model remains effective under varied lighting, object placements, and other perturbations.
- Open release of code, weights, and dataset scripts enables direct reproduction and extension by others.
- The approach supports running capable manipulation policies on hardware with tight compute and memory limits.
Where Pith is reading between the lines
- The same pre-training-plus-alignment pattern could be tested on navigation or mobile manipulation tasks that also require spatial reasoning.
- Geometry alignment might be reused in standalone vision-language models to improve 3D understanding without robot data.
- Scaling the pre-training dataset size further while keeping the model compact could produce even stronger small-scale controllers.
Load-bearing premise
The representations learned in the vision-language pre-training stage transfer directly to robot actions through the alignment steps without large domain gaps or extra heavy retraining.
What would settle it
Removing the pre-training stage or the geometry-alignment and action-expert modules produces success rates on LIBERO-Plus and real-world robot trials that match or fall below those of standard lightweight baselines under the same perturbations.
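Such an ablation would be read quantitatively by comparing per-episode success outcomes with confidence intervals. The sketch below uses synthetic placeholder outcomes, not results from the paper; the real arrays would come from LIBERO-Plus rollouts of the full model versus an ablated one:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-episode outcomes (1 = success): full model vs. an
# ablation without the pre-training stage. Success probabilities are invented.
full = rng.binomial(1, 0.85, size=200)
ablated = rng.binomial(1, 0.70, size=200)

def bootstrap_ci(outcomes, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean success rate."""
    r = np.random.default_rng(seed)
    idx = r.integers(0, len(outcomes), size=(n_boot, len(outcomes)))
    means = outcomes[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo_f, hi_f = bootstrap_ci(full)
lo_a, hi_a = bootstrap_ci(ablated)
print(f"full:    {full.mean():.2f} [{lo_f:.2f}, {hi_f:.2f}]")
print(f"ablated: {ablated.mean():.2f} [{lo_a:.2f}, {hi_a:.2f}]")
# Non-overlapping intervals would indicate the ablated component matters.
```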
Original abstract
Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PokeVLA, a lightweight Vision-Language-Action model for embodied robot manipulation. It employs a two-stage training paradigm: (1) pre-training a compact vision-language model (PokeVLM) on a curated 2.4M-sample multimodal dataset covering spatial grounding, affordance, and embodied reasoning tasks; (2) injecting the learned representations into the action space via multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. The central empirical claim is state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployments, with improved success rates and robustness under perturbations, accompanied by a commitment to open-source the code, weights, and dataset scripts.
Significance. If the transfer mechanism is shown to work without substantial domain shift or additional tuning, the approach could provide an efficient route to infusing high-level world knowledge into small-scale VLA models, potentially improving sample efficiency and generalization in manipulation tasks. The open-sourcing pledge would further support reproducibility in the robotics community.
Major comments (3)
- Abstract and §4 (Experiments): The claim of 'state-of-the-art performance' and 'outperforming comparable baselines in success rate and robustness' is presented without any numerical results, baseline tables, ablation deltas, or error bars. This absence directly undermines assessment of the central empirical claim, as no quantitative evidence is supplied for the reported SOTA results on LIBERO-Plus or real-world tests.
- §3 (Method, two-stage paradigm): The transfer step from pre-trained PokeVLM representations to action space via multi-view goal-aware semantics, geometry alignment, and the action expert is load-bearing for the performance claims, yet no ablation studies, domain-shift metrics (e.g., feature distribution distances), or loss-function details are provided to demonstrate that alignment occurs with minimal tuning and without degradation.
- §4 (Experiments): No quantitative validation is given for the weakest assumption that pre-trained VLM representations transfer effectively; the manuscript lacks reported success-rate deltas attributable to each injection module or comparisons against a direct fine-tuning baseline without the proposed alignment components.
Minor comments (2)
- The abstract and method sections use several undefined or inconsistently introduced acronyms (e.g., VLA, VLM) without an initial glossary or expansion on first use.
- Figure captions and table references in the experimental section should explicitly link each result to the corresponding ablation or baseline configuration for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas where the empirical support for our claims can be strengthened, and we will revise the paper to address them directly.
Point-by-point responses
-
Referee: Abstract and §4 (Experiments): The claim of 'state-of-the-art performance' and 'outperforming comparable baselines in success rate and robustness' is presented without any numerical results, baseline tables, ablation deltas, or error bars. This absence directly undermines assessment of the central empirical claim, as no quantitative evidence is supplied for the reported SOTA results on LIBERO-Plus or real-world tests.
Authors: We agree that the absence of explicit numerical results in the current version limits the ability to fully evaluate the SOTA claims. In the revised manuscript, we will add comprehensive tables in §4 reporting success rates on LIBERO-Plus for PokeVLA versus all baselines, real-world deployment metrics, standard deviations across runs as error bars, and ablation deltas. The abstract will remain a high-level summary but will reference the new quantitative results in the experiments section. revision: yes
-
Referee: §3 (Method, two-stage paradigm): The transfer step from pre-trained PokeVLM representations to action space via multi-view goal-aware semantics, geometry alignment, and the action expert is load-bearing for the performance claims, yet no ablation studies, domain-shift metrics (e.g., feature distribution distances), or loss-function details are provided to demonstrate that alignment occurs with minimal tuning and without degradation.
Authors: The transfer mechanisms are indeed central, and we acknowledge the need for supporting analyses. We will expand the experiments to include ablation studies isolating each component (multi-view semantics, geometry alignment, action expert), provide the exact loss formulations used for alignment, and report domain-shift metrics such as feature distribution distances or similarity scores to demonstrate effective transfer without substantial degradation or heavy tuning. revision: yes
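For concreteness, one common form of the "feature distribution distance" promised above is a kernel MMD between pre-alignment and post-alignment embeddings. The sketch below uses synthetic features and an RBF kernel; it is an illustration of the metric, not the authors' procedure:

```python
import numpy as np

rng = np.random.default_rng(2)

def mmd2_rbf(x, y, sigma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel: a simple
    distribution distance between two sets of feature embeddings."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Stand-ins: features from the pre-trained VLM, the same inputs after a
# mild action-alignment drift, and a deliberately shifted control set.
pre = rng.normal(size=(200, 16))
aligned = pre + rng.normal(scale=0.05, size=pre.shape)  # small domain gap
shifted = pre + 2.0                                     # large domain gap

print(f"aligned vs pre: {mmd2_rbf(pre, aligned):.4f}")
print(f"shifted vs pre: {mmd2_rbf(pre, shifted):.4f}")
```

A small MMD between pre- and post-alignment features would support the "minimal domain gap" claim; a large one would indicate the alignment step is doing heavy representational rewriting.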
-
Referee: §4 (Experiments): No quantitative validation is given for the weakest assumption that pre-trained VLM representations transfer effectively; the manuscript lacks reported success-rate deltas attributable to each injection module or comparisons against a direct fine-tuning baseline without the proposed alignment components.
Authors: We agree that direct quantitative validation of the transfer is necessary. The revised §4 will include success-rate deltas attributable to each injection module and an explicit baseline comparison of direct fine-tuning of the pre-trained PokeVLM on manipulation tasks without the multi-view semantics and geometry alignment components. This will allow clear assessment of the transfer effectiveness. revision: yes
Circularity Check
No circularity: empirical model training and benchmark evaluation
Full rationale
The paper proposes a two-stage training paradigm for PokeVLA: pre-training PokeVLM on a 2.4M-sample multimodal dataset for spatial grounding, affordance, and embodied reasoning, followed by injection of representations into the action space via multi-view goal-aware semantics, geometry alignment, and an action expert. All performance claims rest on empirical results from the LIBERO-Plus benchmark and real-world deployment, with no equations, derivations, fitted parameters renamed as predictions, or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to close any loop. The central claims are therefore self-contained experimental outcomes rather than reductions to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong et al., "OpenVLA: An open-source vision-language-action model," in Conference on Robot Learning, 2024, pp. 2679–2713.
- [2] M. J. Kim, C. Finn, and P. Liang, "Fine-tuning vision-language-action models: Optimizing speed and success," in Robotics: Science and Systems, 2025.
- [3] A. O'Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain et al., "Open X-Embodiment: Robotic learning datasets and RT-X models," in IEEE International Conference on Robotics and Automation, 2024, pp. 6892–6903.
- [4] W. Zhang, H. Liu, Z. Qi, Y. Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang et al., "DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge," in Advances in Neural Information Processing Systems, 2025.
- [5] W. Song, Z. Zhou, H. Zhao, J. Chen, P. Ding, H. Yan, Y. Huang, F. Tang, D. Wang, and H. Li, "ReconVLA: Reconstructive vision-language-action model as effective robot perceiver," in Proceedings of the AAAI Conference on Artificial Intelligence, 2026.
- [6] Y. Wang, P. Ding, L. Li, C. Cui, Z. Ge, X. Tong, W. Song, H. Zhao, W. Zhao, P. Hou et al., "VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model," in Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
- [7] S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh, "Prismatic VLMs: Investigating the design space of visually-conditioned language models," in International Conference on Machine Learning, 2024.
- [8] F. Li, W. Song, H. Zhao, J. Wang, P. Ding, D. Wang, L. Zeng, and H. Li, "Spatial forcing: Implicit spatial representation alignment for vision-language-action model," arXiv preprint arXiv:2510.12276, 2025.
- [9] S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei et al., "LIBERO-Plus: In-depth robustness analysis of vision-language-action models," arXiv preprint arXiv:2510.13626, 2025.
- [10] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, "LIBERO: Benchmarking knowledge transfer for lifelong robot learning," in Advances in Neural Information Processing Systems, 2023.
- [11] T. Zhao, V. Kumar, S. Levine, and C. Finn, "Learning fine-grained bimanual manipulation with low-cost hardware," in Robotics: Science and Systems, 2023.
- [12] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, "Diffusion policy: Visuomotor policy learning via action diffusion," The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025.
- [13] Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, "3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations," in Robotics: Science and Systems, 2024.
- [14] D. Ghosh, H. R. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo et al., "Octo: An open-source generalist robot policy," in Robotics: Science and Systems, 2024.
- [15] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., "RT-1: Robotics Transformer for real-world control at scale," in Robotics: Science and Systems, 2023.
- [16] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., "RT-2: Vision-language-action models transfer web knowledge to robotic control," in Conference on Robot Learning, 2023, pp. 2165–2183.
- [17] X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu, C. Cheang, Y. Jing, W. Zhang, H. Liu et al., "Vision-language foundation models as effective robot imitators," in International Conference on Learning Representations, 2024.
- [18] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., "π0: A vision-language-action flow model for general robot control," in Robotics: Science and Systems, 2025.
- [19] K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai et al., "π0.5: A vision-language-action model with open-world generalization," in Conference on Robot Learning, 2025.
- [20] H. Liu, X. Li, P. Li, M. Liu, D. Wang, J. Liu, B. Kang, X. Ma, T. Kong, and H. Zhang, "Towards generalist robot policies: What matters in building vision-language-action models," arXiv preprint arXiv:2412.14058, 2024.
- [21] Y. Fan, S. Bai, X. Tong, P. Ding, Y. Zhu, H. Lu, F. Dai, W. Zhao, Y. Liu, S. Huang et al., "Long-VLA: Unleashing long-horizon capability of vision language action model for robot manipulation," in Conference on Robot Learning, 2025, pp. 2018–2037.
- [22] Q. Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li, "UniVLA: Learning to act anywhere with task-centric latent actions," in Robotics: Science and Systems, 2025.
- [23] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang et al., "GR00T N1: An open foundation model for generalist humanoid robots," arXiv preprint arXiv:2503.14734, 2025.
- [24] C. Cui, P. Ding, W. Song, S. Bai, X. Tong, Z. Ge, R. Suo, W. Zhou, Y. Liu, B. Jia et al., "OpenHelix: A short survey, empirical analysis, and open-source dual-system VLA model for robotic manipulation," arXiv preprint arXiv:2505.03912, 2025.
- [25] Q. Bu, J. Zeng, L. Chen, Y. Yang, G. Zhou, J. Yan, P. Luo, H. Cui, Y. Ma, and H. Li, "Closed-loop visuomotor control with generative expectation for robotic manipulation," in Advances in Neural Information Processing Systems, 2024, pp. 139002–139029.
- [26] J. Wen, Y. Zhu, M. Zhu, Z. Tang, J. Li, Z. Zhou, X. Liu, C. Shen, Y. Peng, and F. Feng, "DiffusionVLA: Scaling robot foundation models via unified diffusion and autoregression," in International Conference on Machine Learning, 2025.
- [27] F. Lin, R. Nai, Y. Hu, J. You, J. Zhao, and Y. Gao, "OneTwoVLA: A unified vision-language-action model with adaptive reasoning," arXiv preprint arXiv:2505.11917, 2025.
- [28] S. Yang, H. Li, Y. Chen, B. Wang, Y. Tian, T. Wang, H. Wang, F. Zhao, Y. Liao, and J. Pang, "InstructVLA: Vision-language-action instruction tuning from understanding to manipulation," arXiv preprint arXiv:2507.17520, 2025.
- [29] M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti et al., "SmolVLA: A vision-language-action model for affordable and efficient robotics," arXiv preprint arXiv:2506.01844, 2025.
- [30] J. Wen, Y. Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen et al., "TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation," IEEE Robotics and Automation Letters, 2025.
- [31] C. Gao, Z. Liu, Z. Chi, J. Huang, X. Fei, Y. Hou, Y. Zhang, Y. Lin, Z. Fang, Z. Jiang et al., "VLA-OS: Structuring and dissecting planning representations and paradigms in vision-language-action models," in Advances in Neural Information Processing Systems, 2025.
- [32] H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du, Y. Hong, and C. Gan, "3D-VLA: A 3D vision-language-action generative world model," in International Conference on Machine Learning, 2024, pp. 61229–61245.
- [33] J. Zhang, Y. Chen, Y. Xu, Z. Huang, Y. Zhou, Y.-J. Yuan, X. Cai, G. Huang, X. Quan, H. Xu et al., "4D-VLA: Spatiotemporal vision-language-action pretraining with cross-scene calibration," in Advances in Neural Information Processing Systems, 2025.
- [34] D. Qu, H. Song, Q. Chen, Y. Yao, X. Ye, Y. Ding, Z. Wang, J. Gu, B. Zhao, D. Wang et al., "SpatialVLA: Exploring spatial representations for visual-language-action model," in Robotics: Science and Systems, 2025.
- [35] Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn et al., "CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 1702–1713.
- [36] C. Li, J. Wen, Y. Peng, Y. Peng, F. Feng, and Y. Zhu, "PointVLA: Injecting the 3D world into vision-language-action models," arXiv preprint arXiv:2503.07511, 2025.
- [37] L. Sun, B. Xie, Y. Liu, H. Shi, T. Wang, and J. Cao, "GeoVLA: Empowering 3D representations in vision-language-action models," arXiv preprint arXiv:2508.09071, 2025.
- [38] P. Li, Y. Chen, H. Wu, X. Ma, X. Wu, Y. Huang, L. Wang, T. Kong, and T. Tan, "BridgeVLA: Input-output alignment for efficient 3D manipulation learning with vision-language models," in Advances in Neural Information Processing Systems, 2025.
- [39] Y. Tian, S. Yang, J. Zeng, P. Wang, D. Lin, H. Dong, and J. Pang, "Predictive inverse dynamics models are scalable learners for robotic manipulation," in International Conference on Learning Representations, 2025.
- [40] X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia, "LISA: Reasoning segmentation via large language model," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9579–9589.
- [41] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., "Planning-oriented autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17853–17862.
- [42] B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, "VAD: Vectorized scene representation for efficient autonomous driving," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8340–8350.
- [43] H. Liu, C. Li, Y. Li, and Y. J. Lee, "Improved baselines with visual instruction tuning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- [44] E. Zhou, J. An, C. Chi, Y. Han, S. Rong, C. Zhang, P. Wang, Z. Wang, T. Huang, L. Sheng, and S. Zhang, "RoboRefer: Towards spatial referring with reasoning in vision-language models for robotics," in Advances in Neural Information Processing Systems, 2025.
- [45] A. Azzolini, J. Bai, H. Brandon, J. Cao, P. Chattopadhyay, H. Chen, J. Chu, Y. Cui, J. Diamond, Y. Ding et al., "Cosmos-Reason1: From physical common sense to embodied reasoning," arXiv preprint arXiv:2503.15558, 2025.
- [46] W. Yuan, J. Duan, V. Blukis, W. Pumacay, R. Krishna, A. Murali, A. Mousavian, and D. Fox, "RoboPoint: A vision-language model for spatial affordance prediction in robotics," in Conference on Robot Learning, 2025, pp. 4005–4020.
- [47] C. H. Song, V. Blukis, J. Tremblay, S. Tyree, Y. Su, and S. Birchfield, "RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
- [48] Y. Lu, Y. Fan, B. Deng, F. Liu, Y. Li, and S. Wang, "VL-Grasp: A 6-DoF interactive grasp policy for language-oriented objects in cluttered indoor scenes," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2023.
- [49] T. Ma, J. Zheng, Z. Wang, Z. Gao, J. Zhou, and J. Liang, "GLOVER++: Unleashing the potential of affordance learning from human behaviors for robotic manipulation," arXiv preprint arXiv:2505.11865, 2025.
- [50] J. Lee, J. Duan, H. Fang, Y. Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y. R. Wang, S. Lee et al., "MolmoAct: Action reasoning models that can reason in space," arXiv preprint arXiv:2508.07917, 2025.
- [51] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," in Advances in Neural Information Processing Systems, 2023.
- [52] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu et al., "Ego4D: Around the world in 3,000 hours of egocentric video," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- [53] D. Damen, H. Doughty, G. M. Farinella, A. Furnari, E. Kazakos, J. Ma, D. Moltisanti, J. Munro, T. Perrett, W. Price et al., "Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100," International Journal of Computer Vision, 2022.
- [54] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., "Qwen2.5 technical report," arXiv preprint arXiv:2412.15115, 2024.
- [55] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, "Sigmoid loss for language image pre-training," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
- [56] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., "DINOv2: Learning robust visual features without supervision," arXiv preprint arXiv:2304.07193, 2023.
- [57] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
- [58] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, "VGGT: Visual geometry grounded transformer," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5294–5306.
- [59] P. Tong, E. Brown, P. Wu, S. Woo, A. J. V. Iyer, S. C. Akula, S. Yang, J. Yang, M. Middepogu, Z. Wang et al., "Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs," in Advances in Neural Information Processing Systems, 2024.
- [60] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations, 2019.
- [61] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations, 2022.
- [62] J. Cen, C. Yu, H. Yuan, Y. Jiang, S. Huang, J. Guo, X. Li, Y. Song, H. Luo, F. Wang et al., "WorldVLA: Towards autoregressive action world model," arXiv preprint arXiv:2506.21539, 2025.
- [63] K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, "FAST: Efficient action tokenization for vision-language-action models," in Robotics: Science and Systems, 2025.
- [64] P. Wu, Y. Shentu, Z. Yi, X. Lin, and P. Abbeel, "GELLO: A general, low-cost, and intuitive teleoperation framework for robot manipulators," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024, pp. 12156–12163.
- [65] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson et al., "SAM 2: Segment anything in images and videos," in International Conference on Learning Representations, 2025.