Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning
Pith reviewed 2026-05-21 13:58 UTC · model grok-4.3
The pith
Models bootstrap action-predictive embodied reasoning by treating it as a latent variable in variational inference to distill refined strategies without external supervision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
R&B-EnCoRe enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. Validation across manipulation, legged navigation, and autonomous driving shows substantial gains over baselines that reason about all primitives.
What carries the argument
Reasoning is treated as a latent variable in importance-weighted variational inference, which allows strategies to be selected and distilled based on downstream action success.
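To make the selection step concrete, here is a minimal sketch of self-normalized importance weighting over candidate reasoning traces. It assumes each candidate comes with three log-probabilities (action likelihood given the reasoning, prior, and proposal). The function names, example traces, and all numbers are hypothetical, and this is not the paper's implementation.

```python
import math
import random


def normalized_importance_weights(log_p_action, log_prior, log_proposal):
    """Self-normalized importance weights for K candidate reasoning traces.

    Assumed form (illustrative only):
        log w_k = log p(a | o, z_k) + log p(z_k | o) - log q(z_k | o, a)
    """
    log_w = [la + lp - lq
             for la, lp, lq in zip(log_p_action, log_prior, log_proposal)]
    # Subtract the max before exponentiating for numerical stability.
    m = max(log_w)
    unnorm = [math.exp(lw - m) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def effective_sample_size(weights):
    """1 / sum(w^2): close to 1 when a single trace dominates, K when uniform."""
    return 1.0 / sum(w * w for w in weights)


def resample_traces(traces, weights, n, seed=0):
    """Draw n traces with probability proportional to their weights."""
    rng = random.Random(seed)
    return rng.choices(traces, weights=weights, k=n)


if __name__ == "__main__":
    # Hypothetical candidates and log-probabilities, for illustration only.
    traces = ["list every object", "plan, then grasp", "describe gripper pose"]
    weights = normalized_importance_weights(
        log_p_action=[-4.0, -1.0, -2.0],
        log_prior=[0.0, 0.0, 0.0],
        log_proposal=[0.0, 0.0, 0.0],
    )
    print(weights, effective_sample_size(weights))
    print(resample_traces(traces, weights, n=5))
```

Under these assumptions, traces that make the observed action more likely receive more weight and are resampled more often, which is one way a distilled dataset could come to favor action-predictive reasoning.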
If this is right
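- Manipulation success improves by 28% over models that reason indiscriminately about all available primitives.
- Navigation scores improve by 101% over the same baselines.
- The collision-rate metric drops by 21% relative to the same baselines.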
- Works across different VLA architectures from 1B to 30B parameters and multiple embodiments.
- Bypasses the need for manual template engineering and external supervision signals.
Where Pith is reading between the lines
- The approach could support lifelong learning where robots refine their reasoning from continued physical interactions.
- Similar latent variable techniques might address alignment between reasoning and outcomes in non-embodied AI systems.
- One could test the method on tasks with longer time horizons to see if the benefits persist.
- It may connect to problems in efficient exploration where reasoning guides better data collection.
Load-bearing premise
That judging reasoning quality solely by whether the actions it leads to succeed is enough to produce useful embodiment-specific strategies.
What would settle it
Running the method on a held-out set of tasks and finding that the distilled reasoning does not lead to higher success rates than using unfiltered reasoning primitives.
Original abstract
Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces R&B-EnCoRe, a self-supervised bootstrapping method for action-predictive embodied reasoning in Vision-Language-Action models. Reasoning is modeled as a latent variable inside an importance-weighted variational inference framework initialized from internet-scale knowledge; this is used to generate and distill a refined training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. The method is evaluated on manipulation (Franka Panda simulation, WidowX hardware), legged navigation (bipedal/wheeled/bicycle/quadruped), and autonomous driving across VLA architectures of 1B–30B parameters, reporting 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision rate relative to baselines that reason indiscriminately over all primitives.
Significance. If the central mechanism is sound, the work would be significant for embodied AI: it offers a route to ground large-scale pretrained knowledge in physical control without manual template engineering or external supervision, while validating across diverse embodiments and model scales. The multi-platform experimental design and scale of reported gains are strengths that would support broader adoption if alternative explanations for the improvements can be ruled out.
major comments (2)
- [§3.2, Eq. (3)] The importance-weighted variational objective defines weights directly from downstream policy success; this makes the central claim that the procedure surfaces causally effective reasoning strategies load-bearing on an assumption that has not been isolated from dataset-filtering or co-occurrence effects. An ablation that replaces the success-derived weights with uniform or random weights while keeping the distillation pipeline fixed would be required to establish that importance-weighted variational inference (IWVI) is the operative mechanism.
- [§4.2, Table 3] (navigation rows): The 101% relative improvement is reported without per-seed standard deviations or statistical significance tests; given the stochastic nature of both policy rollouts and the variational sampling, it is unclear whether the magnitude is robust or could be explained by variance in the baseline runs.
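The ablation requested in the first major comment could be organized roughly as follows. This is a sketch under stated assumptions: a binary per-trace success flag stands in for the paper's success-derived weights, and the names `make_weights` and `build_distillation_set` are hypothetical.

```python
import random


def make_weights(successes, scheme, rng):
    """Return normalized sampling weights under one ablation scheme.

    successes: list of 0/1 flags, one per reasoning trace (an assumed
               stand-in for the paper's success-derived weights).
    scheme:    "success", "uniform", or "random".
    """
    n = len(successes)
    if scheme == "success":
        raw = [float(s) for s in successes]
    elif scheme == "uniform":
        raw = [1.0] * n
    elif scheme == "random":
        raw = [rng.random() for _ in range(n)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw)
    if total == 0.0:
        # No trace succeeded: fall back to uniform rather than divide by zero.
        return [1.0 / n] * n
    return [r / total for r in raw]


def build_distillation_set(traces, weights, size, rng):
    """Sample the distillation set; the rest of the pipeline stays fixed."""
    return rng.choices(traces, weights=weights, k=size)


if __name__ == "__main__":
    rng = random.Random(0)
    traces = ["t0", "t1", "t2", "t3"]   # hypothetical trace identifiers
    successes = [1, 0, 1, 0]            # hypothetical rollout outcomes
    for scheme in ("success", "uniform", "random"):
        w = make_weights(successes, scheme, rng)
        print(scheme, build_distillation_set(traces, w, 6, rng))
```

Only the weighting scheme changes between the three arms, so any performance gap between them could be attributed to the weights rather than to the distillation pipeline itself.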
minor comments (2)
- [§3.1] Notation for the variational posterior q(·) and the importance weight w(·) is introduced without an explicit statement of whether they are reparameterized or whether the bound is optimized jointly with the policy parameters; a short clarifying paragraph would improve reproducibility.
- [Figure 4] Figure 4 caption does not specify the exact number of reasoning samples drawn per trajectory during distillation; this detail affects interpretation of the reported efficiency gains.
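For readers following the notation question in the first minor comment, the generic importance-weighted bound is written out below. The paper's Eq. (3) is not reproduced on this page, so this is the standard textbook form rather than the authors' objective; the symbols $o$ (observation), $a$ (action), $z$ (reasoning trace), $\theta$, and $K$ are assumptions chosen for illustration.

```latex
\log p_\theta(a \mid o)
  \;\ge\;
  \mathbb{E}_{z_1,\dots,z_K \sim q(\cdot \mid o, a)}
  \left[ \log \frac{1}{K} \sum_{k=1}^{K} w(z_k) \right],
\qquad
w(z_k) = \frac{p_\theta(a \mid o, z_k)\, p_\theta(z_k \mid o)}{q(z_k \mid o, a)}
```

With $K = 1$ this reduces to the usual evidence lower bound, and the bound tightens as $K$ grows. Whether $q$ is reparameterized or optimized jointly with the policy is exactly what the comment asks the authors to state.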
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below and have revised the manuscript accordingly to strengthen the empirical support for our claims.
- Referee: [§3.2, Eq. (3)] The importance-weighted variational objective defines weights directly from downstream policy success; this makes the central claim that the procedure surfaces causally effective reasoning strategies load-bearing on an assumption that has not been isolated from dataset-filtering or co-occurrence effects. An ablation that replaces the success-derived weights with uniform or random weights while keeping the distillation pipeline fixed would be required to establish that IWVI is the operative mechanism.
  Authors: We agree that isolating the contribution of the importance weights is necessary to substantiate that the IWVI mechanism, rather than generic filtering or co-occurrence, drives the selection of causally effective reasoning. The success-derived weights are computed from policy execution outcomes on the generated traces, which is integral to the self-supervised bootstrapping. To directly address this concern, we will add the requested ablation in the revised manuscript: we will rerun the distillation pipeline with uniform weights and with randomly sampled weights (while preserving the rest of the architecture and data generation) and report the resulting performance on the manipulation and navigation benchmarks.
  Revision: yes
- Referee: [§4.2, Table 3] (navigation rows): The 101% relative improvement is reported without per-seed standard deviations or statistical significance tests; given the stochastic nature of both policy rollouts and the variational sampling, it is unclear whether the magnitude is robust or could be explained by variance in the baseline runs.
  Authors: We acknowledge that the absence of per-seed variability measures and formal statistical tests leaves the robustness of the 101% navigation improvement open to question, especially given stochasticity in rollouts and sampling. We have conducted additional experimental runs across multiple random seeds for the navigation tasks. In the revised manuscript we will update Table 3 to report mean performance with per-seed standard deviations and will include paired t-test results (with p-values) comparing R&B-EnCoRe against the indiscriminate-reasoning baselines.
  Revision: yes
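The per-seed reporting and paired t-test promised in the second response could be computed along these lines. This is a standard-library sketch with hypothetical function names and made-up scores; it returns the t statistic and degrees of freedom only. A p-value needs the Student t distribution, which SciPy's `scipy.stats.ttest_rel` provides if that library is available.

```python
import math
import statistics


def mean_and_std(scores):
    """Per-seed mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic for scores matched by random seed.

    Returns (t, degrees_of_freedom).
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("scores must be paired one-to-one by seed")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two seeds")
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Made-up navigation scores for five seeds, for illustration only.
    method = [0.62, 0.58, 0.65, 0.60, 0.61]
    baseline = [0.30, 0.31, 0.29, 0.33, 0.28]
    print(mean_and_std(method), mean_and_std(baseline))
    print(paired_t_statistic(method, baseline))
```

Pairing by seed removes variance shared between the two runs, which is why a paired test suits a comparison where the method and the baseline are evaluated under the same seeds.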
Circularity Check
No significant circularity in derivation chain
The paper presents a self-supervised method using importance-weighted variational inference to treat reasoning as a latent variable for distilling embodiment-specific strategies from internet-scale knowledge. No equations, self-citations, or load-bearing steps are visible in the provided text that reduce the central claim (refined reasoning predictive of control success) to a tautological fit or redefinition of the input success metric itself. The approach is framed as bypassing external verifiers by grounding in physical execution, with the variational objective providing independent structure rather than circular attribution. This is the most common honest finding for papers whose core mechanism remains self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We introduce R&B-EnCoRe... across manipulation... legged navigation... autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[2]
Vision-language-action models for robotics: A review towards real-world applications
Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025
work page 2025
-
[3]
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A sur- vey on vision-language-action models: An action tok- enization perspective.arXiv preprint arXiv:2507.01925, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Vision-language models for vision tasks: A survey
Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE transactions on pattern analysis and machine intelligence, 46(8):5625–5644, 2024
work page 2024
-
[5]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In Pulkit Agrawal, Oliver Kroemer, and W...
work page 2025
-
[6]
Minivla: A better vla with a smaller footprint, 2024
Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024. URL https://github. com/Stanford-ILIAD/openvla-mini
work page 2024
-
[7]
InProceedings of Robotics: Science and Systems, LosAngeles, CA, USA, June 2025
Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, Laura Smith, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury...
-
[8]
Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, brian ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren,...
-
[9]
PMLR, 27–30 Sep 2025
work page 2025
-
[10]
$\pi^{*}_{0.6}$: a VLA That Learns From Experience
Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared Di- Carlo, et al.π 0.6: a vla that learns from experience. arXiv preprint arXiv:2511.14759, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
RT-H: Action Hierarchies using Language.Proceedings of Robotics: Science and Systems, July 2024
Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Ser- manet, Quan Vuong, Jonathan Tompson, Yevgen Cheb- otar, Debidatta Dwibedi, and Dorsa Sadigh. RT-H: Action Hierarchies using Language.Proceedings of Robotics: Science and Systems, July 2024. doi: 10. 15607/RSS.2024.XX.049
work page 2024
-
[12]
Robotic control via embodied chain-of-thought reasoning
Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Bur- gard, editors,Proceedings of The 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learning Research, pages 3157–3181. PMLR, 06–09 Nov 2025
work page 2025
-
[13]
Training strategies for efficient embodied rea- soning
William Chen, Suneel Belkhale, Suvir Mirchandani, Karl Pertsch, Danny Driess, Oier Mees, and Sergey Levine. Training strategies for efficient embodied rea- soning. In Joseph Lim, Shuran Song, and Hae-Won Park, editors,Proceedings of The 9th Conference on Robot Learning, volume 305 ofProceedings of Machine Learning Research, pages 365–391. PMLR, 27–30 Sep 2025
work page 2025
-
[14]
Gemini Robotics Team, Abbas Abdolmaleki, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Ashwin Balakrishna, Nathan Batchelor, Alex Bewley, Jeff Bingham, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer.arXiv preprint arXiv:2510.03342, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos- reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi
Matt Deitke, Christopher Clark, Sangho Lee, Ro- hun Tripathi, Yue Yang, Jae Sung Park, Moham- madreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, ...
work page 2025
-
[18]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Robovqa: Multimodal long-horizon reasoning for robotics
Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Chris- tine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024
work page 2024
-
[20]
Haowen Liu, Shaoxiong Yao, Haonan Chen, Jiawei Gao, Jiayuan Mao, Jia-Bin Huang, and Yilun Du. Sim- pact: Simulation-enabled action planning using vision- language models.arXiv preprint arXiv:2512.05955, 2025
-
[21]
Evovla: Self-evolving vision-language-action model
Zeting Liu, Zida Yang, Zeyu Zhang, and Hao Tang. Evovla: Self-evolving vision-language-action model. arXiv preprint arXiv:2511.16166, 2025
-
[22]
Qi Sun, Pengfei Hong, Tej Deep Pala, Vernon Toh, U-Xuan Tan, Deepanway Ghosal, and Soujanya Poria. Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial rea- soning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14199–14214, 2025
work page 2025
-
[23]
Argus: Vision-centric reasoning with grounded chain-of-thought
Yunze Man, De-An Huang, Guilin Liu, Shiwei Sheng, Shilong Liu, Liang-Yan Gui, Jan Kautz, Yu-Xiong Wang, and Zhiding Yu. Argus: Vision-centric reasoning with grounded chain-of-thought. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14268–14280, 2025
work page 2025
-
[24]
Igniting vlms toward the embodied space.arXiv preprint arXiv:2509.11766, 2025
Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space.arXiv preprint arXiv:2509.11766, 2025
-
[25]
Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, and Ishita Dasgupta. Distilling internet-scale vision-language models into embodied agents.arXiv preprint arXiv:2301.12507, 2023
-
[26]
Chatvla: Unified mul- timodal understanding and robot control with vision- language-action model
Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified mul- timodal understanding and robot control with vision- language-action model. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025
work page 2025
-
[27]
PaliGemma 2: A Family of Versatile VLMs for Transfer
Andreas Steiner, Andr ´e Susano Pinto, Michael Tschan- nen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sher- bondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer.arXiv preprint arXiv:2412.03555, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic un- derstanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Pris- matic vlms: Investigating the design space of visually- conditioned language models
Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Pris- matic vlms: Investigating the design space of visually- conditioned language models. InForty-first Interna- tional Conference on Machine Learning, 2024
work page 2024
-
[30]
Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collabo- ration 0
Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collabo- ration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024
work page 2024
-
[31]
Homer Rich Walke, Kevin Black, Tony Z. Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, Abraham Lee, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Jie Tan, Marc Toussaint, and Kourosh Darvish, editors,Proceedings of The 7th Conference on Robo...
work page 2023
-
[32]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- win Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
nuscenes: A multimodal dataset for autonomous driv- ing
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krish- nan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driv- ing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621– 11631, 2020
work page 2020
-
[34]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient Action Tokenization for Vision-Language-Action Models. In Proceedings of Robotics: Science and Systems, LosAn- geles, CA, USA, June 2025. doi: 10.15607/RSS.2025. XXI.012
-
[35]
Vq-vla: Improving vision- language-action models via scaling vector-quantized action tokenizers
Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision- language-action models via scaling vector-quantized action tokenizers. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025
work page 2025
-
[36]
Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. Foundation models in robotics: Applications, challenges, and the future.The International Journal of Robotics Research, 44(5):701–739, 2025
work page 2025
-
[37]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jack- son, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[38]
Rt-2: Vision-language- action models transfer web knowledge to robotic con- trol
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language- action models transfer web knowledge to robotic con- trol. InConference on Robot Learning, pages 2165–
-
[39]
Octo: An open-source generalist robot policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and...
work page 2024
-
[40]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
NaVILA: Legged Robot Vision-Language-Action Model for Navigation
An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged Robot Vision-Language-Action Model for Navigation. InProceedings of Robotics: Science and Systems, LosAngeles, CA, USA, June 2025. doi: 10.15607/RSS. 2025.XXI.018
-
[42]
Quar-vla: Vision-language-action model for quadruped robots
Pengxiang Ding, Han Zhao, Wenjie Zhang, Wenxuan Song, Min Zhang, Siteng Huang, Ningxi Yang, and Donglin Wang. Quar-vla: Vision-language-action model for quadruped robots. InEuropean Conference on Computer Vision, pages 352–367. Springer, 2024
work page 2024
-
[43]
Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, et al. Humanoid-vla: Towards universal humanoid control with visual inte- gration.arXiv preprint arXiv:2502.14795, 2025
-
[44]
Haoran Jiang, Jin Chen, Qingwen Bu, Li Chen, Modi Shi, Yanjie Zhang, Delong Li, Chuanzhe Suo, Chuang Wang, Zhihui Peng, et al. Wholebodyvla: Towards unified latent vla for whole-body loco-manipulation control.arXiv preprint arXiv:2512.11047, 2025
-
[45]
Opendrivevla: Towards end-to-end au- tonomous driving with large vision language action model
Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision lan- guage action model, 2025. URL https://arxiv.org/abs/ 2503.23463
-
[46]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022
work page 2022
-
[47]
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022
work page 2022
-
[48]
William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought.arXiv preprint arXiv:2310.07923, 2023
-
[49]
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems, 2024
Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inher- ently serial problems.arXiv preprint arXiv:2402.12875, 1, 2024
-
[50]
Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective.Ad- vances in Neural Information Processing Systems, 36: 70757–70798, 2023
work page 2023
-
[51]
Xuezhi Wang and Denny Zhou. Chain-of-thought reasoning without prompting.Advances in Neural In- formation Processing Systems, 37:66383–66409, 2024
work page 2024
-
[52]
A survey on large language models for mathematical reasoning
Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu- Hui Liu, Xinwei Chen, Jiacheng Xu, et al. A survey on large language models for mathematical reasoning. ACM Computing Surveys, 2025
work page 2025
-
[53]
Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, et al. Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in llms.arXiv preprint arXiv:2502.19411, 2025
-
[54]
Llava-cot: Let vision language models reason step-by-step
Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087–2098, 2025
work page 2087
-
[55]
Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, et al. When visualizing is the first step to reasoning: Mira, a benchmark for visual chain-of-thought.arXiv preprint arXiv:2511.02779, 2025
-
[56]
Marah Abdin, Jyoti Aneja, Harkirat Behl, S ´ebastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Har- rison, Russell J Hewett, Mojan Javaheripi, Piero Kauff- mann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
Textbooks Are All You Need II: phi-1.5 technical report
Yuanzhi Li, S ´ebastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Text- books are all you need ii: phi-1.5 technical report.arXiv preprint arXiv:2309.05463, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qing- wei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct.arXiv preprint arXiv:2306.08568, 2023
-
[59]
Magicoder: Empowering code generation with oss-instruct.arXiv preprint arXiv:2312.02120, 2023
Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with oss-instruct.arXiv preprint arXiv:2312.02120, 2023
-
[60]
Tinygsm: achieving ¿80% on gsm8k with small language models
Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Ja- nardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. Tinygsm: achieving>80% on gsm8k with small language models.arXiv preprint arXiv:2312.09241, 2023
-
[61]
WizardLM: Empowering large pre-trained language models to follow complex instructions
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language mod- els to follow complex instructions.arXiv preprint arXiv:2304.12244, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[62]
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024
-
[63]
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023
-
[64]
Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024
-
[65]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
-
[66]
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016
-
[67]
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017
-
[68]
Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, et al. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282, 2024
-
[69]
Training chain-of-thought via latent-variable inference
Matthew Douglas Hoffman, Du Phan, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, and Rif A Saurous. Training chain-of-thought via latent-variable inference. In NeurIPS, 2023
-
[70]
Amortizing intractable inference in large language models
Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, and Nikolay Malkin. Amortizing intractable inference in large language models. In The Twelfth International Conference on Learning Representations, 2024
-
[71]
Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, et al. Brite: Bootstrapping reinforced thinking process to enhance language model reasoning. arXiv preprint arXiv:2501.18858, 2025
-
[72]
Beyond human data: Scaling self-training for problem-solving with language models
Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron T Parisi, Abhishek Kumar, Alexander A Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Fathy Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey P...
-
[73]
Yangjun Ruan, Neil Band, Chris J Maddison, and Tatsunori Hashimoto. Reasoning to learn from latent thoughts. arXiv preprint arXiv:2503.18866, 2025
-
[74]
Skill induction and planning with latent language
Pratyusha Sharma, Antonio Torralba, and Jacob Andreas. Skill induction and planning with latent language. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1713–1726, 2022
-
[75]
Yilin Wu, Anqi Li, Tucker Hermans, Fabio Ramos, Andrea Bajcsy, and Claudia Pérez-D'Arpino. Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification. arXiv preprint arXiv:2510.16281, 2025
-
[77]
MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025
-
[78]
Minjie Zhu, Yichen Zhu, Jinming Li, Zhongyi Zhou, Junjie Wen, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, and Feifei Feng. Objectvla: End-to-end open-world object manipulation without demonstration. arXiv preprint arXiv:2502.19250, 2025
-
[79]
Cot-vla: Visual chain-of-thought reasoning for vision-language-action models
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025
-
[80]
SpatialVLM: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024
-
[81]
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024