Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

Clark Barrett; Jonas Frey; Katie Luo; Marco Pavone; Milan Ganai

arXiv: 2602.08167 · v2 · pith:OZH3HQDAnew · submitted 2026-02-09 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

Milan Ganai, Katie Luo, Jonas Frey, Clark Barrett, Marco Pavone

Pith reviewed 2026-05-21 13:58 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG

keywords: self-supervised, embodied reasoning, chain-of-thought, vision-language-action, variational inference, robot manipulation, navigation

The pith

Models bootstrap action-predictive embodied reasoning by treating it as a latent variable in variational inference to distill refined strategies without external supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that rigid templates for embodied chain-of-thought reasoning force models to process irrelevant information, and that this feeds a chicken-and-egg bottleneck in building robust vision-language-action policies: reasoning quality cannot be verified without a successful policy, and a robust policy cannot be built without quality reasoning. By modeling reasoning as a latent variable in importance-weighted variational inference, the proposed method generates and distills a training dataset of embodiment-specific strategies that are predictive of successful actions. It does so without external rewards, verifiers, or human annotations, starting from internet-scale knowledge and refining it through action outcomes. A sympathetic reader would care because this offers a way to scale embodied reasoning so that it better matches physical execution, with reported gains in manipulation, navigation, and driving.

Core claim

R&B-EnCoRe enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. Validation across manipulation, legged navigation, and autonomous driving shows substantial gains over baselines that reason about all primitives.
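For orientation, the standard importance-weighted (IWAE-style) lower bound that objectives of this kind are usually built on can be written as below. The notation follows the paper's graphical model (task context C, reasoning Z, action A); the symbols $p_\theta$, $q_\phi$, and the sample count $K$ as used here are assumptions, and the paper's exact objective may differ.

```latex
\log p_\theta(a \mid c)
\;\ge\;
\mathbb{E}_{z_1,\dots,z_K \sim q_\phi(z \mid c, a)}
\left[ \log \frac{1}{K} \sum_{k=1}^{K} w_k \right],
\qquad
w_k = \frac{p_\theta(z_k \mid c)\, p_\theta(a \mid c, z_k)}{q_\phi(z_k \mid c, a)}.
```

Each weight $w_k$ is large when a sampled reasoning trace is both plausible under the prior and makes the observed action likely, which is the sense in which reasoning is scored by its action-predictive value.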

What carries the argument

The treatment of reasoning as a latent variable in importance-weighted variational inference that allows selection and distillation of strategies based on downstream action success.
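A minimal sketch of how such a selection step could look, assuming per-trace log-probabilities from a prior, an action likelihood, and a posterior are already available. All function names and numeric values are hypothetical illustrations, not the paper's implementation.

```python
import math
import random


def normalized_importance_weights(log_prior, log_action_lik, log_posterior):
    """Self-normalized weights w_k ∝ p(z_k|c) p(a|c,z_k) / q(z_k|c,a), in log space."""
    log_w = [lp + la - lq for lp, la, lq in zip(log_prior, log_action_lik, log_posterior)]
    m = max(log_w)  # subtract the max before exponentiating (log-sum-exp trick)
    unnorm = [math.exp(lw - m) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def resample_trace(traces, weights, rng):
    """Categorical resampling: draw one trace with probability equal to its weight."""
    return rng.choices(traces, weights=weights, k=1)[0]


# Hypothetical log-probabilities for K = 3 candidate reasoning traces.
traces = ["plan + visible objects", "plan only", "all primitives"]
log_prior = [-4.0, -3.0, -6.0]        # log p(z_k | c)
log_action_lik = [-1.0, -0.5, -2.5]   # log p(a | c, z_k)
log_posterior = [-3.5, -3.0, -4.0]    # log q(z_k | c, a)

weights = normalized_importance_weights(log_prior, log_action_lik, log_posterior)
chosen = resample_trace(traces, weights, random.Random(0))
print([round(w, 3) for w in weights], chosen)
```

Working in log space and subtracting the maximum keeps the weights numerically stable when trace log-probabilities are very negative, which is typical for long token sequences.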

If this is right

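Manipulation success improves by 28% over models that reason indiscriminately about all available primitives.
Navigation scores improve by 101% over the same baseline.
The collision-rate metric drops by 21% over the same baseline.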
Works across different VLA architectures from 1B to 30B parameters and multiple embodiments.
Bypasses the need for manual template engineering and external supervision signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could support lifelong learning where robots refine their reasoning from continued physical interactions.
Similar latent variable techniques might address alignment between reasoning and outcomes in non-embodied AI systems.
One could test the method on tasks with longer time horizons to see if the benefits persist.
It may connect to problems in efficient exploration where reasoning guides better data collection.

Load-bearing premise

That judging reasoning quality solely by whether the actions it leads to succeed is enough to produce useful embodiment-specific strategies.

What would settle it

Running the method on a held-out set of tasks and finding that the distilled reasoning does not lead to higher success rates than using unfiltered reasoning primitives.
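One way to make that held-out comparison concrete is a one-sided two-proportion z-test on success counts. This is a sketch under stated assumptions: the counts below are hypothetical, the function name is illustrative, and the paper does not state which test, if any, it uses.

```python
import math


def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """One-sided z-test for H1: success rate of A exceeds that of B (pooled variance)."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value


# Hypothetical held-out counts, not results from the paper.
# A = policy trained on distilled reasoning, B = policy trained on unfiltered primitives.
z, p = two_proportion_z(successes_a=70, trials_a=100, successes_b=55, trials_b=100)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

If the distilled-reasoning policy showed no reliable advantage under a test like this on held-out tasks, the central claim would be in trouble.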

Figures

Figures reproduced from arXiv: 2602.08167 by Clark Barrett, Jonas Frey, Katie Luo, Marco Pavone, Milan Ganai.

**Figure 1.** We generate diverse embodied reasoning primitives and refine them based on action-prediction information benefit. We bootstrap policy performance by retraining on these self-refined, high-quality reasoning traces, discovering embodiment-specific reasoning distributions that reveal effective strategies, significantly improving VLA task success while producing more efficient CoT traces.

**Figure 2.** Top: Probabilistic Graphical Model relating the Task Context (C), Reasoning (Z), and Action (A). The latent reasoning Z is induced from a set of primitives R (e.g., subtask reasoning, move reasoning). Bottom: An example reasoning trace on the Bridge setup.

**Figure 3.** Overview of R&B-EnCoRe. (a) We generate diverse reasoning primitives (e.g., Plan, Visible Objects) and combine them via dropout to warmstart a model capturing the prior and posterior distributions. (b) We sample candidates from the posterior and apply importance weighting to filter for reasoning that maximizes action prediction power. These refined, high-quality reasoning traces are used to bootstrap the final VLA.

**Figure 4.** This plot shows the reasoning primitive distributions generated when R&B-EnCoRe refines the diverse warmstart reasoning strategy data. In (a), the distribution for manipulation shows differences between reasoning for Franka Panda in simulation versus WidowX hardware in real-world data, notably for the Visible Object, Move Explain, and Subtask Explain reasoning primitives. In (b), we observe that the four…

**Figure 5.** Visible Objects generated in LIBERO-90 by R&B-EnCoRe’s model and a model producing a full list. The latter model attends to task-irrelevant objects like plate and bowl, while our model emits reasoning focused on task-critical objects.

**Figure 7.** Success rates and latency of test-time reasoning on WidowX hardware. R&B-EnCoRe produces performant reasoning VLAs with shorter reasoning traces (so faster inference). Reasoning on all primitives degrades performance for cluttered scenes with OOD objects.

**Figure 8.** Quadruped Navigation Waypoint Trajectories. The quadruped robot must follow the trail while avoiding slippery ice. No Reason navigation VLA ignores terrain hazards and traverses the ice. Reasoning with all primitives is confounded by irrelevant signals; while it has reduced ice contact (perhaps due to affordance reasoning), it fails to follow the path. Random Primitives tracks some of the path but likely d…

**Figure 10.** R&B-EnCoRe prunes uninformative subjective weather reasoning from refined traces (∼36.7%; lower than other primitives).

**Figure 11.** Planned trajectories comparing driving VLAs using reasoning by R&B-EnCoRe’s model and a model producing a full list.

**Figure 12.** Collision Rate scaling with posterior inference. More samples K from the posterior distribution result in an improved action prediction estimate, and ultimately a lower collision rate. Scaling curve fitted with Collision Rate $= (3.65/K)^{1.65} + 0.25$.

**Figure 13.** Reasoning primitive distributions from the raw posterior and prior distributions. Note this is before the reweighting and importance sampling step (minor gaps due to warmstarting sampling noise and potential base model prior bias).

**Figure 14.** Prior and Posterior architecture. The prior architecture is the same as the standard generative VLA that takes as input the task context (scene and task) and outputs textual reasoning followed by action tokens. The posterior architecture takes as input the context and action and outputs only the reasoning tokens.

**Figure 15.** For our R&B-EnCoRe algorithm applied to the Legged navigation embodiments, we perform an ablation study varying the dropout rate parameter d, which affects the initial warmstart reasoning strategy training distribution. We find that 50% dropout provides the best downstream performance. This dropout rate encourages the prior and posterior models to see a diverse set of reasoning strategies with overall minimal…

**Figure 16.** For the Legged Navigation Dataset, we perform an ablation on posterior sampling (from Alg. 2) across 32 different sampling seeds to validate whether the refined reasoning primitive distribution remains consistent. This plot confirms the general consistency of the reasoning primitive frequencies (note the error bars and compare with the result of a single sampling seed in Fig. 4b).

**Figure 17.** NaviTrace scores on the various VLA models with the additional Weather Reasoning primitive. R&B-EnCoRe refines the traces to remove irrelevant Weather reasoning primitive scores as seen in …

**Figure 18.** Visible Object-only reasoning (Section V-A-Q1) in LIBERO-90 across steps in an episode. Notice that the object bounding boxes generated by R&B-EnCoRe’s reasoning model generally attend primarily to task-salient objects, while reasoning with all visible objects attends to all objects (including distracting or irrelevant ones) at every frame.

**Figure 19.** Reasoning Traces with a wider set of primitives (Section V-A-Q2) from the reasoning VLAs for LIBERO-90 across an episode. Notice how R&B-EnCoRe reasons less frequently about visible objects compared with the other two models.

**Figure 20.** Reasoning Traces with a wider set of primitives (Section V-A-Q2) from the reasoning VLAs for LIBERO-90 across an episode. Notice how R&B-EnCoRe reasons less frequently about visible objects compared with the other two models.

**Figure 21.** Reasoning Traces (Section V-B-Q2) from the reasoning VLAs in the Bridgev2 setup on WidowX hardware with Test-Time Reasoning enabled across an episode.

**Figure 22.** Reasoning Traces (Section V-B-Q2) from the reasoning VLAs in the Bridgev2 setup on WidowX hardware with Test-Time Reasoning enabled across an episode.

**Figure 23.** Reasoning Traces for the NaviTrace dataset with the Quadruped embodiment (expanded version of …

**Figure 24.** Reasoning Traces (Section V-D-Q7) from the driving VLAs. We visualize predictions across models on two samples from the nuScenes dataset. Observe that using R&B-EnCoRe improves performance and yields concise reasoning traces that are more informative than not reasoning at all. Reasoning types are colored for visualization purposes.

**Figure 25.** Reasoning Traces (Section V-D-Q7) from the driving VLAs. We visualize results on two more samples from the nuScenes dataset.
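The fitted scaling curve quoted in the Figure 12 caption can be evaluated directly. The sketch below reads the extracted formula as (3.65/K) raised to the power 1.65, plus 0.25; that exponent placement is an assumption, since the source formula is garbled, and the function name is illustrative.

```python
def fitted_collision_rate(k):
    """Fitted curve from the Figure 12 caption, read as (3.65 / K) ** 1.65 + 0.25 (in %)."""
    return (3.65 / k) ** 1.65 + 0.25


# Evaluate over the range of posterior sample counts shown on the plot's K axis.
for k in (5, 10, 20, 35):
    print(k, round(fitted_collision_rate(k), 3))
```

Under this reading the curve decreases monotonically in K and flattens toward a floor of 0.25, so additional posterior samples would yield diminishing returns.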

Abstract

Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

R&B-EnCoRe uses importance-weighted variational inference to bootstrap embodiment-specific reasoning without external verifiers, reporting large gains on manipulation and navigation, but the method details and independent checks on reasoning quality are thin.

The core idea is to treat reasoning traces as latent variables inside an importance-weighted variational setup, so the model can refine its own training data from internet-scale priors and then distill the parts that are useful for action prediction. That sidesteps the usual template engineering and annotation loop, and they show it across Franka simulation, WidowX hardware, multiple legged platforms, and driving with models from 1B to 30B parameters. The reported lifts (28% manipulation success, 101% navigation improvement, 21% collision drop) are large enough to notice, and the fact that they beat the “reason about everything” baseline suggests the filtering step is doing something real. Experiments on real hardware and across embodiment types are a plus; most VLA papers stay in simulation or one robot type. The self-supervised claim is the part that feels freshest, since prior embodied CoT work still leans on hand-crafted primitives or external rewards.

The soft spots are mostly around transparency and verification. The abstract gives no derivation or pseudocode for how the importance weights are computed or kept stable, nor does it say whether the variational objective runs jointly with the policy or in separate stages. Without those pieces it is hard to tell whether the gains come from better latent reasoning or simply from training on a curated subset of trajectories that happen to succeed. The stress-test concern about circularity also lands: if downstream success is the only signal used to up-weight reasoning paths, the method may mostly be doing dataset filtering or regularization rather than surfacing causally correct affordances. An independent check, such as human rating of the distilled traces or a small set of held-out verifier questions, would have helped separate those stories.

This paper is aimed at groups trying to scale VLAs beyond toy tasks without drowning in annotation costs. It has enough empirical breadth and a distinct technical move to deserve referee time, even if the current version needs clearer methods and extra controls before it can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces R&B-EnCoRe, a self-supervised bootstrapping method for action-predictive embodied reasoning in Vision-Language-Action models. Reasoning is modeled as a latent variable inside an importance-weighted variational inference framework initialized from internet-scale knowledge; this is used to generate and distill a refined training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. The method is evaluated on manipulation (Franka Panda simulation, WidowX hardware), legged navigation (bipedal/wheeled/bicycle/quadruped), and autonomous driving across VLA architectures of 1B–30B parameters, reporting 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision rate relative to baselines that reason indiscriminately over all primitives.

Significance. If the central mechanism is sound, the work would be significant for embodied AI: it offers a route to ground large-scale pretrained knowledge in physical control without manual template engineering or external supervision, while validating across diverse embodiments and model scales. The multi-platform experimental design and scale of reported gains are strengths that would support broader adoption if alternative explanations for the improvements can be ruled out.

major comments (2)

[§3.2, Eq. (3)] The importance-weighted variational objective defines weights directly from downstream policy success; this makes the central claim that the procedure surfaces causally effective reasoning strategies load-bearing on an assumption that has not been isolated from dataset-filtering or co-occurrence effects. An ablation that replaces the success-derived weights with uniform or random weights while keeping the distillation pipeline fixed would be required to establish that IWVI is the operative mechanism.
[§4.2, Table 3] (navigation rows) The 101% relative improvement is reported without per-seed standard deviations or statistical significance tests; given the stochastic nature of both policy rollouts and the variational sampling, it is unclear whether the magnitude is robust or could be explained by variance in the baseline runs.
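The weight-replacement ablation requested in the first major comment could be organized along these lines. This is a sketch only: the three arm names, the candidate traces, and the weight values are hypothetical, and the downstream distillation step is left out because the source does not describe it.

```python
import random


def make_weights(scheme, learned_weights, rng):
    """Return normalized weights over K candidate traces for one ablation arm."""
    k = len(learned_weights)
    if scheme == "learned":
        raw = list(learned_weights)
    elif scheme == "uniform":
        raw = [1.0] * k
    elif scheme == "random":
        raw = [rng.random() + 1e-12 for _ in range(k)]  # avoid an all-zero draw
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw)
    return [r / total for r in raw]


def select_traces(traces, weights, n, rng):
    """Resample n traces for distillation; everything downstream stays fixed."""
    return rng.choices(traces, weights=weights, k=n)


traces = ["z1", "z2", "z3", "z4"]      # hypothetical candidate traces
learned = [0.70, 0.20, 0.08, 0.02]     # hypothetical importance weights
for scheme in ("learned", "uniform", "random"):
    rng = random.Random(0)             # same seed for every arm
    w = make_weights(scheme, learned, rng)
    print(scheme, select_traces(traces, w, n=8, rng=rng))
```

Holding the candidate pool, seed, and sample count fixed across arms means any performance gap can be attributed to the weighting scheme rather than to the data or the distillation pipeline.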

minor comments (2)

[§3.1] Notation for the variational posterior q(·) and the importance weight w(·) is introduced without an explicit statement of whether they are reparameterized or whether the bound is optimized jointly with the policy parameters; a short clarifying paragraph would improve reproducibility.
[Figure 4] Figure 4 caption does not specify the exact number of reasoning samples drawn per trajectory during distillation; this detail affects interpretation of the reported efficiency gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below and have revised the manuscript accordingly to strengthen the empirical support for our claims.

Referee: [§3.2, Eq. (3)] The importance-weighted variational objective defines weights directly from downstream policy success; this makes the central claim that the procedure surfaces causally effective reasoning strategies load-bearing on an assumption that has not been isolated from dataset-filtering or co-occurrence effects. An ablation that replaces the success-derived weights with uniform or random weights while keeping the distillation pipeline fixed would be required to establish that IWVI is the operative mechanism.

Authors: We agree that isolating the contribution of the importance weights is necessary to substantiate that the IWVI mechanism, rather than generic filtering or co-occurrence, drives the selection of causally effective reasoning. The success-derived weights are computed from policy execution outcomes on the generated traces, which is integral to the self-supervised bootstrapping. To directly address this concern, we will add the requested ablation in the revised manuscript: we will rerun the distillation pipeline with uniform weights and with randomly sampled weights (while preserving the rest of the architecture and data generation) and report the resulting performance on the manipulation and navigation benchmarks.

Revision: yes

Referee: [§4.2, Table 3] (navigation rows) The 101% relative improvement is reported without per-seed standard deviations or statistical significance tests; given the stochastic nature of both policy rollouts and the variational sampling, it is unclear whether the magnitude is robust or could be explained by variance in the baseline runs.

Authors: We acknowledge that the absence of per-seed variability measures and formal statistical tests leaves the robustness of the 101% navigation improvement open to question, especially given stochasticity in rollouts and sampling. We have conducted additional experimental runs across multiple random seeds for the navigation tasks. In the revised manuscript we will update Table 3 to report mean performance with per-seed standard deviations and will include paired t-test results (with p-values) comparing R&B-EnCoRe against the indiscriminate-reasoning baselines.

Revision: yes
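The per-seed reporting and paired test promised in the second response could be computed as sketched below. The scores are hypothetical placeholders, not values from the paper, and the helper name is illustrative. The sketch stops at the t statistic; obtaining a p-value requires a Student-t distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-seed score differences."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least two seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Hypothetical per-seed navigation scores, not values from the paper.
method = [0.62, 0.58, 0.65, 0.60, 0.63]
baseline = [0.31, 0.33, 0.30, 0.29, 0.32]
t_stat, dof = paired_t(method, baseline)
print(f"method: {statistics.mean(method):.3f} +/- {statistics.stdev(method):.3f}")
print(f"baseline: {statistics.mean(baseline):.3f} +/- {statistics.stdev(baseline):.3f}")
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

Pairing by seed is appropriate only if both methods are evaluated under the same seeds and rollouts; otherwise an unpaired test would be the safer choice.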

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents a self-supervised method using importance-weighted variational inference to treat reasoning as a latent variable for distilling embodiment-specific strategies from internet-scale knowledge. No equations, self-citations, or load-bearing steps are visible in the provided text that reduce the central claim (refined reasoning predictive of control success) to a tautological fit or redefinition of the input success metric itself. The approach is framed as bypassing external verifiers by grounding in physical execution, with the variational objective providing independent structure rather than circular attribution. This is the most common honest finding for papers whose core mechanism remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields insufficient detail to enumerate specific free parameters, axioms, or invented entities. The method description invokes importance-weighted variational inference and latent reasoning variables, but no explicit fitting procedure, background assumptions, or new postulated entities are stated.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear

We introduce R&B-EnCoRe... across manipulation... legged navigation... autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

167 extracted references · 167 canonical work pages · 27 internal anchors

[1] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021
[2] Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025
[3] Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925, 2025
[4] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5625–5644, 2024
[5] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In Pulkit Agrawal, Oliver Kroemer, and W...
[6] Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024. URL https://github.com/Stanford-ILIAD/openvla-mini
[7] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, Laura Smith, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury... In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. doi: 10.15607/rss.2025.xxi.010
[8] Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren,...
[9] PMLR, 27–30 Sep 2025
[10] Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al. $\pi^{*}_{0.6}$: a VLA that learns from experience. arXiv preprint arXiv:2511.14759, 2025
[11] Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. RT-H: Action Hierarchies using Language. Proceedings of Robotics: Science and Systems, July 2024. doi: 10.15607/RSS.2024.XX.049
[12] Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Burgard, editors, Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 3157–3181. PMLR, 06–09 Nov 2025
[13] William Chen, Suneel Belkhale, Suvir Mirchandani, Karl Pertsch, Danny Driess, Oier Mees, and Sergey Levine. Training strategies for efficient embodied reasoning. In Joseph Lim, Shuran Song, and Hae-Won Park, editors, Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 365–391. PMLR, 27–30 Sep 2025
[14] Gemini Robotics Team, Abbas Abdolmaleki, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Ashwin Balakrishna, Nathan Batchelor, Alex Bewley, Jeff Bingham, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342, 2025
[15] Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025
[16] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
[17] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, ... Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi
[18] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... Qwen3-VL Technical Report. arXiv, 2025
[19]

Robovqa: Multimodal long-horizon reasoning for robotics

Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024

[20]

Simpact: Simulation-enabled action planning using vision-language models

Haowen Liu, Shaoxiong Yao, Haonan Chen, Jiawei Gao, Jiayuan Mao, Jia-Bin Huang, and Yilun Du. Simpact: Simulation-enabled action planning using vision-language models. arXiv preprint arXiv:2512.05955, 2025

[21]

Evovla: Self-evolving vision-language-action model

Zeting Liu, Zida Yang, Zeyu Zhang, and Hao Tang. Evovla: Self-evolving vision-language-action model. arXiv preprint arXiv:2511.16166, 2025

[22]

Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning

Qi Sun, Pengfei Hong, Tej Deep Pala, Vernon Toh, U-Xuan Tan, Deepanway Ghosal, and Soujanya Poria. Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14199–14214, 2025

[23]

Argus: Vision-centric reasoning with grounded chain-of-thought

Yunze Man, De-An Huang, Guilin Liu, Shiwei Sheng, Shilong Liu, Liang-Yan Gui, Jan Kautz, Yu-Xiong Wang, and Zhiding Yu. Argus: Vision-centric reasoning with grounded chain-of-thought. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14268–14280, 2025

[24]

Igniting vlms toward the embodied space

Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766, 2025

[25]

Distilling internet-scale vision-language models into embodied agents

Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, and Ishita Dasgupta. Distilling internet-scale vision-language models into embodied agents. arXiv preprint arXiv:2301.12507, 2023

[26]

Chatvla: Unified multimodal understanding and robot control with vision-language-action model

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025

[27]

PaliGemma 2: A Family of Versatile VLMs for Transfer

Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555, 2024

[28]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

[29]

Prismatic vlms: Investigating the design space of visually-conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. In Forty-first International Conference on Machine Learning, 2024

[30]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

[31]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z. Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, Abraham Lee, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Jie Tan, Marc Toussaint, and Kourosh Darvish, editors, Proceedings of The 7th Conference on Robo...

work page 2023
[32]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

[33]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

[34]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient Action Tokenization for Vision-Language-Action Models. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. doi: 10.15607/RSS.2025.XXI.012

[35]

Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers

Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

[36]

Foundation models in robotics: Applications, challenges, and the future

Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. Foundation models in robotics: Applications, challenges, and the future. The International Journal of Robotics Research, 44(5):701–739, 2025

[37]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[38]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–

[39]

Octo: An open-source generalist robot policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and...

work page 2024
[40]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

[41]

NaVILA: Legged Robot Vision-Language-Action Model for Navigation

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged Robot Vision-Language-Action Model for Navigation. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. doi: 10.15607/RSS.2025.XXI.018

[42]

Quar-vla: Vision-language-action model for quadruped robots

Pengxiang Ding, Han Zhao, Wenjie Zhang, Wenxuan Song, Min Zhang, Siteng Huang, Ningxi Yang, and Donglin Wang. Quar-vla: Vision-language-action model for quadruped robots. In European Conference on Computer Vision, pages 352–367. Springer, 2024

[43]

Humanoid-vla: Towards universal humanoid control with visual integration

Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration. arXiv preprint arXiv:2502.14795, 2025

[44]

Wholebodyvla: Towards unified latent vla for whole-body loco-manipulation control

Haoran Jiang, Jin Chen, Qingwen Bu, Li Chen, Modi Shi, Yanjie Zhang, Delong Li, Chuanzhe Suo, Chuang Wang, Zhihui Peng, et al. Wholebodyvla: Towards unified latent vla for whole-body loco-manipulation control. arXiv preprint arXiv:2512.11047, 2025

[45]

Opendrivevla: Towards end-to-end autonomous driving with large vision language action model

Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, Volker Tresp, and Alois Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model, 2025. URL https://arxiv.org/abs/2503.23463

[46]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[47]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

[48]

The expressive power of transformers with chain of thought

William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023

[49]

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875, 1, 2024

[50]

Towards revealing the mystery behind chain of thought: a theoretical perspective

Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36:70757–70798, 2023

[51]

Chain-of-thought reasoning without prompting

Xuezhi Wang and Denny Zhou. Chain-of-thought reasoning without prompting. Advances in Neural Information Processing Systems, 37:66383–66409, 2024

[52]

A survey on large language models for mathematical reasoning

Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, et al. A survey on large language models for mathematical reasoning. ACM Computing Surveys, 2025

[53]

Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in llms

Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, et al. Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in llms. arXiv preprint arXiv:2502.19411, 2025

[54]

Llava-cot: Let vision language models reason step-by-step

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087–2098, 2025

[55]

When visualizing is the first step to reasoning: Mira, a benchmark for visual chain-of-thought

Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, et al. When visualizing is the first step to reasoning: Mira, a benchmark for visual chain-of-thought. arXiv preprint arXiv:2511.02779, 2025

[56]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

[57]

Textbooks Are All You Need II: phi-1.5 technical report

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023

[58]

Wizardcoder: Empowering code large language models with evol-instruct

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023

[59]

Magicoder: Empowering code generation with oss-instruct

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with oss-instruct. arXiv preprint arXiv:2312.02120, 2023

[60]

Tinygsm: achieving >80% on gsm8k with small language models

Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. Tinygsm: achieving >80% on gsm8k with small language models. arXiv preprint arXiv:2312.09241, 2023

[61]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023

[62]

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024

[63]

Stanford alpaca: An instruction-following llama model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023

[64]

Solving olympiad geometry without human demonstrations

Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024

[65]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

[66]

Mastering the game of go with deep neural networks and tree search

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016

[67]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

[68]

Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding

Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, et al. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282, 2024

[69]

Training chain-of-thought via latent-variable inference

Matthew Douglas Hoffman, Du Phan, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, and Rif A Saurous. Training chain-of-thought via latent-variable inference. In NeurIPS, 2023

[70]

Amortizing intractable inference in large language models

Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, and Nikolay Malkin. Amortizing intractable inference in large language models. In The Twelfth International Conference on Learning Representations, 2024

[71]

Brite: Bootstrapping reinforced thinking process to enhance language model reasoning

Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, et al. Brite: Bootstrapping reinforced thinking process to enhance language model reasoning. arXiv preprint arXiv:2501.18858, 2025

[72]

Beyond human data: Scaling self-training for problem-solving with language models

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron T Parisi, Abhishek Kumar, Alexander A Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Fathy Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey P...

work page 2024
[73]

Reasoning to learn from latent thoughts

Yangjun Ruan, Neil Band, Chris J Maddison, and Tatsunori Hashimoto. Reasoning to learn from latent thoughts. arXiv preprint arXiv:2503.18866, 2025

[74]

Skill induction and planning with latent language

Pratyusha Sharma, Antonio Torralba, and Jacob Andreas. Skill induction and planning with latent language. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1713–1726, 2022

[75]

Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification

Yilin Wu, Anqi Li, Tucker Hermans, Fabio Ramos, Andrea Bajcsy, and Claudia Pérez-D’Arpino. Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification. arXiv preprint arXiv:2510.16281, 2025

[77]

MolmoAct: Action Reasoning Models that can Reason in Space

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025

[78]

Objectvla: End-to-end open-world object manipulation without demonstration

Minjie Zhu, Yichen Zhu, Jinming Li, Zhongyi Zhou, Junjie Wen, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, and Feifei Feng. Objectvla: End-to-end open-world object manipulation without demonstration. arXiv preprint arXiv:2502.19250, 2025

[79]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

[80]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

[81]

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024


Showing first 80 references.

[1] [1]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[2] [2]

Vision-language-action models for robotics: A review towards real-world applications

Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025

work page 2025

[3] [3]

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A sur- vey on vision-language-action models: An action tok- enization perspective.arXiv preprint arXiv:2507.01925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Vision-language models for vision tasks: A survey

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE transactions on pattern analysis and machine intelligence, 46(8):5625–5644, 2024

work page 2024

[5] [5]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In Pulkit Agrawal, Oliver Kroemer, and W...

work page 2025

[6] [6]

Minivla: A better vla with a smaller footprint, 2024

Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024. URL https://github. com/Stanford-ILIAD/openvla-mini

work page 2024

[7] [7]

InProceedings of Robotics: Science and Systems, LosAngeles, CA, USA, June 2025

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, Laura Smith, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury...

work page doi:10.15607/rss.2025.xxi.010 2025

[8] [8]

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, brian ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren,...

work page

[9] [9]

PMLR, 27–30 Sep 2025

work page 2025

[10] [10]

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared Di- Carlo, et al.π 0.6: a vla that learns from experience. arXiv preprint arXiv:2511.14759, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

RT-H: Action Hierarchies using Language.Proceedings of Robotics: Science and Systems, July 2024

Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Ser- manet, Quan Vuong, Jonathan Tompson, Yevgen Cheb- otar, Debidatta Dwibedi, and Dorsa Sadigh. RT-H: Action Hierarchies using Language.Proceedings of Robotics: Science and Systems, July 2024. doi: 10. 15607/RSS.2024.XX.049

work page 2024

[12] [12]

Robotic control via embodied chain-of-thought reasoning

Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Bur- gard, editors,Proceedings of The 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learning Research, pages 3157–3181. PMLR, 06–09 Nov 2025

work page 2025

[13] [13]

Training strategies for efficient embodied rea- soning

William Chen, Suneel Belkhale, Suvir Mirchandani, Karl Pertsch, Danny Driess, Oier Mees, and Sergey Levine. Training strategies for efficient embodied rea- soning. In Joseph Lim, Shuran Song, and Hae-Won Park, editors,Proceedings of The 9th Conference on Robot Learning, volume 305 ofProceedings of Machine Learning Research, pages 365–391. PMLR, 27–30 Sep 2025

work page 2025

[14] [14]

Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

Gemini Robotics Team, Abbas Abdolmaleki, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Ashwin Balakrishna, Nathan Batchelor, Alex Bewley, Jeff Bingham, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer.arXiv preprint arXiv:2510.03342, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos- reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi

Matt Deitke, Christopher Clark, Sangho Lee, Ro- hun Tripathi, Yue Yang, Jae Sung Park, Moham- madreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, ...

work page 2025

[18] [18]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Robovqa: Multimodal long-horizon reasoning for robotics

Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Chris- tine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024

work page 2024

[20] [20]

Sim- pact: Simulation-enabled action planning using vision- language models.arXiv preprint arXiv:2512.05955, 2025

Haowen Liu, Shaoxiong Yao, Haonan Chen, Jiawei Gao, Jiayuan Mao, Jia-Bin Huang, and Yilun Du. Sim- pact: Simulation-enabled action planning using vision- language models.arXiv preprint arXiv:2512.05955, 2025

work page arXiv 2025

[21] [21]

Evovla: Self-evolving vision-language-action model

Zeting Liu, Zida Yang, Zeyu Zhang, and Hao Tang. Evovla: Self-evolving vision-language-action model. arXiv preprint arXiv:2511.16166, 2025

work page arXiv 2025

[22] [22]

Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial rea- soning

Qi Sun, Pengfei Hong, Tej Deep Pala, Vernon Toh, U-Xuan Tan, Deepanway Ghosal, and Soujanya Poria. Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial rea- soning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14199–14214, 2025

work page 2025

[23] [23]

Argus: Vision-centric reasoning with grounded chain-of-thought

Yunze Man, De-An Huang, Guilin Liu, Shiwei Sheng, Shilong Liu, Liang-Yan Gui, Jan Kautz, Yu-Xiong Wang, and Zhiding Yu. Argus: Vision-centric reasoning with grounded chain-of-thought. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14268–14280, 2025

work page 2025

[24] [24]

Igniting vlms toward the embodied space

Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766, 2025

[25] [25]

Distilling internet-scale vision-language models into embodied agents

Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, and Ishita Dasgupta. Distilling internet-scale vision-language models into embodied agents. arXiv preprint arXiv:2301.12507, 2023

[26] [26]

Chatvla: Unified multimodal understanding and robot control with vision-language-action model

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025

[27] [27]

PaliGemma 2: A Family of Versatile VLMs for Transfer

Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555, 2024

[28] [28]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

[29] [29]

Prismatic vlms: Investigating the design space of visually-conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. In Forty-first International Conference on Machine Learning, 2024

[30] [30]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

[31] [31]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z. Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, Abraham Lee, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Jie Tan, Marc Toussaint, and Kourosh Darvish, editors, Proceedings of The 7th Conference on Robo...

[32] [32]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

[33] [33]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

[34] [34]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient Action Tokenization for Vision-Language-Action Models. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. doi: 10.15607/RSS.2025.XXI.012

[35] [35]

Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers

Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, and Tong He. Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

[36] [36]

Foundation models in robotics: Applications, challenges, and the future

Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. Foundation models in robotics: Applications, challenges, and the future. The International Journal of Robotics Research, 44(5):701–739, 2025

[37] [37]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla...

[38] [38]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–

[39] [39]

Octo: An open-source generalist robot policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and...

[40] [40]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

[41] [41]

NaVILA: Legged Robot Vision-Language-Action Model for Navigation

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged Robot Vision-Language-Action Model for Navigation. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. doi: 10.15607/RSS.2025.XXI.018

[42] [42]

Quar-vla: Vision-language-action model for quadruped robots

Pengxiang Ding, Han Zhao, Wenjie Zhang, Wenxuan Song, Min Zhang, Siteng Huang, Ningxi Yang, and Donglin Wang. Quar-vla: Vision-language-action model for quadruped robots. In European Conference on Computer Vision, pages 352–367. Springer, 2024

[43] [43]

Humanoid-vla: Towards universal humanoid control with visual integration

Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, et al. Humanoid-vla: Towards universal humanoid control with visual integration. arXiv preprint arXiv:2502.14795, 2025

[44] [44]

Wholebodyvla: Towards unified latent vla for whole-body loco-manipulation control

Haoran Jiang, Jin Chen, Qingwen Bu, Li Chen, Modi Shi, Yanjie Zhang, Delong Li, Chuanzhe Suo, Chuang Wang, Zhihui Peng, et al. Wholebodyvla: Towards unified latent vla for whole-body loco-manipulation control. arXiv preprint arXiv:2512.11047, 2025

[45] [45]

Opendrivevla: Towards end-to-end autonomous driving with large vision language action model

Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, Volker Tresp, and Alois Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model, 2025. URL https://arxiv.org/abs/2503.23463

[46] [46]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[47] [47]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

[48] [48]

The expressive power of transformers with chain of thought

William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023

[49] [49]

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. arXiv preprint arXiv:2402.12875, 1, 2024

[50] [50]

Towards revealing the mystery behind chain of thought: a theoretical perspective

Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36:70757–70798, 2023

[51] [51]

Chain-of-thought reasoning without prompting

Xuezhi Wang and Denny Zhou. Chain-of-thought reasoning without prompting. Advances in Neural Information Processing Systems, 37:66383–66409, 2024

[52] [52]

A survey on large language models for mathematical reasoning

Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, et al. A survey on large language models for mathematical reasoning. ACM Computing Surveys, 2025

[53] [53]

Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in llms

Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, et al. Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in llms. arXiv preprint arXiv:2502.19411, 2025

[54] [54]

Llava-cot: Let vision language models reason step-by-step

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087–2098, 2025

[55] [55]

When visualizing is the first step to reasoning: Mira, a benchmark for visual chain-of-thought

Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, et al. When visualizing is the first step to reasoning: Mira, a benchmark for visual chain-of-thought. arXiv preprint arXiv:2511.02779, 2025

[56] [56]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

[57] [57]

Textbooks Are All You Need II: phi-1.5 technical report

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023

[58] [58]

Wizardcoder: Empowering code large language models with evol-instruct

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023

[59] [59]

Magicoder: Empowering code generation with oss-instruct

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with oss-instruct. arXiv preprint arXiv:2312.02120, 2023

[60] [60]

Tinygsm: achieving >80% on gsm8k with small language models

Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. Tinygsm: achieving >80% on gsm8k with small language models. arXiv preprint arXiv:2312.09241, 2023

[61] [61]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023

[62] [62]

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024

[63] [63]

Stanford alpaca: An instruction-following llama model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023

[64] [64]

Solving olympiad geometry without human demonstrations

Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024

[65] [65]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025


[66] [66]

Mastering the game of go with deep neural networks and tree search

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016

[67] [67]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

[68] [68]

Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding

Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, et al. Language models are hidden reasoners: Unlocking latent reasoning capabilities via self-rewarding. arXiv preprint arXiv:2411.04282, 2024

[69] [69]

Training chain-of-thought via latent-variable inference

Matthew Douglas Hoffman, Du Phan, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, and Rif A Saurous. Training chain-of-thought via latent-variable inference. In NeurIPS, 2023

[70] [70]

Amortizing intractable inference in large language models

Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, and Nikolay Malkin. Amortizing intractable inference in large language models. In The Twelfth International Conference on Learning Representations, 2024

[71] [71]

Brite: Bootstrapping reinforced thinking process to enhance language model reasoning

Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, et al. Brite: Bootstrapping reinforced thinking process to enhance language model reasoning. arXiv preprint arXiv:2501.18858, 2025

[72] [72]

Beyond human data: Scaling self-training for problem-solving with language models

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron T Parisi, Abhishek Kumar, Alexander A Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Fathy Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey P...

[73] [73]

Reasoning to learn from latent thoughts

Yangjun Ruan, Neil Band, Chris J Maddison, and Tatsunori Hashimoto. Reasoning to learn from latent thoughts. arXiv preprint arXiv:2503.18866, 2025

[74] [74]

Skill induction and planning with latent language

Pratyusha Sharma, Antonio Torralba, and Jacob Andreas. Skill induction and planning with latent language. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1713–1726, 2022

[75] [75]

Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification

Yilin Wu, Anqi Li, Tucker Hermans, Fabio Ramos, Andrea Bajcsy, and Claudia Pérez-D'Arpino. Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification. arXiv preprint arXiv:2510.16281, 2025

[76] [77]

MolmoAct: Action Reasoning Models that can Reason in Space

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025

[77] [78]

Objectvla: End-to-end open-world object manipulation without demonstration

Minjie Zhu, Yichen Zhu, Jinming Li, Zhongyi Zhou, Junjie Wen, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, and Feifei Feng. Objectvla: End-to-end open-world object manipulation without demonstration. arXiv preprint arXiv:2502.19250, 2025

[78] [79]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

[79] [80]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

[80] [81]

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024