arxiv: 2507.16815 · v2 · pith:NQB6XJHGnew · submitted 2025-07-22 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Pith reviewed 2026-05-19 03:15 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords vision-language-action · embodied reasoning · multimodal LLM · visual latent planning · reinforced rewards · robot manipulation · few-shot adaptation · long-horizon planning

The pith

ThinkAct trains multimodal LLMs with visual rewards on goal completion to generate plans that a separate action model can execute more reliably in long tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing vision-language-action models map instructions straight to actions without an explicit reasoning stage, which limits their ability to handle extended sequences or adjust to changes. ThinkAct introduces a two-part setup in which a multimodal large language model first produces embodied reasoning plans. These plans receive reinforcement from visual signals that score how well they match completed goals and consistent trajectories. The plans are then reduced to a compact visual plan latent that steers a downstream action model during actual execution. Experiments on embodied reasoning and robot manipulation tasks show this yields improved few-shot adaptation, multi-step planning, and error recovery.

Core claim

ThinkAct is a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. It trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments.

What carries the argument

Reinforced visual latent planning: the process of optimizing multimodal LLM reasoning plans through visual rewards tied to goal completion and trajectory consistency, then compressing those plans into a latent vector that directly conditions a separate action execution model.
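
The data flow described above (plan, then latent, then action) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `reason`, `compress_plan`, and `act` are hypothetical stand-ins for the multimodal LLM, the compression step, and the action model, and the latent is assumed to be a fixed-length vector, which this page does not specify.

```python
from dataclasses import dataclass
from typing import List

LATENT_DIM = 4  # assumed size; the source does not state the latent's shape


@dataclass
class Plan:
    steps: List[str]  # textual reasoning steps produced by the reasoner


def reason(observation: str, instruction: str) -> Plan:
    """Hypothetical stand-in for the multimodal LLM: returns a toy plan."""
    return Plan(steps=[
        f"locate target for '{instruction}'",
        f"move toward it given '{observation}'",
        "grasp and place",
    ])


def compress_plan(plan: Plan) -> List[float]:
    """Hypothetical compression: fold each step's character codes
    into a fixed-length vector (deterministic, purely illustrative)."""
    latent = [0.0] * LATENT_DIM
    for i, step in enumerate(plan.steps):
        latent[i % LATENT_DIM] += (sum(map(ord, step)) % 100) / 100.0
    return latent


def act(observation: str, latent: List[float]) -> List[float]:
    """Hypothetical action model: maps (observation, latent) to a 2-D action."""
    bias = len(observation) % 3 - 1  # toy dependence on the observation
    return [latent[0] + bias, sum(latent[1:])]


plan = reason("cup on table", "put the cup in the sink")
latent = compress_plan(plan)          # the only thing the action model sees of the plan
action = act("cup on table", latent)
```

The point of the sketch is the interface: the action model never reads the plan text, only the compressed latent, so whatever the latent discards is unavailable at execution time.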

If this is right

  • The framework supports few-shot adaptation to new task variations without full retraining.
  • It enables explicit long-horizon planning over multiple steps instead of single-step action prediction.
  • Self-correction behaviors emerge during execution when the visual plan latent supplies corrective guidance.
  • Performance improves on standard embodied reasoning and robot manipulation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Separating the reasoning stage from action execution may allow independent updates to the planning component when new visual reward signals become available.
  • The latent compression step could reduce the computational cost of running the full multimodal LLM at every timestep during real-world deployment.
  • This modular structure opens a route for combining the system with external simulators to generate additional training signals for the reward model.
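
The cost point in the second item can be made concrete with a small count of model invocations. The sketch below assumes the plan latent is refreshed only every `refresh_every` control steps while the action model runs at every step; the refresh interval and the function name are hypothetical and do not come from the paper.

```python
import math


def invocation_counts(horizon: int, refresh_every: int) -> dict:
    """Count model calls over `horizon` control steps.

    The action model runs every step; the reasoner runs only when the
    plan latent is refreshed (steps 0, refresh_every, 2 * refresh_every, ...).
    """
    if horizon < 0 or refresh_every < 1:
        raise ValueError("horizon must be >= 0 and refresh_every >= 1")
    reasoner_calls = math.ceil(horizon / refresh_every)
    return {"reasoner": reasoner_calls, "action_model": horizon}


counts = invocation_counts(horizon=100, refresh_every=25)
```

With `refresh_every=1` the reasoner is called as often as the action model, which is the case the editorial point contrasts against.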

Load-bearing premise

Visual rewards based on goal completion and trajectory consistency will generate reasoning plans from the multimodal LLM that remain effective once compressed into the visual plan latent for guiding action execution.

What would settle it

A direct comparison in which action models guided by the compressed visual plan latent show no gain in success rate or no increase in self-correction on long-horizon manipulation tasks relative to standard end-to-end vision-language-action models would undermine the central claim.
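
One way to judge whether a success-rate gap in such a comparison is distinguishable from noise is a two-proportion z-test. The sketch below is an assumption about how the comparison might be scored, not a protocol from the paper; it uses the normal approximation and treats episodes as independent trials. The counts in the example call are invented placeholders, not reported results.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Two-sided z-test for a difference between two success rates."""
    if trials_a <= 0 or trials_b <= 0:
        raise ValueError("trial counts must be positive")
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0.0:
        # Both groups are all-success or all-failure: no observable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder counts for illustration only.
z, p = two_proportion_z_test(70, 100, 50, 100)
z_eq, p_eq = two_proportion_z_test(50, 100, 50, 100)
```

A large p-value for the latent-guided model versus an end-to-end baseline would correspond to the "no gain" outcome described above; small per-task episode counts would call for an exact test instead.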

Figures

Figures reproduced from arXiv: 2507.16815 by Chi-Pin Huang, Fu-En Yang, Min-Hung Chen, Yu-Chiang Frank Wang, Yueh-Hua Wu.

Figure 1: We introduce ThinkAct, a reasoning VLA framework capable of thinking before acting. Through …
Figure 2: Overview of our ThinkAct. (a) Given observation 𝑜𝑡 and instruction 𝑙, ThinkAct advances action-aligned rewards derived from visual trajectory 𝜏 to incentivize embodied reasoning capability of Reasoning MLLM ℱ𝜃. (b) Conditioned on the visual plan latent 𝑐𝑡, the DiT-based Action Model 𝜋𝜑 learns to predict executable action while keeping ℱ𝜃 frozen. Note that, during inference, 𝜋𝜑 and ℱ𝜃 could operate asynchro…
Figure 3: Qualitative results of intermediate reasoning steps and visualized trajectory for robot manipulation …
Figure 4: Qualitative comparison of reasoning process and the derived answer for our ThinkAct with and …
Figure 5: Few-shot adaptation results on LIBERO. We use 10 demonstrations per task for fine-tuning.
Figure 6: Demonstration of self-reflection and correction capability of ThinkAct. The robot accidentally drops …
Original abstract

Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ThinkAct, a dual-system framework for vision-language-action (VLA) reasoning in embodied AI. It trains a multimodal LLM to produce embodied reasoning plans using reinforced action-aligned visual rewards based on goal completion and trajectory consistency; these plans are then compressed into a visual plan latent that conditions a downstream action model. The authors claim that this approach enables few-shot adaptation, long-horizon planning, and self-correction on embodied reasoning and robot manipulation benchmarks, outperforming end-to-end VLA baselines.

Significance. If the central claims hold, ThinkAct would represent a meaningful advance by explicitly separating high-level reinforced reasoning from low-level action execution via a compressed visual latent, potentially improving adaptability and planning in dynamic environments over purely end-to-end trained VLAs. The use of goal-based and consistency rewards is a plausible mechanism for guiding multimodal LLMs toward useful plans, but the significance is tempered by the absence of quantitative evidence in the abstract and the unverified transfer properties of the compression step.

major comments (3)
  1. [Abstract] Abstract: the claim of 'extensive experiments on embodied reasoning and robot manipulation benchmarks' that 'demonstrate few-shot adaptation, long-horizon planning, and self-correction' is unsupported because the abstract supplies no metrics, baselines, ablation results, or implementation details; this prevents verification of whether the data actually supports the stated behaviors.
  2. [Method] Method (reinforced planning stage): the reward function combining goal-completion and trajectory-consistency terms is never formulated; without an explicit equation or weighting scheme, it is impossible to assess whether the reinforced plans contain the multi-step causal structure needed for downstream self-correction and few-shot adaptation.
  3. [Method] Method (latent compression): no operator or architecture is given for compressing the LLM-generated reasoning plans into the visual plan latent, and no ablation isolates whether temporal or causal details survive compression; if the latent discards the information the rewards were intended to enforce, the claimed adaptation and correction behaviors cannot follow from the reinforced planning stage.
minor comments (2)
  1. [Abstract] The abstract uses the phrase 'action-aligned visual rewards' without defining alignment; a brief clarification of this term would improve readability.
  2. [Method] Notation for the visual plan latent (e.g., whether it is a vector, feature map, or sequence) is introduced without an accompanying equation or diagram reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.

  1. Referee: [Abstract] Abstract: the claim of 'extensive experiments on embodied reasoning and robot manipulation benchmarks' that 'demonstrate few-shot adaptation, long-horizon planning, and self-correction' is unsupported because the abstract supplies no metrics, baselines, ablation results, or implementation details; this prevents verification of whether the data actually supports the stated behaviors.

    Authors: We agree that the abstract, being a high-level summary, lacks the quantitative support needed to fully substantiate the claims. In the revised manuscript we will expand the abstract to include specific metrics (e.g., success rates on embodied reasoning and manipulation benchmarks), comparisons against end-to-end VLA baselines, and concise references to ablation results that demonstrate the contributions to few-shot adaptation, long-horizon planning, and self-correction. revision: yes

  2. Referee: [Method] Method (reinforced planning stage): the reward function combining goal-completion and trajectory-consistency terms is never formulated; without an explicit equation or weighting scheme, it is impossible to assess whether the reinforced plans contain the multi-step causal structure needed for downstream self-correction and few-shot adaptation.

    Authors: The current manuscript describes the reward as a combination of goal-completion and trajectory-consistency terms but does not provide an explicit equation or weighting details. We will add a clear mathematical formulation (R = w_g * R_goal + w_t * R_traj) together with definitions of each term and the chosen hyperparameter values in the revised Method section so that readers can evaluate the causal structure enforced by the rewards. revision: yes

  3. Referee: [Method] Method (latent compression): no operator or architecture is given for compressing the LLM-generated reasoning plans into the visual plan latent, and no ablation isolates whether temporal or causal details survive compression; if the latent discards the information the rewards were intended to enforce, the claimed adaptation and correction behaviors cannot follow from the reinforced planning stage.

    Authors: The manuscript outlines that reasoning plans are compressed into a visual plan latent that conditions the action model, but does not specify the compression operator or provide an ablation on information retention. We will expand the Method section with the precise architecture (including the encoder design) and add an ablation study that quantifies preservation of temporal and causal details after compression, thereby linking the reinforced planning stage to the observed adaptation and self-correction behaviors. revision: yes
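
The weighted form quoted in the second simulated response, R = w_g * R_goal + w_t * R_traj, can be written out as below. This is only a sketch: the default weights and the assumption that each term lies in [0, 1] are placeholders for illustration, since this page gives no definitions or hyperparameter values for either term.

```python
def combined_reward(r_goal: float, r_traj: float,
                    w_goal: float = 0.5, w_traj: float = 0.5) -> float:
    """Weighted sum R = w_goal * r_goal + w_traj * r_traj.

    Default weights are placeholders, not values from the paper.
    """
    for name, value in (("r_goal", r_goal), ("r_traj", r_traj)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} is assumed to lie in [0, 1], got {value}")
    return w_goal * r_goal + w_traj * r_traj


# A plan that reaches the goal but follows the reference trajectory only halfway.
reward = combined_reward(r_goal=1.0, r_traj=0.5)
```

The two weights are exactly the free parameters the referee asks the authors to state, since their ratio decides whether a plan is rewarded more for reaching the goal or for staying on a consistent trajectory.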

Circularity Check

0 steps flagged

No significant circularity; training procedure uses external rewards without self-referential reduction

The paper describes ThinkAct as training a multimodal LLM to produce embodied reasoning plans using reinforcing visual rewards defined externally from goal completion and trajectory consistency, followed by compression into a visual plan latent to condition an action model. No equations or steps in the provided description reduce the central claims (few-shot adaptation, long-horizon planning, self-correction) to quantities defined by the model's own fitted parameters or prior self-citations in a load-bearing way. The derivation chain remains self-contained against external benchmarks and does not exhibit self-definitional, fitted-input-as-prediction, or uniqueness-imported patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that visual rewards derived from goal completion can train useful plans and that latent compression preserves necessary information for action; no new physical entities are introduced.

free parameters (1)
  • balancing weights between goal-completion and trajectory-consistency rewards
    These weights are needed to define the reinforcement signal and are not derived from first principles.
axioms (1)
  • domain assumption A multimodal LLM can be trained to output embodied reasoning plans that align with visual action outcomes
    Invoked when stating that the LLM generates plans guided by the reinforced rewards.

pith-pipeline@v0.9.0 · 5707 in / 1239 out tokens · 56175 ms · 2026-05-19T03:15:01.519748+00:00 · methodology

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 conditional novelty 7.0

    Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

  2. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  3. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  4. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...

  5. Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.

  6. DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

    cs.RO 2026-03 conditional novelty 7.0

    DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.

  7. Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

  8. Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

    cs.CV 2025-12 unverdicted novelty 7.0

    DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.

  9. FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

    cs.RO 2026-05 unverdicted novelty 6.0

    FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.

  10. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  11. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  12. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  13. From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...

  14. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  15. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  16. Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.

  17. $M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

    cs.RO 2026-04 unverdicted novelty 6.0

    M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.

  18. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  19. AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

    cs.RO 2026-04 unverdicted novelty 6.0

    AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.

  20. VLANeXt: Recipes for Building Strong VLA Models

    cs.CV 2026-02 conditional novelty 6.0

    VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further exploration.

  21. Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

    cs.RO 2026-02 unverdicted novelty 6.0

    LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a ...

  22. Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

    cs.RO 2026-02 unverdicted novelty 6.0

    R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.

  23. DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning

    cs.RO 2026-01 unverdicted novelty 6.0

    DextER uses contact-based embodied reasoning via autoregressive token generation to produce language-driven dexterous grasps, reaching 67.14% success on DexGYS with a 3.83 p.p. gain over prior methods and 96.4% better...

  24. mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

    cs.RO 2025-12 unverdicted novelty 6.0

    mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.

  25. AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

    cs.RO 2025-11 unverdicted novelty 6.0

    AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.

  26. Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

    cs.RO 2025-08 conditional novelty 6.0

    Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks ...

  27. PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

    cs.RO 2026-05 unverdicted novelty 5.0

    PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.

  28. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

  29. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...

  30. Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 5.0

    MoT-HRA learns embodiment-agnostic human-intention priors from a curated 2.2M-episode human video dataset via a three-expert hierarchical vision-language-action model to improve robotic manipulation under distribution shift.

  31. Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

    cs.RO 2026-04 unverdicted novelty 5.0

    A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.

  32. Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

    cs.RO 2026-03 unverdicted novelty 5.0

    Parameter differences from two training runs on a small task set are treated as auxiliary capability vectors that are merged into a pretrained VLA model, yielding auxiliary-task gains at the cost of ordinary supervise...

  33. Towards Explainable Industrial Anomaly Detection via Knowledge-Guided Latent Reasoning

    cs.CV 2026-02 unverdicted novelty 5.0

    Reason-IAD improves explainable industrial anomaly detection by combining retrieval-augmented category knowledge with entropy-guided latent reasoning and dynamic visual patch injection in MLLMs.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 27 Pith papers · 26 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 6, 13

  3. [3]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 1, 3

  4. [4]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. 3

  5. [5]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023. 1

  6. [6]

    Eagle 2.5: Boosting long-context post-training for frontier vision-language models.arXiv preprint arXiv:2504.15271, 2025

    Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, et al. Eagle 2.5: Boosting long-context post-training for frontier vision-language models.arXiv preprint arXiv:2504.15271, 2025. 1

  7. [7]

    Egoplan-bench: Benchmarking egocentric embodied planning with multimodal large language models,

    Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. Egoplan-bench: Benchmarking multimodal large language models for human-level planning.arXiv preprint arXiv:2312.06722, 2023. 6, 8, 12, 13, 14

  8. [8]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024. 1

  9. [9]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, page 02783649241273668, 2023

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, page 02783649241273668, 2023. 5, 6, 12, 15

  10. [10]

    Action-free reasoning for policy generalization.arXiv preprint arXiv:2502.03729, 2025

    Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization.arXiv preprint arXiv:2502.03729, 2025. 1, 3

  11. [11]

    AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

    Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, and Yijie Guo. Aha: A vision-language-model for detecting and reasoning over failures in robotic manipulation.arXiv preprint arXiv:2410.00371, 2024. 3

  12. [12]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025. 2, 3, 6, 12, 13

  13. [13]

    something something

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceedings of the IEEE international conference on computer vision, pages 5842–585...

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2, 3, 4, 5 18 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

  15. [15]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

  16. [16]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2...

  17. [17]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326,

  18. [18]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023. 6

  19. [19]

    Llara: Supercharging robot learning data for vision-language policy

    Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, et al. Llara: Supercharging robot learning data for vision-language policy. arXiv preprint arXiv:2406.20095, 2024. 3

  20. [20]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941,

  21. [21]

    Hamster: Hierarchical action models for open-world robot manipulation

    Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Raymond Yu, Caelan Reed Garrett, Fabio Ramos, Dieter Fox, Anqi Li, et al. Hamster: Hierarchical action models for open-world robot manipulation. arXiv preprint arXiv:2502.05485, 2025. 3, 12

  22. [22]

    Eagle 2: Building post-training data strategies from scratch for frontier vision-language models

    Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models. arXiv preprint arXiv:2501.14818, 2025. 1

  23. [23]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689–26699, 2024. 1

  24. [24]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023. 4, 6, 7, 8, 10, 12, 13, 16, 17

  25. [25]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 1

  26. [26]

    Reflect: Summarizing robot experiences for failure explanation and correction

    Zeyi Liu, Arpit Bahety, and Shuran Song. Reflect: Summarizing robot experiences for failure explanation and correction. arXiv preprint arXiv:2306.15724, 2023. 5, 6, 11, 12, 13

  27. [27]

    NVILA: Efficient Frontier Visual Language Models

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models. arXiv preprint arXiv:2412.04468, 2024. 1

  28. [28]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025. 3

  29. [29]

    OpenEQA: Embodied Question Answering in the Era of Foundation Models

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, and Aravind Raje...

  30. [30]

    Llarva: Vision-action instruction tuning enhances robot learning

    Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, and Roei Herzig. Llarva: Vision-action instruction tuning enhances robot learning. arXiv preprint arXiv:2406.11815, 2024. 3, 4, 12

  31. [31]

    Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

    NVIDIA, Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin, Ye...

  32. [32]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 6

  33. [33]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024. ...

  34. [34]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002. 7

  35. [35]

    EgoPlan-Bench2: A benchmark for multimodal large language model planning in real-world scenarios

    Lu Qiu, Yuying Ge, Yi Chen, Yixiao Ge, Ying Shan, and Xihui Liu. Egoplan-bench2: A benchmark for multimodal large language model planning in real-world scenarios. arXiv preprint arXiv:2412.04447, 2024.

  36. [36]

    6, 7, 14, 15

  37. [37]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR,

  38. [38]

    Dynamic time warping algorithm review

    Pavel Senin. Dynamic time warping algorithm review. Information and Computer Science Department University of Hawaii at Manoa Honolulu, USA, 855(1-23):40, 2008. 5

  39. [39]

    Robovqa: Multimodal long-horizon reasoning for robotics

    Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024. 5, 6, 7, 8, 9, 12, 13, 14, 15

  40. [40]

    Understanding human hands in contact at internet scale

    Dandan Shan, Jiaqi Geng, Michelle Shu, and David Fouhey. Understanding human hands in contact at internet scale. In Proceedings of the IEEE/CVF international conference on computer vision, 2020. 12

  41. [41]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 2, 3, 4, 5, 6

  42. [42]

    Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

    Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417, 2025. 1

  43. [43]

    Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

    Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, et al. Eagle: Exploring the design space for multimodal llms with mixture of encoders. arXiv preprint arXiv:2408.15998, 2024. 1

  44. [44]

    From multimodal llms to generalist embodied agents: Methods and lessons

    Andrew Szot, Bogdan Mazoure, Omar Attia, Aleksei Timofeev, Harsh Agrawal, Devon Hjelm, Zhe Gan, Zsolt Kira, and Alexander Toshev. From multimodal llms to generalist embodied agents: Methods and lessons. arXiv preprint arXiv:2412.08442, 2024. 3

  45. [45]

    Reason-rft: Reinforcement fine-tuning for visual reasoning

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752, 2025.

  46. [46]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 1

  47. [47]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024. 1

  48. [48]

    Chain-of-thought reasoning without prompting

    Xuezhi Wang and Denny Zhou. Chain-of-thought reasoning without prompting. arXiv preprint arXiv:2402.10200, 2024. 3

  49. [49]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 1, 3

  50. [50]

    Magma: A foundation model for multimodal ai agents

    Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. arXiv preprint arXiv:2502.13130, 2025. 3, 4, 10, 12, 16

  51. [51]

    Demystifying Long Chain-of-Thought Reasoning in LLMs

    Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025. 3

  52. [52]

    Robopoint: A vision-language model for spatial affordance prediction for robotics

    Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024. 3

  53. [53]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024. 1, 3

  54. [54]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 6, 13

  55. [55]

    CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020, 2025. 1, 3, 7, 13

  56. [56]

    TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024. 1, 3, 4

  57. [57]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 1