ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Chi-Pin Huang; Fu-En Yang; Min-Hung Chen; Yu-Chiang Frank Wang; Yueh-Hua Wu

arxiv: 2507.16815 · v2 · pith:NQB6XJHGnew · submitted 2025-07-22 · 💻 cs.CV · cs.AI· cs.LG· cs.RO

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Chi-Pin Huang , Yueh-Hua Wu , Min-Hung Chen , Yu-Chiang Frank Wang , Fu-En Yang This is my paper

Pith reviewed 2026-05-19 03:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LGcs.RO

keywords vision-language-actionembodied reasoningmultimodal LLMvisual latent planningreinforced rewardsrobot manipulationfew-shot adaptationlong-horizon planning

0 comments

The pith

ThinkAct trains multimodal LLMs with visual rewards on goal completion to generate plans that a separate action model can execute more reliably in long tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing vision-language-action models map instructions straight to actions without an explicit reasoning stage, which limits their ability to handle extended sequences or adjust to changes. ThinkAct introduces a two-part setup in which a multimodal large language model first produces embodied reasoning plans. These plans receive reinforcement from visual signals that score how well they match completed goals and consistent trajectories. The plans are then reduced to a compact visual plan latent that steers a downstream action model during actual execution. Experiments on embodied reasoning and robot manipulation tasks show this yields improved few-shot adaptation, multi-step planning, and error recovery.

Core claim

ThinkAct is a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. It trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments.

What carries the argument

Reinforced visual latent planning: the process of optimizing multimodal LLM reasoning plans through visual rewards tied to goal completion and trajectory consistency, then compressing those plans into a latent vector that directly conditions a separate action execution model.

If this is right

The framework supports few-shot adaptation to new task variations without full retraining.
It enables explicit long-horizon planning over multiple steps instead of single-step action prediction.
Self-correction behaviors emerge during execution when the visual plan latent supplies corrective guidance.
Performance improves on standard embodied reasoning and robot manipulation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Separating the reasoning stage from action execution may allow independent updates to the planning component when new visual reward signals become available.
The latent compression step could reduce the computational cost of running the full multimodal LLM at every timestep during real-world deployment.
This modular structure opens a route for combining the system with external simulators to generate additional training signals for the reward model.

Load-bearing premise

Visual rewards based on goal completion and trajectory consistency will generate reasoning plans from the multimodal LLM that remain effective once compressed into the visual plan latent for guiding action execution.

What would settle it

A direct comparison in which action models guided by the compressed visual plan latent show no gain in success rate or no increase in self-correction on long-horizon manipulation tasks relative to standard end-to-end vision-language-action models would undermine the central claim.

Figures

Figures reproduced from arXiv: 2507.16815 by Chi-Pin Huang, Fu-En Yang, Min-Hung Chen, Yu-Chiang Frank Wang, Yueh-Hua Wu.

**Figure 2.** Figure 2: Overview of our ThinkAct. (a) Given observation 𝑜𝑡 and instruction 𝑙, ThinkAct advances actionaligned rewards derived from visual trajectory 𝜏 to incentivize embodied reasoning capability of Reasoning MLLM ℱ𝜃. (b) Conditioned on the visual plan latent 𝑐𝑡, the DiT-based Action Model 𝜋𝜑 learns to predict executable action while keeping ℱ𝜃 frozen. Note that, during inference, 𝜋𝜑 and ℱ𝜃 could operate asynchro… view at source ↗

**Figure 3.** Figure 3: Qualitative results of intermediate reasoning steps and visualized trajectory for robot manipulation [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of reasoning process and the derived answer for our ThinkAct with and [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Few-shot adaptation results on LIBERO. We use 10 demonstrations per task for fine-tuning. Dropped!! <think>Let's start by examining the scene and the task. The goal is to put both the cream cheese box and the butter in the basket. However, something unexpected has happened — oh, I see the cream cheese box has dropped from the robot’s arm. Hmm, let's reconsider how to complete this task. First, I identify t… view at source ↗

**Figure 6.** Figure 6: Demonstration of self-reflection and correction capability of ThinkAct. The robot accidentally drops [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

read the original abstract

Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ThinkAct proposes a dual-system with reinforced visual planning to bridge LLM reasoning and action models in VLA tasks, but the abstract's lack of metrics and details on the compression step leaves the adaptation and self-correction claims hard to evaluate.

read the letter

The main point to take from ThinkAct is that it splits VLA work into a reinforced planning stage run by a multimodal LLM and a separate action model that takes a compressed visual latent from those plans. The reinforcement uses rewards tied to goal completion and trajectory consistency to shape the plans before they get squeezed down for execution. This setup targets the usual problem where end-to-end VLA training skips explicit multi-step reasoning and struggles with long tasks or changes in the environment. The framing is straightforward and builds directly on existing VLA and RL ideas without overclaiming a new paradigm. The choice to reinforce the plans with visual, action-aligned signals is a practical way to keep the high-level output grounded in what the robot can actually do. If the compression step keeps the important causal and temporal structure, the approach could support the few-shot adaptation and self-correction behaviors described. The soft spots sit in the evidence and the transfer mechanism. The abstract states that extensive experiments on embodied reasoning and robot manipulation benchmarks show the desired behaviors, yet it supplies no numbers, baselines, or ablation results. That gap makes it difficult to judge how large the gains are or whether the rewards are the main driver. The compression from full reasoning plans into the visual plan latent is the least secure link: without a clear description of the operator or tests showing what information survives, it is possible that critical long-horizon details get lost and the downstream action model ends up no better than a standard VLA. Minor implementation choices like the balancing weights on the two rewards are also left open. This paper is aimed at researchers working on embodied AI and robotics who already follow VLA models and want concrete ways to add planning without starting from scratch. Readers who care about RL for high-level guidance or latent conditioning of policies could extract useful ideas even if they adapt the details. It deserves a serious referee because the architecture is specific enough to be implemented and tested, and the claims are falsifiable once the missing results and ablations are supplied.

Referee Report

3 major / 2 minor

Summary. The paper proposes ThinkAct, a dual-system framework for vision-language-action (VLA) reasoning in embodied AI. It trains a multimodal LLM to produce embodied reasoning plans using reinforced action-aligned visual rewards based on goal completion and trajectory consistency; these plans are then compressed into a visual plan latent that conditions a downstream action model. The authors claim that this approach enables few-shot adaptation, long-horizon planning, and self-correction on embodied reasoning and robot manipulation benchmarks, outperforming end-to-end VLA baselines.

Significance. If the central claims hold, ThinkAct would represent a meaningful advance by explicitly separating high-level reinforced reasoning from low-level action execution via a compressed visual latent, potentially improving adaptability and planning in dynamic environments over purely end-to-end trained VLAs. The use of goal-based and consistency rewards is a plausible mechanism for guiding multimodal LLMs toward useful plans, but the significance is tempered by the absence of quantitative evidence in the abstract and the unverified transfer properties of the compression step.

major comments (3)

[Abstract] Abstract: the claim of 'extensive experiments on embodied reasoning and robot manipulation benchmarks' that 'demonstrate few-shot adaptation, long-horizon planning, and self-correction' is unsupported because the abstract supplies no metrics, baselines, ablation results, or implementation details; this prevents verification of whether the data actually supports the stated behaviors.
[Method] Method (reinforced planning stage): the reward function combining goal-completion and trajectory-consistency terms is never formulated; without an explicit equation or weighting scheme, it is impossible to assess whether the reinforced plans contain the multi-step causal structure needed for downstream self-correction and few-shot adaptation.
[Method] Method (latent compression): no operator or architecture is given for compressing the LLM-generated reasoning plans into the visual plan latent, and no ablation isolates whether temporal or causal details survive compression; if the latent discards the information the rewards were intended to enforce, the claimed adaptation and correction behaviors cannot follow from the reinforced planning stage.

minor comments (2)

[Abstract] The abstract uses the phrase 'action-aligned visual rewards' without defining alignment; a brief clarification of this term would improve readability.
[Method] Notation for the visual plan latent (e.g., whether it is a vector, feature map, or sequence) is introduced without an accompanying equation or diagram reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of 'extensive experiments on embodied reasoning and robot manipulation benchmarks' that 'demonstrate few-shot adaptation, long-horizon planning, and self-correction' is unsupported because the abstract supplies no metrics, baselines, ablation results, or implementation details; this prevents verification of whether the data actually supports the stated behaviors.

Authors: We agree that the abstract, being a high-level summary, lacks the quantitative support needed to fully substantiate the claims. In the revised manuscript we will expand the abstract to include specific metrics (e.g., success rates on embodied reasoning and manipulation benchmarks), comparisons against end-to-end VLA baselines, and concise references to ablation results that demonstrate the contributions to few-shot adaptation, long-horizon planning, and self-correction. revision: yes
Referee: [Method] Method (reinforced planning stage): the reward function combining goal-completion and trajectory-consistency terms is never formulated; without an explicit equation or weighting scheme, it is impossible to assess whether the reinforced plans contain the multi-step causal structure needed for downstream self-correction and few-shot adaptation.

Authors: The current manuscript describes the reward as a combination of goal-completion and trajectory-consistency terms but does not provide an explicit equation or weighting details. We will add a clear mathematical formulation (R = w_g * R_goal + w_t * R_traj) together with definitions of each term and the chosen hyperparameter values in the revised Method section so that readers can evaluate the causal structure enforced by the rewards. revision: yes
Referee: [Method] Method (latent compression): no operator or architecture is given for compressing the LLM-generated reasoning plans into the visual plan latent, and no ablation isolates whether temporal or causal details survive compression; if the latent discards the information the rewards were intended to enforce, the claimed adaptation and correction behaviors cannot follow from the reinforced planning stage.

Authors: The manuscript outlines that reasoning plans are compressed into a visual plan latent that conditions the action model, but does not specify the compression operator or provide an ablation on information retention. We will expand the Method section with the precise architecture (including the encoder design) and add an ablation study that quantifies preservation of temporal and causal details after compression, thereby linking the reinforced planning stage to the observed adaptation and self-correction behaviors. revision: yes

Circularity Check

0 steps flagged

No significant circularity; training procedure uses external rewards without self-referential reduction

full rationale

The paper describes ThinkAct as training a multimodal LLM to produce embodied reasoning plans using reinforcing visual rewards defined externally from goal completion and trajectory consistency, followed by compression into a visual plan latent to condition an action model. No equations or steps in the provided description reduce the central claims (few-shot adaptation, long-horizon planning, self-correction) to quantities defined by the model's own fitted parameters or prior self-citations in a load-bearing way. The derivation chain remains self-contained against external benchmarks and does not exhibit self-definitional, fitted-input-as-prediction, or uniqueness-imported patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that visual rewards derived from goal completion can train useful plans and that latent compression preserves necessary information for action; no new physical entities are introduced.

free parameters (1)

balancing weights between goal-completion and trajectory-consistency rewards
These weights are needed to define the reinforcement signal and are not derived from first principles.

axioms (1)

domain assumption A multimodal LLM can be trained to output embodied reasoning plans that align with visual action outcomes
Invoked when stating that the LLM generates plans guided by the reinforced rewards.

pith-pipeline@v0.9.0 · 5707 in / 1239 out tokens · 56175 ms · 2026-05-19T03:15:01.519748+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

GRPO ... r = 0.9 r_visual + 0.1 r_format where r_visual = ω_goal r_goal + ω_traj r_traj

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 conditional novelty 7.0

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 7.0

VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...
Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 7.0

MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching
cs.RO 2026-03 conditional novelty 7.0

DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation
cs.RO 2026-02 unverdicted novelty 7.0

PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
cs.CV 2025-12 unverdicted novelty 7.0

DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
cs.RO 2026-05 unverdicted novelty 6.0

FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO 2026-04 unverdicted novelty 6.0

LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
cs.RO 2026-04 unverdicted novelty 6.0

M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
cs.RO 2026-04 unverdicted novelty 6.0

Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement
cs.RO 2026-04 unverdicted novelty 6.0

AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.
VLANeXt: Recipes for Building Strong VLA Models
cs.CV 2026-02 conditional novelty 6.0

VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further exploration.
Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning
cs.RO 2026-02 unverdicted novelty 6.0

LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a ...
Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning
cs.RO 2026-02 unverdicted novelty 6.0

R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.
DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
cs.RO 2026-01 unverdicted novelty 6.0

DextER uses contact-based embodied reasoning via autoregressive token generation to produce language-driven dexterous grasps, reaching 67.14% success on DexGYS with a 3.83 p.p. gain over prior methods and 96.4% better...
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
cs.RO 2025-12 unverdicted novelty 6.0

mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
cs.RO 2025-11 unverdicted novelty 6.0

AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
cs.RO 2025-08 conditional novelty 6.0

Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks ...
PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction
cs.RO 2026-05 unverdicted novelty 5.0

PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 5.0

VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 5.0

CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 5.0

MoT-HRA learns embodiment-agnostic human-intention priors from a curated 2.2M-episode human video dataset via a three-expert hierarchical vision-language-action model to improve robotic manipulation under distribution shift.
Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection
cs.RO 2026-04 unverdicted novelty 5.0

A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.
Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance
cs.RO 2026-03 unverdicted novelty 5.0

Parameter differences from two training runs on a small task set are treated as auxiliary capability vectors that are merged into a pretrained VLA model, yielding auxiliary-task gains at the cost of ordinary supervise...
Towards Explainable Industrial Anomaly Detection via Knowledge-Guided Latent Reasoning
cs.CV 2026-02 unverdicted novelty 5.0

Reason-IAD improves explainable industrial anomaly detection by combining retrieval-augmented category knowledge with entropy-guided latent reasoning and dynamic visual patch injection in MLLMs.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 27 Pith papers · 26 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 6, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Eagle 2.5: Boosting long-context post-training for frontier vision-language models.arXiv preprint arXiv:2504.15271, 2025

Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, et al. Eagle 2.5: Boosting long-context post-training for frontier vision-language models.arXiv preprint arXiv:2504.15271, 2025. 1

work page arXiv 2025
[7]

Egoplan-bench: Benchmarking egocentric embodied planning with multimodal large language models,

Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. Egoplan-bench: Benchmarking multimodal large language models for human-level planning.arXiv preprint arXiv:2312.06722, 2023. 6, 8, 12, 13, 14

work page arXiv 2023
[8]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, page 02783649241273668, 2023

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, page 02783649241273668, 2023. 5, 6, 12, 15

work page 2023
[10]

Action-free reasoning for policy generalization.arXiv preprint arXiv:2502.03729, 2025

Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization.arXiv preprint arXiv:2502.03729, 2025. 1, 3

work page arXiv 2025
[11]

AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, and Yijie Guo. Aha: A vision-language-model for detecting and reasoning over failures in robotic manipulation.arXiv preprint arXiv:2410.00371, 2024. 3

work page arXiv 2024
[12]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025. 2, 3, 6, 12, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

something something

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceedings of the IEEE international conference on computer vision, pages 5842–585...

work page 2017
[14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2, 3, 4, 5 18 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

work page
[16]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023. 6

work page 2023
[19]

Llara: Supercharging robot learning data for vision- language policy

Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, et al. Llara: Supercharging robot learning data for vision-language policy.arXiv preprint arXiv:2406.20095, 2024. 3

work page arXiv 2024
[20]

Evaluating Real-World Robot Manipulation Policies in Simulation

Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat,IsabelSieh,SeanKirmani,SergeyLevine,JiajunWu,ChelseaFinn,HaoSu,QuanVuong,andTed Xiao. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Hamster: Hierarchical action models for open-world robot manipulation,

Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Raymond Yu, Caelan Reed Garrett, Fabio Ramos, Dieter Fox, Anqi Li, et al. Hamster: Hierarchical action models for open-world robot manipulation. arXiv preprint arXiv:2502.05485, 2025. 3, 12

work page arXiv 2025
[22]

Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025

Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025. 1

work page arXiv 2025
[23]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689–26699, 2024. 1 19 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

work page 2024
[24]

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Bench- marking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310, 2023. 4, 6, 7, 8, 10, 12, 13, 16, 17

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1

work page 2023
[26]

Reflect: Summarizing robot experiences for failure explanation and correction

Zeyi Liu, Arpit Bahety, and Shuran Song. Reflect: Summarizing robot experiences for failure explanation and correction.arXiv preprint arXiv:2306.15724, 2023. 5, 6, 11, 12, 13

work page arXiv 2023
[27]

NVILA: Efficient Frontier Visual Language Models

Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models.arXiv preprint arXiv:2412.04468, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

OpenEQA: Embodied Question Answering in the Era of Foundation Models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, and Aravind Raje...

work page 2024
[30]

Llarva: Vision-action instruction tuning en- hances robot learning

Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, and Roei Herzig. Llarva: Vision-action instruction tuning enhances robot learning.arXiv preprint arXiv:2406.11815, 2024. 3, 4, 12

work page arXiv 2024
[31]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

NVIDIA, Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin, Ye...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024. ...

work page 2024
[34]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002. 7

work page 2002
[35]

EgoPlan-Bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447, 2024

Lu Qiu, Yuying Ge, Yi Chen, Yixiao Ge, Ying Shan, and Xihui Liu. Egoplan-bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447,

work page arXiv
[36]

6, 7, 14, 15 20 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

work page
[37]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR,

work page
[38]

Dynamic time warping algorithm review.Information and Computer Science Department University of Hawaii at Manoa Honolulu, USA, 855(1-23):40, 2008

Pavel Senin. Dynamic time warping algorithm review.Information and Computer Science Department University of Hawaii at Manoa Honolulu, USA, 855(1-23):40, 2008. 5

work page 2008
[39]

Robovqa: Multimodal long-horizon reasoning for robotics

Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024. 5, 6, 7, 8, 9, 12, 13, 14, 15

work page 2024
[40]

Understanding human hands in contact at internet scale

Dandan Shan, Jiaqi Geng, Michelle Shu, and David Fouhey. Understanding human hands in contact at internet scale. InProceedings of the IEEE/CVF international conference on computer vision, 2020. 12

work page 2020
[41]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 3, 4, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models.arXiv preprint arXiv:2502.19417, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, et al. Eagle: Exploring the design space for multimodal llms with mixture of encoders.arXiv preprint arXiv:2408.15998, 2024. 1

work page arXiv 2024
[44]

From multimodal llms to generalist embodied agents: Methods and lessons

Andrew Szot, Bogdan Mazoure, Omar Attia, Aleksei Timofeev, Harsh Agrawal, Devon Hjelm, Zhe Gan, Zsolt Kira, and Alexander Toshev. From multimodal llms to generalist embodied agents: Methods and lessons.arXiv preprint arXiv:2412.08442, 2024. 3

work page arXiv 2024
[45]

Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752,

work page arXiv
[46]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Chain-of-thought reasoning without prompting.arXiv preprint arXiv:2402.10200, 2024

Xuezhi Wang and Denny Zhou. Chain-of-thought reasoning without prompting.arXiv preprint arXiv:2402.10200, 2024. 3

work page arXiv 2024
[49]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022. 1, 3

work page 2022
[50]

Magma: A founda- tion model for multimodal ai agents

Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents.arXiv preprint arXiv:2502.13130, 2025. 3, 4, 10, 12, 16 21 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

work page arXiv 2025
[51]

Demystifying Long Chain-of-Thought Reasoning in LLMs

Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of- thought reasoning in llms.arXiv preprint arXiv:2502.03373, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Robopoint: A vision-language model for spatial affordance prediction for robotics,

WentaoYuan, JiafeiDuan, ValtsBlukis, WilbertPumacay, RanjayKrishna, AdithyavairavanMurali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics.arXiv preprint arXiv:2406.10721, 2024. 3

work page arXiv 2024
[53]

Robotic Control via Embodied Chain-of-Thought Reasoning

Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693, 2024. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024. 6, 13

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models.arXiv preprint arXiv:2503.22020, 2025. 1, 3, 7, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies.arXiv preprint arXiv:2412.10345, 2024. 1, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 1 22

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 6, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[5] [5]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Eagle 2.5: Boosting long-context post-training for frontier vision-language models.arXiv preprint arXiv:2504.15271, 2025

Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, et al. Eagle 2.5: Boosting long-context post-training for frontier vision-language models.arXiv preprint arXiv:2504.15271, 2025. 1

work page arXiv 2025

[7] [7]

Egoplan-bench: Benchmarking egocentric embodied planning with multimodal large language models,

Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. Egoplan-bench: Benchmarking multimodal large language models for human-level planning.arXiv preprint arXiv:2312.06722, 2023. 6, 8, 12, 13, 14

work page arXiv 2023

[8] [8]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, page 02783649241273668, 2023

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, page 02783649241273668, 2023. 5, 6, 12, 15

work page 2023

[10] [10]

Action-free reasoning for policy generalization.arXiv preprint arXiv:2502.03729, 2025

Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization.arXiv preprint arXiv:2502.03729, 2025. 1, 3

work page arXiv 2025

[11] [11]

AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, and Yijie Guo. Aha: A vision-language-model for detecting and reasoning over failures in robotic manipulation.arXiv preprint arXiv:2410.00371, 2024. 3

work page arXiv 2024

[12] [12]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025. 2, 3, 6, 12, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

something something

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceedings of the IEEE international conference on computer vision, pages 5842–585...

work page 2017

[14] [14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2, 3, 4, 5 18 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

work page

[16] [16]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023. 6

work page 2023

[19] [19]

Llara: Supercharging robot learning data for vision- language policy

Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, et al. Llara: Supercharging robot learning data for vision-language policy.arXiv preprint arXiv:2406.20095, 2024. 3

work page arXiv 2024

[20] [20]

Evaluating Real-World Robot Manipulation Policies in Simulation

Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat,IsabelSieh,SeanKirmani,SergeyLevine,JiajunWu,ChelseaFinn,HaoSu,QuanVuong,andTed Xiao. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Hamster: Hierarchical action models for open-world robot manipulation,

Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Raymond Yu, Caelan Reed Garrett, Fabio Ramos, Dieter Fox, Anqi Li, et al. Hamster: Hierarchical action models for open-world robot manipulation. arXiv preprint arXiv:2502.05485, 2025. 3, 12

work page arXiv 2025

[22] [22]

Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025

Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025. 1

work page arXiv 2025

[23] [23]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689–26699, 2024. 1 19 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

work page 2024

[24] [24]

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Bench- marking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310, 2023. 4, 6, 7, 8, 10, 12, 13, 16, 17

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1

work page 2023

[26] [26]

Reflect: Summarizing robot experiences for failure explanation and correction

Zeyi Liu, Arpit Bahety, and Shuran Song. Reflect: Summarizing robot experiences for failure explanation and correction.arXiv preprint arXiv:2306.15724, 2023. 5, 6, 11, 12, 13

work page arXiv 2023

[27] [27]

NVILA: Efficient Frontier Visual Language Models

Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models.arXiv preprint arXiv:2412.04468, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

OpenEQA: Embodied Question Answering in the Era of Foundation Models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, and Aravind Raje...

work page 2024

[30] [30]

Llarva: Vision-action instruction tuning en- hances robot learning

Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, and Roei Herzig. Llarva: Vision-action instruction tuning enhances robot learning.arXiv preprint arXiv:2406.11815, 2024. 3, 4, 12

work page arXiv 2024

[31] [31]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

NVIDIA, Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin, Ye...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024. ...

work page 2024

[34] [34]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002. 7

work page 2002

[35] [35]

EgoPlan-Bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447, 2024

Lu Qiu, Yuying Ge, Yi Chen, Yixiao Ge, Ying Shan, and Xihui Liu. Egoplan-bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447,

work page arXiv

[36] [36]

6, 7, 14, 15 20 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

work page

[37] [37]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR,

work page

[38] [38]

Dynamic time warping algorithm review.Information and Computer Science Department University of Hawaii at Manoa Honolulu, USA, 855(1-23):40, 2008

Pavel Senin. Dynamic time warping algorithm review.Information and Computer Science Department University of Hawaii at Manoa Honolulu, USA, 855(1-23):40, 2008. 5

work page 2008

[39] [39]

Robovqa: Multimodal long-horizon reasoning for robotics

Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024. 5, 6, 7, 8, 9, 12, 13, 14, 15

work page 2024

[40] [40]

Understanding human hands in contact at internet scale

Dandan Shan, Jiaqi Geng, Michelle Shu, and David Fouhey. Understanding human hands in contact at internet scale. InProceedings of the IEEE/CVF international conference on computer vision, 2020. 12

work page 2020

[41] [41]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 3, 4, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models.arXiv preprint arXiv:2502.19417, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, et al. Eagle: Exploring the design space for multimodal llms with mixture of encoders.arXiv preprint arXiv:2408.15998, 2024. 1

work page arXiv 2024

[44] [44]

From multimodal llms to generalist embodied agents: Methods and lessons

Andrew Szot, Bogdan Mazoure, Omar Attia, Aleksei Timofeev, Harsh Agrawal, Devon Hjelm, Zhe Gan, Zsolt Kira, and Alexander Toshev. From multimodal llms to generalist embodied agents: Methods and lessons.arXiv preprint arXiv:2412.08442, 2024. 3

work page arXiv 2024

[45] [45]

Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752,

work page arXiv

[46] [46]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [48]

Chain-of-thought reasoning without prompting.arXiv preprint arXiv:2402.10200, 2024

Xuezhi Wang and Denny Zhou. Chain-of-thought reasoning without prompting.arXiv preprint arXiv:2402.10200, 2024. 3

work page arXiv 2024

[49] [49]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022. 1, 3

work page 2022

[50] [50]

Magma: A founda- tion model for multimodal ai agents

Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents.arXiv preprint arXiv:2502.13130, 2025. 3, 4, 10, 12, 16 21 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

work page arXiv 2025

[51] [51]

Demystifying Long Chain-of-Thought Reasoning in LLMs

Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of- thought reasoning in llms.arXiv preprint arXiv:2502.03373, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[52] [52]

Robopoint: A vision-language model for spatial affordance prediction for robotics,

WentaoYuan, JiafeiDuan, ValtsBlukis, WilbertPumacay, RanjayKrishna, AdithyavairavanMurali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics.arXiv preprint arXiv:2406.10721, 2024. 3

work page arXiv 2024

[53] [53]

Robotic Control via Embodied Chain-of-Thought Reasoning

Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693, 2024. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [54]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024. 6, 13

work page internal anchor Pith review Pith/arXiv arXiv 2024

[55] [55]

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models.arXiv preprint arXiv:2503.22020, 2025. 1, 3, 7, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [56]

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies.arXiv preprint arXiv:2412.10345, 2024. 1, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[57] [57]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 1 22

work page internal anchor Pith review Pith/arXiv arXiv 2025