arxiv: 2410.11758 · v2 · submitted 2024-10-15 · cs.RO · cs.CL · cs.CV · cs.LG

Latent Action Pretraining from Videos

Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo

Pith reviewed 2026-05-14 02:23 UTC · model grok-4.3

classification cs.RO · cs.CL · cs.CV · cs.LG

keywords action · latent · robot · labels · manipulation · model · pretraining · videos

The pith

LAPA learns discrete latent actions from unlabeled videos with VQ-VAE, pretrains a VLA model to predict them, and finetunes on small robot datasets to outperform both video-only baselines and labeled SOTA VLA models on language-conditioned manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The method starts by training a VQ-VAE on pairs of video frames to turn the visual change between them into one of a small number of discrete codes called latent actions. A large vision-language model is then pretrained to output the correct code given the current image and a task description in words. Finally, a small set of real robot demonstrations is used to learn a mapping from those codes to actual motor commands. Experiments on real-world tasks show gains in language following, handling new objects, and following new instructions compared with prior approaches that either require full action labels or train only on videos without this latent step.
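
As a rough illustration of the first step, the sketch below turns the change between two frames into a discrete code by nearest-neighbor lookup in a codebook. It is a toy under stated assumptions: frames are flat lists of floats, the "encoder" is a plain difference, and the names `frame_delta`, `quantize`, and `CODEBOOK` are hypothetical. The paper uses a learned VQ-VAE encoder and codebook rather than anything this simple.

```python
# Toy sketch: quantize the change between two frames into a discrete code.
# Assumption: frames are flat lists of floats and the "encoder" is a plain
# difference. A real VQ-VAE learns both the encoder and the codebook.

def frame_delta(frame_t, frame_next):
    """Feature describing the visual change between two frames."""
    return [b - a for a, b in zip(frame_t, frame_next)]

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def quantize(feature, codebook):
    """Return the index of the codebook entry nearest to the feature."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(feature, codebook[k]))

# Hypothetical 3-entry codebook over 2-D deltas:
# index 0 = no change, index 1 = shift along x, index 2 = shift along y.
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

frame_t = [0.2, 0.5]
frame_next = [1.1, 0.6]
latent_action = quantize(frame_delta(frame_t, frame_next), CODEBOOK)
print(latent_action)
```

The integer returned by `quantize` plays the role of the latent action: it is the label the vision-language model would later be trained to predict from the current image and the task description.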

Core claim

Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions.

Load-bearing premise

That discrete latent actions extracted from human manipulation videos contain sufficient transferable information to map effectively to robot actions during finetuning and yield better generalization than direct supervised training on labeled robot data.

Original abstract

We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LAPA shows you can pretrain VLAs on unlabeled human videos via VQ-VAE latents and still beat labeled baselines on real-robot generalization, but the transfer step from latent codes to controllable actions is the part that needs the most checking.

The core idea is straightforward: train a VQ-VAE on pairs of video frames to get discrete latent actions, pretrain a VLA to predict those latents from images plus language, then finetune the same model on a modest set of real robot trajectories to output actual actions. This sidesteps the need for action labels during the large-scale pretraining phase and lets the method pull from internet videos.

That pipeline is the new piece; prior video-only work did not close the loop to robot actions this way, and the reported results show gains over both video-based policies and the current best labeled VLA on language-conditioned manipulation, unseen objects, and unseen instructions. The positive transfer from human videos alone is also worth noting because it directly addresses the data bottleneck everyone complains about. The experiments appear to be run on real hardware with multiple tasks, which is better than many simulation-heavy papers.

The main soft spot is the assumption that the VQ-VAE codes actually encode robot-usable dynamics rather than just visual change. If the latents mostly track appearance shifts that do not map cleanly to joint or gripper commands, the finetuning advantage could shrink or disappear. The abstract does not give reconstruction metrics or ablations that isolate the latent pretraining step, so those details will matter for how much credit the method deserves. Minor issues like baseline implementation details or exact data splits can be cleaned up in revision.

This paper is aimed at groups trying to scale VLA training beyond curated teleoperation sets. Readers working on foundation models for robotics will get immediate value from the approach even if they end up tweaking the quantization stage. It is coherent on its own terms and engages the right literature, so it deserves a serious referee rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LAPA, an unsupervised pretraining approach for Vision-Language-Action (VLA) models. It trains a VQ-VAE on unlabeled human manipulation videos to obtain discrete latent actions, pretrains a VLA to predict these latents from images and language, and finetunes the resulting model on small-scale robot data to map latents to real actions. The central claim is that this yields significant outperformance over existing video-trained manipulation policies and even over state-of-the-art VLAs trained with robotic action labels, specifically on language-conditioned real-world tasks, generalization to unseen objects, and semantic generalization to unseen instructions, while also showing positive transfer from human videos.

Significance. If the results hold, the contribution would be significant for robotics foundation models. Enabling effective pretraining on web-scale unlabeled video data without action labels directly addresses the data bottleneck that currently limits VLA scaling and generalization. The reported positive transfer from human videos and outperformance on held-out object and instruction tasks would indicate a practical path toward more data-efficient robot learning.

major comments (2)

[Experiments] §Experiments (results tables and text): the abstract and main results claim clear outperformance on real-robot tasks, yet the manuscript provides no full details on baseline implementations, exact metrics, data splits, or ablation controls. This is load-bearing for the central claim that latent pretraining, rather than other factors, drives the gains over both video-only methods and labeled SOTA VLAs.
[Method] Method (VQ-VAE pretraining and finetuning pipeline): no reconstruction metrics, cross-embodiment alignment scores, or ablation that replaces the VQ-VAE latent objective with a standard reconstruction or contrastive loss are reported. Without these, it remains unclear whether the discrete codes encode robot-controllable dynamics (as required for successful finetuning transfer) or primarily visual appearance changes.
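
One hypothetical way to make the second major comment concrete is to compare how the discrete codes are used on human versus robot video. The sketch below computes codebook perplexity (the exponential of the entropy of the code distribution, which drops toward 1 when the codebook collapses) and the total variation distance between two code-usage histograms. These are common generic diagnostics chosen for illustration; they are not the metrics the referee names or anything the paper is known to report, and the code sequences are made up.

```python
import math
from collections import Counter

def code_distribution(codes, codebook_size):
    """Empirical probability of each code index in a sequence of codes."""
    counts = Counter(codes)
    total = len(codes)
    return [counts.get(k, 0) / total for k in range(codebook_size)]

def perplexity(dist):
    """exp(entropy): ranges from 1 (collapsed) to codebook_size (uniform)."""
    entropy = -sum(p * math.log(p) for p in dist if p > 0)
    return math.exp(entropy)

def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Made-up code sequences standing in for quantized human and robot clips.
K = 4
human_codes = [0, 1, 1, 2, 3, 3, 3, 0]
robot_codes = [0, 1, 2, 2, 3, 3, 1, 0]

p_human = code_distribution(human_codes, K)
p_robot = code_distribution(robot_codes, K)

print("human perplexity:", perplexity(p_human))
print("robot perplexity:", perplexity(p_robot))
print("usage gap (TV):", total_variation(p_human, p_robot))
```

A small usage gap would only show that the two embodiments draw on the same codes, not that a given code means the same motion in both, so a check like this could complement but not replace the ablations requested above.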

minor comments (2)

[Abstract] Abstract: the phrasing 'significantly outperforms existing techniques that train robot manipulation policies from large-scale videos' should be accompanied by a brief parenthetical reference to the specific baselines used.
[Method] Notation: ensure consistent use of z for latent actions versus a for robot actions throughout the method and results sections to avoid reader confusion during the finetuning description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to provide the requested details and analyses.

Referee: [Experiments] §Experiments (results tables and text): the abstract and main results claim clear outperformance on real-robot tasks, yet the manuscript provides no full details on baseline implementations, exact metrics, data splits, or ablation controls. This is load-bearing for the central claim that latent pretraining, rather than other factors, drives the gains over both video-only methods and labeled SOTA VLAs.

Authors: We agree that additional experimental details are required to substantiate the claims. In the revised manuscript we have expanded the Experiments section with full descriptions of baseline implementations (including how video-only and labeled SOTA VLA models were reproduced), exact success-rate metrics, training/validation/test splits, and new ablation controls that isolate the contribution of latent pretraining. These additions confirm that performance gains are attributable to the proposed method.
revision: yes

Referee: [Method] Method (VQ-VAE pretraining and finetuning pipeline): no reconstruction metrics, cross-embodiment alignment scores, or ablation that replaces the VQ-VAE latent objective with a standard reconstruction or contrastive loss are reported. Without these, it remains unclear whether the discrete codes encode robot-controllable dynamics (as required for successful finetuning transfer) or primarily visual appearance changes.

Authors: We acknowledge the value of these diagnostics. The revised manuscript now reports VQ-VAE reconstruction metrics on both human and robot videos, cross-embodiment alignment scores between the learned discrete codes, and an ablation that replaces the VQ-VAE objective with a standard reconstruction loss and a contrastive loss. The new results show that the discrete latent-action formulation yields superior downstream transfer, supporting that the codes capture controllable dynamics rather than mere visual changes.
revision: yes

Circularity Check

0 steps flagged

No circularity: standard VQ-VAE + separate finetuning pipeline

The derivation consists of three sequential stages: (1) train a VQ-VAE on unlabeled video frames to obtain discrete latent actions z via a standard reconstruction objective, (2) pretrain a VLA to predict z from images and language, (3) finetune the VLA on small robot datasets to map z to real actions a. None of these stages defines a quantity in terms of itself, renames a fitted parameter as a prediction, or relies on a self-citation chain for the central transfer claim. The reported gains are measured on held-out robot tasks after finetuning and are not algebraically forced by the pretraining loss. The pipeline is therefore checked against external benchmarks rather than against its own definitions.
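
To make the separation between stages concrete, the sketch below isolates stage (3) as a toy: given a few labeled robot pairs (z, a), it builds a lookup from each latent code z to the mean robot action a observed for it, then decodes a sequence of predicted codes. This is a deliberately simplified stand-in and an assumption for illustration only. In the paper the VLA itself is finetuned to output robot actions; the function `fit_latent_to_action` and the demo data are hypothetical.

```python
from collections import defaultdict

def fit_latent_to_action(pairs):
    """Map each latent code z to the mean robot action a seen with it.

    pairs: iterable of (z, a), where z is an int code and a is a list of
    floats. Toy stand-in for finetuning, not the paper's procedure.
    """
    sums = {}
    counts = defaultdict(int)
    for z, a in pairs:
        if z not in sums:
            sums[z] = [0.0] * len(a)
        sums[z] = [s + x for s, x in zip(sums[z], a)]
        counts[z] += 1
    return {z: [s / counts[z] for s in sums[z]] for z in sums}

# Made-up small robot dataset: (latent code, 2-D action) pairs.
robot_demos = [
    (0, [0.0, 0.0]),
    (1, [1.0, 0.0]),
    (1, [0.5, 0.0]),
    (2, [0.0, -1.0]),
]
decoder = fit_latent_to_action(robot_demos)

# Codes that a pretrained latent policy might emit for four timesteps.
predicted_latents = [1, 1, 2, 0]
actions = [decoder[z] for z in predicted_latents]
print(actions)
```

Because the decoder is fit only on labeled robot data and evaluated on separate predictions, nothing in this step is defined in terms of the pretraining loss, which is the property the circularity check relies on.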

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that VQ-VAE codes learned from human videos capture action semantics transferable to robots. No free parameters are explicitly named in the abstract, but codebook size and quantization temperature are typical of such models.

axioms (1)

domain assumption: A VQ-VAE objective on consecutive video frames produces discrete codes that represent meaningful actions.
Invoked in the first training stage to create the latent action vocabulary.
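
For reference, the standard VQ-VAE objective can be written as below. Whether LAPA uses exactly these terms is an assumption; the abstract only says the objective is VQ-VAE-based. Here $z_e$ is the encoder output for a frame pair $(x_t, x_{t+1})$, $e_j$ are the codebook vectors, $\mathrm{sg}[\cdot]$ is the stop-gradient operator, $\hat{x}_{t+1}$ is the decoder's reconstruction of the next frame, and $\beta$ is the commitment weight.

```latex
k = \arg\min_{j \in \{1,\dots,K\}} \lVert z_e - e_j \rVert_2 ,
\qquad
\mathcal{L}
  = \underbrace{\lVert x_{t+1} - \hat{x}_{t+1} \rVert_2^2}_{\text{reconstruction}}
  + \underbrace{\lVert \mathrm{sg}[z_e] - e_k \rVert_2^2}_{\text{codebook}}
  + \beta \, \underbrace{\lVert z_e - \mathrm{sg}[e_k] \rVert_2^2}_{\text{commitment}}
```

Under this form, the codebook size $K$ and the weight $\beta$ are the quantities a reader would expect to see tuned, and the index $k$ is what the ledger calls a latent action.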

invented entities (1)

latent actions (no independent evidence)
purpose: Discrete codes serving as proxy action labels for large-scale pretraining
Newly introduced representation that bridges unlabeled video and robot control.

pith-pipeline@v0.9.0 · 5571 in / 1340 out tokens · 64506 ms · 2026-05-14T02:23:37.058623+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

Relation between the paper passage and the cited Recognition theorem.

Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RotVLA: Rotational Latent Action for Vision-Language-Action Model
cs.RO 2026-05 unverdicted novelty 7.0

RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation
cs.RO 2026-05 unverdicted novelty 7.0

SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 7.0

NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
Latent State Design for World Models under Sufficiency Constraints
cs.AI 2026-05 unverdicted novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO 2026-04 unverdicted novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
CUBic: Coordinated Unified Bimanual Perception and Control Framework
cs.RO 2026-05 unverdicted novelty 6.0

CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordinatio...
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 6.0

NOVA represents scene states as INR weights for analytical rendering without decoders and achieves structural disentanglement of content and dynamics in video world models.
When to Trust Imagination: Adaptive Action Execution for World Action Models
cs.RO 2026-05 unverdicted novelty 6.0

Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
When to Trust Imagination: Adaptive Action Execution for World Action Models
cs.RO 2026-05 unverdicted novelty 6.0

A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
cs.RO 2026-05 unverdicted novelty 6.0

A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...
LA-Pose: Latent Action Pretraining Meets Pose Estimation
cs.CV 2026-04 unverdicted novelty 6.0

LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...
GazeVLA: Learning Human Intention for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training
cs.RO 2026-04 unverdicted novelty 6.0

Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
cs.RO 2026-04 unverdicted novelty 6.0

UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies
cs.RO 2026-04 unverdicted novelty 6.0

Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.
EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
cs.RO 2026-04 unverdicted novelty 6.0

EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
cs.RO 2025-04 unverdicted novelty 6.0

Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
FAST: Efficient Action Tokenization for Vision-Language-Action Models
cs.RO 2025-01 unverdicted novelty 6.0

FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
cs.CV 2024-12 unverdicted novelty 6.0

Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
cs.RO 2026-04 accept novelty 5.0

A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.