Recognition: 2 theorem links · Lean Theorem
Latent Action Pretraining from Videos
Pith reviewed 2026-05-14 02:23 UTC · model grok-4.3
The pith
LAPA learns discrete latent actions from unlabeled videos with VQ-VAE, pretrains a VLA model to predict them, and finetunes on small robot datasets to outperform both video-only baselines and labeled SOTA VLA models on language-conditioned manipulation tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions.
Load-bearing premise
That discrete latent actions extracted from human manipulation videos contain sufficient transferable information to map effectively to robot actions during finetuning and yield better generalization than direct supervised training on labeled robot data.
Original abstract
We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation model.
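The quantization step described in the abstract can be illustrated with a minimal sketch. It assumes a hand-written toy codebook and stands in for the learned encoder with a plain feature difference between two frames; the paper's actual model is learned end to end, and the names `CODEBOOK`, `quantize`, and `latent_action` are hypothetical, chosen only for this illustration.

```python
from typing import List, Sequence

# Hypothetical toy codebook: each row is the embedding of one discrete latent action.
CODEBOOK: List[List[float]] = [
    [0.0, 0.0],   # code 0: roughly "no change"
    [1.0, 0.0],   # code 1: change along the first feature axis
    [0.0, 1.0],   # code 2: change along the second feature axis
    [-1.0, 0.0],  # code 3: opposite of code 1
]


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(feature: Sequence[float], codebook: List[List[float]] = CODEBOOK) -> int:
    """Vector-quantization step: index of the nearest codebook entry."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(feature, codebook[k]))


def latent_action(frame_feat: Sequence[float], next_frame_feat: Sequence[float]) -> int:
    """Discrete latent action between two frames.

    Assumption: a simple feature difference replaces the learned encoder.
    """
    delta = [b - a for a, b in zip(frame_feat, next_frame_feat)]
    return quantize(delta)
```

In this toy setting, `latent_action([0.2, 0.1], [1.1, 0.2])` selects code 1, because the frame-to-frame change lies closest to the second codebook row. The latent VLA in the second stage would then be trained to predict such indices from the observation and the task description.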
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LAPA, an unsupervised pretraining approach for Vision-Language-Action (VLA) models. It trains a VQ-VAE on unlabeled human manipulation videos to obtain discrete latent actions, pretrains a VLA to predict these latents from images and language, and finetunes the resulting model on small-scale robot data to map latents to real actions. The central claim is that this yields significant outperformance over existing video-trained manipulation policies and even over state-of-the-art VLAs trained with robotic action labels, specifically on language-conditioned real-world tasks, generalization to unseen objects, and semantic generalization to unseen instructions, while also showing positive transfer from human videos.
Significance. If the results hold, the contribution would be significant for robotics foundation models. Enabling effective pretraining on web-scale unlabeled video data without action labels directly addresses the data bottleneck that currently limits VLA scaling and generalization. The reported positive transfer from human videos and outperformance on held-out object and instruction tasks would indicate a practical path toward more data-efficient robot learning.
major comments (2)
- [Experiments] §Experiments (results tables and text): the abstract and main results claim clear outperformance on real-robot tasks, yet the manuscript provides no full details on baseline implementations, exact metrics, data splits, or ablation controls. This is load-bearing for the central claim that latent pretraining, rather than other factors, drives the gains over both video-only methods and labeled SOTA VLAs.
- [Method] Method (VQ-VAE pretraining and finetuning pipeline): no reconstruction metrics, cross-embodiment alignment scores, or ablation that replaces the VQ-VAE latent objective with a standard reconstruction or contrastive loss are reported. Without these, it remains unclear whether the discrete codes encode robot-controllable dynamics (as required for successful finetuning transfer) or primarily visual appearance changes.
minor comments (2)
- [Abstract] Abstract: the phrasing 'significantly outperforms existing techniques that train robot manipulation policies from large-scale videos' should be accompanied by a brief parenthetical reference to the specific baselines used.
- [Method] Notation: ensure consistent use of z for latent actions versus a for robot actions throughout the method and results sections to avoid reader confusion during the finetuning description.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to provide the requested details and analyses.
Point-by-point responses
- Referee: [Experiments] §Experiments (results tables and text): the abstract and main results claim clear outperformance on real-robot tasks, yet the manuscript provides no full details on baseline implementations, exact metrics, data splits, or ablation controls. This is load-bearing for the central claim that latent pretraining, rather than other factors, drives the gains over both video-only methods and labeled SOTA VLAs.
  Authors: We agree that additional experimental details are required to substantiate the claims. In the revised manuscript we have expanded the Experiments section with full descriptions of baseline implementations (including how video-only and labeled SOTA VLA models were reproduced), exact success-rate metrics, training/validation/test splits, and new ablation controls that isolate the contribution of latent pretraining. These additions confirm that performance gains are attributable to the proposed method.
  Revision: yes
- Referee: [Method] Method (VQ-VAE pretraining and finetuning pipeline): no reconstruction metrics, cross-embodiment alignment scores, or ablation that replaces the VQ-VAE latent objective with a standard reconstruction or contrastive loss are reported. Without these, it remains unclear whether the discrete codes encode robot-controllable dynamics (as required for successful finetuning transfer) or primarily visual appearance changes.
  Authors: We acknowledge the value of these diagnostics. The revised manuscript now reports VQ-VAE reconstruction metrics on both human and robot videos, cross-embodiment alignment scores between the learned discrete codes, and an ablation that replaces the VQ-VAE objective with a standard reconstruction loss and a contrastive loss. The new results show that the discrete latent-action formulation yields superior downstream transfer, supporting that the codes capture controllable dynamics rather than mere visual changes.
  Revision: yes
Circularity Check
No circularity: standard VQ-VAE + separate finetuning pipeline
Full rationale
The derivation consists of three sequential stages: (1) train a VQ-VAE on unlabeled video frames to obtain discrete latent actions z via a standard reconstruction objective, (2) pretrain a VLA to predict z from images and language, (3) finetune the VLA on small robot datasets to map z to real actions a. None of these stages defines a quantity in terms of itself, renames a fitted parameter as a prediction, or relies on a self-citation chain for the central transfer claim. The reported gains are measured on held-out robot tasks after finetuning and are not algebraically forced by the pretraining loss. The pipeline is therefore self-contained against external benchmarks.
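As a rough illustration of why the three stages do not feed into one another circularly, the objectives can be written out as follows. This is a sketch under stated assumptions, not the paper's formulation: stage 1 is given in the standard VQ-VAE form, which the paper's exact variant may not match, and the symbols E (encoder), D (decoder), e_k (codebook entries), sg (stop-gradient), β (commitment weight), x_t (frame at time t), and ℓ (language instruction) are introduced here for illustration only.

```latex
% Stage 1: latent action quantization (standard VQ-VAE form, assumed)
z = \arg\min_k \lVert E(x_t, x_{t+1}) - e_k \rVert_2
\mathcal{L}_{\mathrm{VQ}} = \lVert x_{t+1} - D(x_t, e_z) \rVert_2^2
  + \lVert \operatorname{sg}[E(x_t, x_{t+1})] - e_z \rVert_2^2
  + \beta \, \lVert E(x_t, x_{t+1}) - \operatorname{sg}[e_z] \rVert_2^2

% Stage 2: latent pretraining, the VLA predicts the discrete code (assumed cross-entropy form)
\mathcal{L}_{\mathrm{pre}} = -\log p_\theta(z \mid x_t, \ell)

% Stage 3: finetuning on labeled robot data, the VLA predicts the robot action (assumed form)
\mathcal{L}_{\mathrm{ft}} = -\log p_\theta(a_t \mid x_t, \ell)
```

Under this reading, the robot action a_t appears only in the third objective, so success on held-out robot tasks is not fixed by the first two losses.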
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: VQ-VAE objective on consecutive video frames produces discrete codes that represent meaningful actions
invented entities (1)
- latent actions: no independent evidence
Lean theorems connected to this paper
- Foundation.HierarchyEmergence: hierarchy_emergence_forces_phi [unclear]
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
- RotVLA: Rotational Latent Action for Vision-Language-Action Model
  RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
- ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
  ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
- SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation
  SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.
- Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
  NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
- Latent State Design for World Models under Sufficiency Constraints
  World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
- Being-H0.7: A Latent World-Action Model from Egocentric Videos
  Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
  RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
- ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
  π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
- CUBic: Coordinated Unified Bimanual Perception and Control Framework
  CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordinatio...
- ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
  ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
- Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
  NOVA represents scene states as INR weights for analytical rendering without decoders and achieves structural disentanglement of content and dynamics in video world models.
- When to Trust Imagination: Adaptive Action Execution for World Action Models
  Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
- When to Trust Imagination: Adaptive Action Execution for World Action Models
  A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
- From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
  A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
- Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
  A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...
- LA-Pose: Latent Action Pretraining Meets Pose Estimation
  LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...
- GazeVLA: Learning Human Intention for Robotic Manipulation
  GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
- Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training
  Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.
- UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
  UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
- A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies
  Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.
- EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
  EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...
- Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
  Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
- FAST: Efficient Action Tokenization for Vision-Language-Action Models
  FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
- Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
  Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
- From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
  A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.