arXiv: 2506.09985 · v1 · submitted 2025-06-11 · cs.AI · cs.CV · cs.LG · cs.RO


V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley,


Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas


Pith reviewed 2026-05-11 00:27 UTC · model grok-4.3


keywords: self-supervised learning · video models · world models · robotic planning · joint embedding predictive architecture · zero-shot deployment · action prediction


The pith

Self-supervised video models trained on internet data plus limited robot footage can plan physical actions zero-shot on new robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that a self-supervised joint-embedding predictive architecture can learn rich world representations mostly by watching internet video, then use a small amount of unlabeled robot video to support planning in physical environments. A sympathetic reader would care because the approach avoids the usual need for task-specific data collection, rewards, or training on the robots where the model will actually be deployed. The authors pre-train V-JEPA 2 on over one million hours of web video to reach strong motion understanding and action anticipation results, align it with a language model for video question answering, and then post-train an action-conditioned version on under 62 hours of Droid robot trajectories. This enables zero-shot pick-and-place planning on Franka arms in two new labs using only image goals.

Core claim

The central claim has two parts. First, pre-training an action-free joint-embedding predictive architecture, V-JEPA 2, on more than one million hours of internet video produces representations that support strong motion understanding and anticipation. Second, post-training an action-conditioned extension, V-JEPA 2-AC, on less than 62 hours of unlabeled robot video from the Droid dataset yields a latent world model capable of zero-shot planning for object manipulation on unseen Franka robot arms across different laboratories, without any data collected from the target robots and without any task-specific training or reward.

What carries the argument

V-JEPA 2, the self-supervised joint-embedding predictive architecture that learns to predict future video embeddings from past ones without actions, extended to the action-conditioned V-JEPA 2-AC variant that incorporates latent actions to model and plan future states.
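
To make "predict future video embeddings from past ones" concrete, here is a minimal sketch of a latent prediction loss. It is an illustration under stated assumptions, not the authors' implementation: the linear encoder, the averaging predictor, the L1 distance, and every name and size below are hypothetical choices made for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames, W_enc):
    """Toy encoder (assumption): flatten each frame and project it linearly."""
    flat = frames.reshape(frames.shape[0], -1)
    return flat @ W_enc

def predict_next(context_emb, W_pred):
    """Toy predictor (assumption): map the mean past embedding to a future embedding."""
    return context_emb.mean(axis=0) @ W_pred

def latent_prediction_loss(video, W_enc, W_pred, n_context):
    """Mean L1 distance between the predicted and the actual embedding
    of the frame that follows the context window."""
    emb = encode(video, W_enc)
    target = emb[n_context]  # during training this branch would receive no gradient
    pred = predict_next(emb[:n_context], W_pred)
    return float(np.abs(pred - target).mean())

# Hypothetical sizes: 8 frames of 16x16 grayscale video, 32-dimensional embeddings.
video = rng.normal(size=(8, 16, 16))
W_enc = rng.normal(size=(16 * 16, 32)) / 16.0
W_pred = np.eye(32)
loss = latent_prediction_loss(video, W_enc, W_pred, n_context=4)
```

The point of the sketch is that the loss is measured between embeddings rather than pixels, so the model is never asked to reconstruct the future frame itself.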

If this is right

The pretrained model reaches 77.3 top-1 accuracy on Something-Something v2 motion understanding and 39.7 recall-at-5 on Epic-Kitchens-100 action anticipation.
After alignment with a language model, the 8-billion-parameter version achieves state-of-the-art results on multiple video question-answering benchmarks such as 84.0 on PerceptionTest.
The action-conditioned world model supports planning sequences of actions to reach image-specified goals on physical robots.
Planning succeeds without collecting any interaction data from the target robots or using task-specific rewards or supervision.
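
The benchmark numbers in the first item above are a top-1 accuracy and a recall-at-5. As a rough illustration, both can be read as "how often is the true label among the k highest-scoring classes". The function below is a simplified per-sample version with made-up scores; the official Epic-Kitchens-100 protocol may compute recall differently (for example, averaged per class), so treat this as a sketch of the idea rather than the benchmark's scoring code.

```python
import numpy as np

def top_k_hit_rate(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    top_k = np.argsort(-scores, axis=1)[:, :k]      # indices of the k best classes
    hits = (top_k == labels[:, None]).any(axis=1)   # True where the label was retrieved
    return float(hits.mean())

# Made-up scores for 3 samples over 3 classes (illustrative only, not paper data).
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_hit_rate(scores, labels, k=1)
top2 = top_k_hit_rate(scores, labels, k=2)
```

With k=1 the function reduces to ordinary top-1 accuracy; larger k gives the more forgiving recall-style number used for anticipation tasks, where several next actions may be plausible.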

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Scaling the internet video pretraining further could reduce the amount of robot video needed for new planning tasks.
The same self-supervised approach might support prediction and planning in other embodied settings beyond the tested pick-and-place tasks.
If the representations generalize broadly, similar models could be adapted for non-robotic physical forecasting problems with minimal new data.

Load-bearing premise

Representations learned from internet video plus a small set of unlabeled Droid robot trajectories will transfer to new robot arms and physical environments without any additional data or training from those target setups.

What would settle it

If V-JEPA 2-AC generates action plans that fail to achieve successful pick-and-place object manipulation on the Franka arms when given only image goals in the two new labs, the zero-shot planning claim would not hold.
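
For readers unfamiliar with planning toward an image goal in a latent world model, the loop being tested can be pictured roughly as follows. This is a sketch under loud assumptions: the abstract does not say which optimizer is used, so the cross-entropy method here is a swapped-in, commonly used choice; `toy_step` is a stand-in for a learned action-conditioned predictor; and the L1 goal distance, names, and sizes are all hypothetical.

```python
import numpy as np

def plan_actions(z0, z_goal, step, horizon=3, action_dim=2,
                 n_samples=256, n_elite=32, n_iters=10, seed=0):
    """Cross-entropy method (assumed planner): sample action sequences, roll them
    out in latent space, and refit a Gaussian to the sequences ending nearest the goal."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        candidates = mean + std * rng.normal(size=(n_samples, horizon, action_dim))
        costs = np.empty(n_samples)
        for i, seq in enumerate(candidates):
            z = z0
            for a in seq:
                z = step(z, a)                      # predicted next latent state
            costs[i] = np.abs(z - z_goal).sum()     # L1 distance to the goal embedding
        elite = candidates[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

def toy_step(z, a):
    """Stand-in dynamics (assumption): the action simply shifts the latent state."""
    return z + a

z0 = np.zeros(2)                    # embedding of the current observation (toy)
z_goal = np.array([1.5, -0.6])      # embedding of the goal image (toy)
plan = plan_actions(z0, z_goal, toy_step)
z_final = z0 + plan.sum(axis=0)     # where the planned actions lead under toy_step
```

In a real deployment only the first planned action would typically be executed before replanning from the new observation; the falsification test described above asks whether plans of this kind actually produce successful pick-and-place on hardware.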

Original abstract

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

V-JEPA 2 scales JEPA-style video pretraining to a million hours and adds a lightweight robot post-training step that produces zero-shot planning on new Franka arms, but the cross-embodiment transfer still rests on untested assumptions.


The core result is that pretraining an action-free joint-embedding predictor on over a million hours of internet video yields competitive numbers on Something-Something v2 and Epic-Kitchens, then a quick alignment with an LLM pushes video QA scores to the top at 8B scale. Adding less than 62 hours of unlabeled Droid trajectories lets them train a latent action-conditioned head that plans pick-and-place on Franka arms in two external labs with no target-robot data or task rewards. That combination of scale and deployment is the actual novelty relative to earlier JEPA papers.

Referee Report

2 major / 1 minor

Summary. The paper introduces V-JEPA 2, a self-supervised joint-embedding predictive architecture pre-trained on over 1 million hours of internet video. It reports 77.3 top-1 accuracy on Something-Something v2 for motion understanding and 39.7 recall-at-5 on Epic-Kitchens-100 for action anticipation, surpassing prior task-specific models. After alignment with a large language model, it achieves SOTA video QA results at 8B scale (e.g., 84.0 on PerceptionTest). Post-training an action-conditioned variant V-JEPA 2-AC on less than 62 hours of unlabeled Droid robot videos enables zero-shot deployment on Franka arms in two external labs for image-goal-based pick-and-place planning, without any target-robot data collection or task-specific training.

Significance. If the zero-shot robot results are substantiated with detailed methods and ablations, the work would be significant for demonstrating that web-scale self-supervised video pre-training combined with minimal unlabeled robot data can produce world models supporting physical understanding, prediction, and planning. This approach could reduce dependence on extensive task-specific or environment-specific data in robotics, building on prior JEPA-style predictive representations with concrete benchmark numbers on standard video tasks.

major comments (2)

[Abstract] The headline claim of zero-shot transfer of V-JEPA 2-AC from Droid videos to unseen Franka arms in new labs (without any data from the target robots) is load-bearing for the central contribution on physical planning. However, the abstract supplies no quantitative success rates, number of trials, or ablation removing the <62h Droid post-training, leaving open whether the reported capability stems from the learned latent dynamics or other factors.
[Abstract] No details are provided on the image-goal planner, how latent actions are decoded to executable robot controls, or how the forward model handles out-of-distribution states under new camera setups and embodiments. These elements are necessary to evaluate whether the claimed invariances in latent dynamics actually hold.

minor comments (1)

[Abstract] The abstract states that V-JEPA 2 is 'aligned with a large language model' for video QA but does not specify the alignment procedure, architecture, or training details, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of the work's potential significance and for the constructive comments focused on the abstract. We address each major comment below and have revised the manuscript to strengthen the presentation of the zero-shot planning results.


Referee: [Abstract] The headline claim of zero-shot transfer of V-JEPA 2-AC from Droid videos to unseen Franka arms in new labs (without any data from the target robots) is load-bearing for the central contribution on physical planning. However, the abstract supplies no quantitative success rates, number of trials, or ablation removing the <62h Droid post-training, leaving open whether the reported capability stems from the learned latent dynamics or other factors.

Authors: We agree that the abstract would benefit from including key quantitative results to make the central claim more self-contained. The success rates, trial counts, and ablation studies (including the performance drop when removing the Droid post-training) are reported in detail in Section 5 and the associated appendix. We have revised the abstract to incorporate a concise summary of these quantitative findings and the ablation results. revision: yes

Referee: [Abstract] No details are provided on the image-goal planner, how latent actions are decoded to executable robot controls, or how the forward model handles out-of-distribution states under new camera setups and embodiments. These elements are necessary to evaluate whether the claimed invariances in latent dynamics actually hold.

Authors: We appreciate the referee highlighting the need for greater clarity on these components. The image-goal planner, decoding of latent actions to controls, and the forward model's approach to out-of-distribution states (including invariance to new camera setups and embodiments) are described in Sections 3.3 and 4.2, with additional implementation details in the supplementary material. We have updated the abstract to include a brief overview of these elements. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline rests on external benchmarks and deployments


The paper's core claims are established through pre-training on public internet video corpora, post-training on the public Droid dataset, and evaluation on standard held-out benchmarks (Something-Something v2, Epic-Kitchens-100, PerceptionTest, TempCompass) plus physical robot deployments in external labs. No equations, fitted parameters, or predictions are shown to reduce by construction to quantities derived from the evaluation sets themselves. Self-citations to prior JEPA work describe the architectural lineage but do not serve as the sole justification for the reported performance numbers, which remain independently verifiable on public test sets and real hardware.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 2 invented entities

The claims rest on the assumption that joint-embedding predictive training on video captures transferable world knowledge and that a small amount of unlabeled robot video suffices to adapt the model for planning; several model-scale and data-volume choices are free parameters.

free parameters (3)

model parameter count (8 billion)
Scale chosen for the language-model alignment experiments
pre-training video volume (over 1 million hours)
Internet-scale data quantity used for V-JEPA 2 pre-training
robot video volume (less than 62 hours)
Amount of unlabeled Droid data used for V-JEPA 2-AC post-training

axioms (2)

domain assumption: Self-supervised joint-embedding predictive training on video produces representations useful for motion understanding and anticipation
Core premise of V-JEPA 2
domain assumption: A latent action-conditioned world model trained on unlabeled robot video can support image-goal planning without task-specific rewards or data
Core premise of V-JEPA 2-AC and zero-shot deployment

invented entities (2)

V-JEPA 2 (no independent evidence)
purpose: Large-scale self-supervised video model for understanding and prediction
New model architecture and training run introduced in the paper
V-JEPA 2-AC (no independent evidence)
purpose: Action-conditioned variant for robotic planning
Post-trained model for zero-shot planning



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
cs.RO 2026-04 conditional novelty 8.0

Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.
Coding Agent Is Good As World Simulator
cs.AI 2026-05 unverdicted novelty 7.0

A multi-agent framework generates and refines executable physics simulation code from prompts to create world models that enforce physical constraints, claiming superior accuracy and fidelity over video-based alternatives.
JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...
Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
cs.LG 2026-05 unverdicted novelty 7.0

Clin-JEPA supplies a multi-phase co-training method for JEPA pretraining on EHR trajectories that achieves converging latent rollouts and improved multi-task AUROC on MIMIC-IV data.
Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
cs.CV 2026-05 unverdicted novelty 7.0

Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temp...
Learning Visual Feature-Based World Models via Residual Latent Action
cs.CV 2026-05 unverdicted novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
A foundation model of vision, audition, and language for in-silico neuroscience
q-bio.NC 2026-05 unverdicted novelty 7.0

TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.
Latent State Design for World Models under Sufficiency Constraints
cs.AI 2026-05 unverdicted novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
cs.AI 2026-04 unverdicted novelty 7.0

Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
cs.RO 2026-04 unverdicted novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
AnimationBench: Are Video Models Good at Character-Centric Animation?
cs.CV 2026-04 unverdicted novelty 7.0

AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models
cs.CV 2026-04 unverdicted novelty 7.0

GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
Action Images: End-to-End Policy Learning via Multiview Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing
cs.RO 2026-04 conditional novelty 7.0

StarVLA delivers a Lego-like open-source framework for VLA models with swappable backbones and action heads, reusable training methods, and unified evaluation across major benchmarks.
Factorization Regret mediates compositional generalization in latent space
cs.LG 2026-03 unverdicted novelty 7.0

Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.
Emergent Compositional Communication for Latent World Properties
cs.MA 2026-03 conditional novelty 7.0

Multi-agent iterated learning produces emergent positionally disentangled communication protocols for latent physical properties from unsupervised video features.
Feature Visualization Recovers Known Cortical Selectivity from TRIBE v2
q-bio.NC 2026-05 unverdicted novelty 6.0

Feature visualization on TRIBE v2 brain encoders recovers the known ventral visual hierarchy from V1 to V4 and produces distinctive patterns for MT, FFA, and PPA, with optimized stimuli driving ~4x higher activation t...
Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos
cs.CV 2026-05 unverdicted novelty 6.0

A noise-aware contrastive loss built on temporal self-supervision learns polyp tracklet representations from 27 videos that outperform prior self-supervised and supervised baselines and match foundation models on retr...
Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
cs.AI 2026-05 unverdicted novelty 6.0

In configurable enterprise systems, runtime discovery of transition dynamics from system configuration is more robust to deployment shifts than offline-trained world models.
The DAWN of World-Action Interactive Models
cs.CV 2026-05 unverdicted novelty 6.0

DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.
Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
cs.LG 2026-05 unverdicted novelty 6.0

A five-phase co-training framework enables stable JEPA pretraining on EHR trajectories, producing converging latent rollouts and higher multi-task AUROC than baselines on MIMIC-IV ICU data.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...
Latent Geometry Beyond Search: Amortizing Planning in World Models
cs.RO 2026-05 unverdicted novelty 6.0

In regularized latent spaces of world models, planning can be amortized into a goal-conditioned inverse dynamics model that matches CEM performance at 100-130x lower per-decision cost.
Predictive but Not Plannable: RC-aux for Latent World Models
cs.LG 2026-05 unverdicted novelty 6.0

RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
3D MRI Image Pretraining via Controllable 2D Slice Navigation Task
cs.CV 2026-05 unverdicted novelty 6.0

Converting 3D MRI volumes into action-conditioned 2D slice navigation sequences offers a complementary self-supervised pretraining signal for learning anatomical and spatial representations.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
Understanding Self-Supervised Learning via Latent Distribution Matching
cs.LG 2026-05 unverdicted novelty 6.0

Self-supervised learning is cast as latent distribution matching that aligns representations to a model while enforcing uniformity, unifying multiple SSL families and proving identifiability for predictive variants ev...
Text-Conditional JEPA for Learning Semantically Rich Visual Representations
cs.LG 2026-05 unverdicted novelty 6.0

TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
cs.CV 2026-05 unverdicted novelty 6.0

M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
Alethia: A Foundational Encoder for Voice Deepfakes
cs.SD 2026-04 unverdicted novelty 6.0

Alethia is a pretrained audio encoder using continuous embedding prediction and generative flow-matching reconstruction that outperforms existing speech foundation models on voice deepfake tasks with better robustness...
LA-Pose: Latent Action Pretraining Meets Pose Estimation
cs.CV 2026-04 unverdicted novelty 6.0

LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 6.0

SS3D pretrains an end-to-end 3D estimator on filtered YouTube-8M videos via SfM self-supervision, achieving improved zero-shot transfer and fine-tuning over prior baselines.
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 6.0

SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior ...
Only Brains Align with Brains: Cross-Region Alignment Patterns Expose Limits of Normative Models
q-bio.NC 2026-04 unverdicted novelty 6.0

Alignment pattern analysis reveals that models aligned to individual brain ROIs do not reproduce the stable cross-region alignment profiles observed across human subjects.
Exploring High-Order Self-Similarity for Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
Active World-Model with 4D-informed Retrieval for Exploration and Awareness
cs.CV 2026-04 unverdicted novelty 6.0

AW4RE is a generative world model that estimates action-conditioned observations via 4D-informed evidence retrieval, geometric support, and conditional completion to enable better exploration under partial observability.
Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
Grounded World Model for Semantically Generalizable Planning
cs.RO 2026-04 conditional novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
cs.CV 2026-04 unverdicted novelty 6.0

Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
Zero-shot World Models Are Developmentally Efficient Learners
cs.AI 2026-04 unverdicted novelty 6.0

A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
cs.RO 2026-04 unverdicted novelty 6.0

Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
Hierarchical Planning with Latent World Models
cs.LG 2026-04 unverdicted novelty 6.0

Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
cs.LG 2026-03 unverdicted novelty 6.0

LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.
World Action Models are Zero-shot Policies
cs.RO 2026-02 unverdicted novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
Video models are zero-shot learners and reasoners
cs.LG 2025-09 unverdicted novelty 6.0

Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
Towards Effective Theory of LLMs: A Representation Learning Approach
cs.LG 2026-05 unverdicted novelty 5.0

RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness
cs.CV 2026-05 unverdicted novelty 5.0

Pan-FM learns balanced representations across seven organs by adaptively masking dominant organs during pre-training, yielding stronger disease prediction and missing-organ robustness than single-organ or naive multim...
HaM-World: Soft-Hamiltonian World Models with Selective Memory for Planning
cs.AI 2026-05 unverdicted novelty 5.0

HaM-World integrates soft-Hamiltonian dynamics with selective state-space memory to reduce long-horizon rollout error by 55% and achieve top returns under 12 OOD perturbations on DeepMind Control Suite tasks.
Video Generation with Predictive Latents
cs.CV 2026-05 unverdicted novelty 5.0

PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
Embody4D: A Generalist 4D World Model for Embodied AI
cs.CV 2026-05 unverdicted novelty 5.0

Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.
Lifting Embodied World Models for Planning and Control
cs.CV 2026-04 unverdicted novelty 5.0

Composing a policy that maps 2D waypoints to joint actions with a frozen world model yields a lifted world model that achieves 3.8 times lower mean joint error than direct low-level search while being more compute-eff...
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 5.0

SS3D scales SfM-based self-supervision to ~100M frames from YouTube-8M using a multi-view signal proxy for filtering and a two-stage training schedule, achieving strong zero-shot transfer and better fine-tuning than p...
Sapiens2
cs.CV 2026-04 unverdicted novelty 5.0

Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...
Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
cs.RO 2026-04 unverdicted novelty 5.0

Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...
Stylistic-STORM (ST-STORM): Perceiving the Semantic Nature of Appearance
cs.CV 2026-04 unverdicted novelty 5.0

ST-STORM introduces a dual-branch SSL framework that disentangles semantic content from stylistic appearance using gated latent streams, JEPA for content invariance, and adversarial constraints for style capture.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 69 Pith papers · 27 internal anchors

[1]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, S...

[2]

The hidden uniform cluster prior in self-supervised learning

Mahmoud Assran, Randall Balestriero, Quentin Duval, Florian Bordes, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, and Nicolas Ballas. The hidden uniform cluster prior in self-supervised learning. arXiv preprint arXiv:2210.07277.

[3]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.

[4]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi “Jim” Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Ta...

[5]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visio...

[6]

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy Xi...

[7]

Perception Encoder: The best visual embeddings are not at the output of the network

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint ...

[8]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov...

[9]

Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, and Jianwei Yang. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818.

[10]

A short note about kinetics-600

Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340.

[11]

A short note on the kinetics-700 human action dataset

Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987.

[12]

João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kel...

[13]

Actionable models: Unsupervised offline reinforcement learning of robotic skills

Yevgen Chebotar, Karol Hausman, Yao Lu, Ted Xiao, Dmitry Kalashnikov, Jake Varley, Alex Irpan, Benjamin Eysenbach, Ryan Julian, Chelsea Finn, and Sergey Levine. Actionable models: Unsupervised offline reinforcement learning of robotic skills. arXiv preprint arXiv:2104.07749.

[14]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jia...

[15]

arXiv preprint arXiv:2504.13180

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp ...

[16]

Lost in time: A new temporal benchmark for videollms

Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees GM Snoek, and Yuki M Asano. Tvbench: Redesigning video-language evaluation. arXiv preprint arXiv:2410.07752.

[17]

International Journal of Computer Vision (IJCV) 130: 33–55

https://doi.org/10.1007/s11263-021-01531-2. Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Representations.

[18]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[19]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodied ...

[20]

Visual foresight: Model-based deep reinforcement learning for vision-based robotic control

Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568.

[21]

Scaling language-free visual representation learning

David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, and Saining Xie. Scaling language-free visual representation learning. arXiv preprint arXiv:2504.01017.

[22]

Multimodal autoregressive pre-training of large vision encoders

Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, and Alaaeldin El-Nouby. Multimodal autoregressive pre-training of large vision encoders. arXiv preprint arXiv:2411.14402.

[23]

Learning visual predictive models of physics for playing billiards

Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning visual predictive models of physics for playing billiards. arXiv preprint arXiv:1511.07404.

[24]

Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, and Bryan Catanzaro

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

[25]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[26]

Maskvit: Masked visual pre-training for video prediction

Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, and Li Fei-Fei. Maskvit: Masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894.

[27]

World Models

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122.

[28]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019a. Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International conference on m...

[29]

Temporal difference learning for model predictive control

Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955.

[30]

TD-MPC2: Scalable, Robust World Models for Continuous Control

Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828.

[31]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080.

[32]

Learning to achieve goals with belief state transformers

Edward S Hu, Kwangjun Ahn, Qinghua Liu, Haoran Xu, Manan Tomar, Ada Langford, Dinesh Jayaraman, Alex Lamb, and John Langford. Learning to achieve goals with belief state transformers. arXiv preprint arXiv:2410.23506.

[33]

Perceiver io: A general architecture for structured inputs & outputs

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795.

[34]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

[35]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

[36]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.

[37]

A path towards autonomous machine intelligence

Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review, 62(1):1–62.

[38]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Peiyuan Zhang, Kaichen Zhang, Fanyi Pu, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Accelerating the development of large multimodal models, March 2024a. https://github.com/EvolvingLMMs-Lab/lmms-eval. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang...

[39]

Tempcompass: Do video llms really understand videos?

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 26296–26306, 2024a. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world kn...

[40]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[41]

Keypoints into the future: Self-supervised correspondence in model-based reinforcement learning

Lucas Manuelli, Yunzhu Li, Pete Florence, and Russ Tedrake. Keypoints into the future: Self-supervised correspondence in model-based reinforcement learning. arXiv preprint arXiv:2009.05085.

[42]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213.

[43]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Lab...

[44]

Qwen2.5-VL Technical Report

Qwen Team, Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin, An Yang, Binyuan Hui, B...

[45]

An empirical study of autoregressive pre-training from videos

Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, and Jitendra Malik. An empirical study of autoregressive pre-training from videos. arXiv preprint arXiv:2501.05453.

[46]

Gaia-2: A controllable multi-view generative world model for autonomous driving

Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523.

[47]

Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models

Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models. arXiv preprint arXiv:2410.23266.

[48]

Learning from reward-free offline data: A case for planning with latent dynamics models

Vlad Sobal, Wancong Zhang, Kynghyun Cho, Randall Balestriero, Tim GJ Rudner, and Yann LeCun. Learning from reward-free offline data: A case for planning with latent dynamics models. arXiv preprint arXiv:2502.14819.

[49]

Video occupancy models

Manan Tomar, Philippe Hansen-Estruch, Philip Bachman, Alex Lamb, John Langford, Matthew E Taylor, and Sergey Levine. Video occupancy models. arXiv preprint arXiv:2407.09533.

[50]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arX...

[51]

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al

Mathurin Videau, Badr Youbi Idrissi, Daniel Haziza, Luca Wehrstedt, Jade Copet, Olivier Teytaud, and David Lopez-Paz. Meta Lingua: A minimal PyTorch LLM training library, 2024. https://github.com/facebookresearch/lingua. Shakti N Wadekar, Abhishek Chaurasia, Aman Chadha, and Eugenio Culurciello. The evolution of multimodal model architectures. arXiv prepr...

[52]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024a. ...

[53]

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023a. Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical...

[54]

Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding

Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, and Yuan Lin. Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding. arXiv preprint arXiv:2501.07888.

[55]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858.

[56]

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y ., Su, W., Shao, J., et al

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models. arXiv preprint arXiv:2407.12772, 2024a. Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, an...

[57]

CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020.

[58]

FLARE: Robot learning with implicit world modeling

Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, and Linxi Fan. FLARE: Robot learning with implicit world modeling. arXiv preprint arXiv:2505.15659.

[59]

DINO-WM: World models on pre-trained visual features enable zero-shot planning

Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983.

[60]

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792.

[61]

abbreviated

Throughout the Appendix, we refer to an “abbreviated” training recipe that corresponds to a 90,000-step training following the procedure of Bardes et al. (2024). There are a few key differences with the abbreviated recipe. The first is the learning rate: the abbreviated recipe begins with a linear warmup followed by a cosine decay. The second are the sched...
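
As a rough illustration of the warmup-then-cosine-decay shape described for the abbreviated recipe, the sketch below computes a learning rate for a given step. Only the 90,000-step length comes from the text; the warmup length and the start, peak, and final rates are placeholder assumptions, not the paper's settings, and the function name is hypothetical.

```python
import math

def warmup_cosine_lr(step, total_steps=90_000, warmup_steps=1_000,
                     start_lr=1e-5, peak_lr=1e-3, final_lr=1e-6):
    """Linear warmup from start_lr to peak_lr, then cosine decay to final_lr.

    Every default except total_steps is an illustrative assumption.
    """
    if step < warmup_steps:
        # Linear interpolation between the start and peak rates.
        frac = step / warmup_steps
        return start_lr + frac * (peak_lr - start_lr)
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

Calling `warmup_cosine_lr(step)` once per optimizer step and writing the result into the optimizer's learning rate is one common way to apply such a schedule.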

[62]

(2020), using the standard 16 × 16 patch size

All models are parameterized as vision transformers Dosovitskiy et al. (2020), using the standard 16 × 16 patch size. When scaling model size, we increase the encoder from a ViT-L (300M parameters) to a ViT-g (1B parameters), while the predictor size is kept fixed across all pre-training experiments. Table 12 Model architecture details. Family of encoder...

[63]

For the pick-and-place tasks we present two sub-goal images to the model in addition to the final goal

For the grasp and reach with object tasks, the model is shown a single goal image. For the pick-and-place tasks we present two sub-goal images to the model in addition to the final goal. The first goal image shows the object being grasped, the second goal image shows the object in the vicinity of the goal position. The model first optimizes actions with res...

[64]

The MLLM ingests the output embeddings of the vision encoder, which are projected to the hidden dimension of the LLM backbone using a projector module

setup. The MLLM ingests the output embeddings of the vision encoder, which are projected to the hidden dimension of the LLM backbone using a projector module. The projector is typically a 2-layer MLP. The MLLM is trained using a mix of image-text and video-text paired data, in a series of progressive training steps. To understand the impact of data scale, ...
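
To make the projector step concrete, here is a minimal pure-Python sketch of a 2-layer MLP that maps vision-encoder tokens to an LLM hidden size. The toy dimensions, the GELU activation, the initialization scale, and all function names are assumptions for illustration only; the excerpt states just that the projector is typically a 2-layer MLP.

```python
import math
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector x (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(v):
    """Tanh approximation of GELU (the activation choice is an assumption)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero biases for one linear layer."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def project_tokens(tokens, params):
    """Map each vision token to the LLM hidden size with a 2-layer MLP."""
    (w1, b1), (w2, b2) = params
    return [linear([gelu(h) for h in linear(t, w1, b1)], w2, b2) for t in tokens]

rng = random.Random(0)
vision_dim, llm_dim = 8, 16  # toy sizes, far smaller than any real model
params = (init_linear(vision_dim, llm_dim, rng), init_linear(llm_dim, llm_dim, rng))
tokens = [[rng.gauss(0.0, 1.0) for _ in range(vision_dim)] for _ in range(4)]
projected = project_tokens(tokens, params)  # 4 tokens, each of length llm_dim
```

The number of tokens is unchanged by the projector; only the per-token width changes, from the encoder's output dimension to the LLM's hidden dimension.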

[65]

We describe the training details in the following sections

for the scaling experiments. We describe the training details in the following sections. E.1 Processing Images and Videos as Input Since video question answering uses video instead of image inputs, the number of output visual tokens increases significantly compared to image question answering. If required, we can use pooling methods to reduce the number o...

[66]

To assess the ability of V-JEPA 2 to capture spatiotemporal details for VidQA, we compare to leading off-the-shelf image encoders

Baselines. To assess the ability of V-JEPA 2 to capture spatiotemporal details for VidQA, we compare to leading off-the-shelf image encoders. Specifically, we compare to DINOv2 (Oquab et al., 2023), SigLIP2 (Tschannen et al., 2025), and Perception Encoder (Bolya et al., 2025). DINOv2 is a self-supervised image model, while SigLIP2 and Perception Encoder are b...

[67]

Unlike Cho et al

as the backbone LLM. Unlike Cho et al. (2025), we do not use pooling; instead, we train V-JEPA 2 ViT-g384 using an MLP projector, leading to 288 tokens per frame. The training setup also consists of three progressive stages: Stage 1: aligning the MLP pooler with image captioning data; Stage 2: training on a mix of image-text captioning and QA data; and Stage
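
One way to arrive at the 288 figure is sketched below. It uses the 384-pixel crop and 16 × 16 patch size mentioned elsewhere on this page, and additionally assumes that each token spans two consecutive frames (a two-frame tubelet); that last point is an assumption not stated in this excerpt, and the function name is hypothetical.

```python
def tokens_per_frame(crop_size=384, patch_size=16, tubelet_frames=2):
    """Visual tokens per input frame with no pooling applied.

    tubelet_frames=2 is an assumption: each token is taken to cover a
    patch_size x patch_size region across two consecutive frames.
    """
    patches_per_side = crop_size // patch_size   # 384 // 16 = 24
    tokens_per_tubelet = patches_per_side ** 2   # 24 * 24 = 576
    return tokens_per_tubelet / tubelet_frames   # 576 / 2 = 288.0
```

Under these assumptions, the token count scales quadratically with crop size, which is why pooling is often considered for long videos.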

[68]

We scale up the data size to 88.5 million samples

training on video-text captioning and QA data. We scale up the data size to 88.5 million samples. Our setup uses PyTorch 2.5.1 and Perception LM training code (footnote 5), modified with the V-JEPA 2 encoder. We train on 512 H100 GPUs for Stage 2 and Stage 3 with a global batch size of 2048 and 1024 respectively. Details of the training hyperparams are provided in Table

[69]

5 https://github.com/facebookresearch/perception_models

Table 22 Data scaling training parameters.

Parameter: Values

Common parameters
Crop Size: 384
Video Frames per Second: 1
Sampling method: Uniform
Seed: 777

Stage 1
Steps: 16000
Warmup Steps: 96
Batch Size (global): 128
Learning Rate: 1e-4
Final Learning Rate: 1e-6
Weight Decay: 0.05
Max sequence length: 1920
S...
