V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Adrien Bardes; Amir Bar; Koustuv Sinha; Lorenzo Mur-Labadia; Matthew Muckley; Mido Assran; Mike Rabbat; Nicolas Ballas; Yann LeCun

arxiv: 2603.14482 · v3 · pith:2KLVNDVXnew · submitted 2026-03-15 · 💻 cs.CV

keywords: dense · v-jepa · model · self-supervised · training · across · anticipation · global

We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.
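As a rough illustration of the first two components described in the abstract, the sketch below computes a token-level loss in which both masked and visible tokens contribute, then averages it over several encoder layers. It is not the authors' implementation: the L1 distance, the `visible_weight` parameter, the function names, and the uniform averaging across layers are all assumptions made for illustration.

```python
import numpy as np


def dense_predictive_loss(pred, target, mask, visible_weight=0.5):
    """Hypothetical dense loss over all tokens.

    pred, target: arrays of shape (num_tokens, dim)
    mask: boolean array of shape (num_tokens,), True where the token was masked
    visible_weight: assumed relative weight of the visible-token term
    """
    # Per-token L1 distance between predicted and target features (assumed metric).
    per_token = np.abs(pred - target).mean(axis=-1)
    masked_term = per_token[mask].mean() if mask.any() else 0.0
    visible_term = per_token[~mask].mean() if (~mask).any() else 0.0
    # Both groups of tokens contribute to the training signal.
    return float(masked_term + visible_weight * visible_term)


def deep_supervised_loss(preds_per_layer, targets_per_layer, mask, visible_weight=0.5):
    """Average the dense loss over several intermediate layers (assumed uniform weights)."""
    losses = [
        dense_predictive_loss(p, t, mask, visible_weight)
        for p, t in zip(preds_per_layer, targets_per_layer)
    ]
    return float(np.mean(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_tokens, dim, num_layers = 8, 4, 3
    mask = np.array([True, False] * (num_tokens // 2))
    preds = [rng.normal(size=(num_tokens, dim)) for _ in range(num_layers)]
    targets = [rng.normal(size=(num_tokens, dim)) for _ in range(num_layers)]
    print(deep_supervised_loss(preds, targets, mask))
```

Setting `visible_weight=0.0` in this sketch recovers a loss computed on masked tokens only, which makes the contribution of the visible-token term easy to isolate.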

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning Visual Feature-Based World Models via Residual Latent Action
cs.CV 2026-05 unverdicted novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
cs.CV 2026-05 conditional novelty 7.0

LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on mul...

Latent State Design for World Models under Sufficiency Constraints
cs.AI 2026-05 unverdicted novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO 2026-04 unverdicted novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction
cs.CV 2026-05 unverdicted novelty 6.0

TrajPilot predicts candidate future trajectories from egocentric context and uses them to condition action prediction in an embedding space, outperforming VLM and planner baselines on Ego-Exo4D, Ego4D, and other datas...

Latent Video Prediction Learns Better World Models
cs.CV 2026-05 unverdicted novelty 6.0

Latent prediction video models exhibit a distinct robustness profile across corruption, occlusion, fine-grained discrimination, and temporal sensitivity compared to other self-supervised video models when used as worl...

EgoExo-WM: Unlocking Exo Video for Ego World Models
cs.CV 2026-05 unverdicted novelty 6.0

Converting exocentric video to egocentric format via body-pose extraction and kinematics prior enables training of action-conditioned egocentric world models that improve prediction quality and goal-directed planning.

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.

3D MRI Image Pretraining via Controllable 2D Slice Navigation Task
cs.CV 2026-05 unverdicted novelty 6.0

Converting 3D MRI volumes into action-conditioned 2D slice navigation sequences offers a complementary self-supervised pretraining signal for learning anatomical and spatial representations.

Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach
cs.LG 2026-05 unverdicted novelty 5.0

CoMET achieves strong multimodal classification performance by composing frozen modality encoders, PCA compression, and tabular foundation models without any training, reaching state-of-the-art on diverse benchmarks i...

Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives
cs.CV 2026-05 unverdicted novelty 5.0

Empirical tests show that factorized world-model with hard-region-weighted latent dynamics improves ImageNet-100 by 5.92 and SSv2 by 3.21 points over baseline in mixed-dataset pretraining while staying within 0.3 poin...

Representation Without Reward: A JEPA Audit for LLM Fine-Tuning
cs.LG 2026-05 conditional novelty 5.0

An empirical audit of 22 JEPA-style training auxiliaries on Llama-3.2-1B fine-tuning for regex generation finds no statistically significant task improvement after multiple-testing correction, even when auxiliaries vi...

Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models
cs.CV 2026-05 unverdicted novelty 5.0

Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.

Lifting Embodied World Models for Planning and Control
cs.CV 2026-04 unverdicted novelty 5.0

Composing a policy that maps 2D waypoints to joint actions with a frozen world model yields a lifted world model that achieves 3.8 times lower mean joint error than direct low-level search while being more compute-eff...

Motif-Video 2B: Technical Report
cs.CV 2026-04 unverdicted novelty 5.0

Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.

Motif-Video 2B: Technical Report
cs.CV 2026-04 unverdicted novelty 4.0

Motif-Video 2B reaches 83.76% on VBench, outperforming a 14B-parameter model with 7x fewer parameters and far less training data through shared cross-attention and a three-part backbone.

JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026
cs.CV 2026-05 unverdicted novelty 3.0

JFAA freezes a JEPA future-prediction model, adds a lightweight probe and ensemble, and wins the 2026 EK-100 action anticipation challenge.

VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026
cs.CV 2026-05 unverdicted novelty 2.0

VISTA wins first place on the Ego4D Short-Term Object Interaction Anticipation challenge by combining spatial object proposals with temporal context via feature modulation and ROI fusion, followed by ensembling.