arXiv preprint arXiv:2603.14482 (2026)

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mahmoud Assran, Koustuv Sinha, Michael Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes · 2026 · arXiv 2603.14482

Mixed citation behavior. Most common role is background (57%).

17 Pith papers citing it

citation-role summary

background 4 · method 3

citation-polarity summary

background 4 · use method 3

representative citing papers

Learning Visual Feature-Based World Models via Residual Latent Action

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

cs.CV · 2026-05-07 · conditional · novelty 7.0

LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on multiple datasets.

Latent State Design for World Models under Sufficiency Constraints

cs.AI · 2026-05-03 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

TrajPilot predicts candidate future trajectories from egocentric context and uses them to condition action prediction in an embedding space, outperforming VLM and planner baselines on Ego-Exo4D, Ego4D, and other datasets with gains increasing at longer horizons.

Latent Video Prediction Learns Better World Models

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

Latent prediction video models exhibit a distinct robustness profile across corruption, occlusion, fine-grained discrimination, and temporal sensitivity compared to other self-supervised video models when used as world models.

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.

3D MRI Image Pretraining via Controllable 2D Slice Navigation Task

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

Converting 3D MRI volumes into action-conditioned 2D slice navigation sequences offers a complementary self-supervised pretraining signal for learning anatomical and spatial representations.

Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach

cs.LG · 2026-05-20 · unverdicted · novelty 5.0

CoMET achieves strong multimodal classification performance by composing frozen modality encoders, PCA compression, and tabular foundation models without any training, reaching state-of-the-art on diverse benchmarks including large-scale hierarchical tasks.

Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

Empirical tests show that a factorized world model with hard-region-weighted latent dynamics improves ImageNet-100 by 5.92 points and SSv2 by 3.21 points over the baseline in mixed-dataset pretraining while staying within 0.3 points on Diving-48.

Representation Without Reward: A JEPA Audit for LLM Fine-Tuning

cs.LG · 2026-05-14 · conditional · novelty 5.0

An empirical audit of 22 JEPA-style training auxiliaries on Llama-3.2-1B fine-tuning for regex generation finds no statistically significant task improvement after multiple-testing correction, even when auxiliaries visibly alter hidden-state geometry.

Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.

Lifting Embodied World Models for Planning and Control

cs.CV · 2026-04-28 · unverdicted · novelty 5.0

Composing a policy that maps 2D waypoints to joint actions with a frozen world model yields a lifted world model that achieves 3.8 times lower mean joint error than direct low-level search while being more compute-efficient and generalizing to unseen environments.

Motif-Video 2B: Technical Report

cs.CV · 2026-04-14 · unverdicted · novelty 4.0 · 2 refs

Motif-Video 2B reaches 83.76% on VBench, outperforming a 14B-parameter model despite having 7x fewer parameters and far less training data, through shared cross-attention and a three-part backbone.

JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026

cs.CV · 2026-05-20 · unverdicted · novelty 3.0

JFAA freezes a JEPA future-prediction model, adds a lightweight probe and ensemble, and wins the 2026 EK-100 action anticipation challenge.

VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

cs.CV · 2026-05-20 · unverdicted · novelty 2.0

VISTA wins first place on the Ego4D Short-Term Object Interaction Anticipation challenge by combining spatial object proposals with temporal context via feature modulation and ROI fusion, followed by ensembling.

EgoExo-WM: Unlocking Exo Video for Ego World Models

cs.CV · 2026-05-14

citing papers explorer

Showing 17 of 17 citing papers.

Learning Visual Feature-Based World Models via Residual Latent Action · cs.CV · 2026-05-08 · unverdicted · none · ref 37
LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute · cs.CV · 2026-05-07 · conditional · none · ref 60
Latent State Design for World Models under Sufficiency Constraints · cs.AI · 2026-05-03 · unverdicted · none · ref 48
Being-H0.7: A Latent World-Action Model from Egocentric Videos · cs.RO · 2026-04-30 · unverdicted · none · ref 110
How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction · cs.CV · 2026-05-19 · unverdicted · none · ref 19
Latent Video Prediction Learns Better World Models · cs.CV · 2026-05-15 · unverdicted · none · ref 21
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models · cs.CV · 2026-05-08 · unverdicted · none · ref 29
3D MRI Image Pretraining via Controllable 2D Slice Navigation Task · cs.CV · 2026-05-07 · unverdicted · none · ref 19
Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach · cs.LG · 2026-05-20 · unverdicted · none · ref 21
Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives · cs.CV · 2026-05-16 · unverdicted · none · ref 35
Representation Without Reward: A JEPA Audit for LLM Fine-Tuning · cs.LG · 2026-05-14 · conditional · none · ref 5
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models · cs.CV · 2026-05-07 · unverdicted · none · ref 38
Lifting Embodied World Models for Planning and Control · cs.CV · 2026-04-28 · unverdicted · none · ref 26
Motif-Video 2B: Technical Report · cs.CV · 2026-04-14 · unverdicted · none · ref 25 · 2 links
JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026 · cs.CV · 2026-05-20 · unverdicted · none · ref 8
VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026 · cs.CV · 2026-05-20 · unverdicted · none · ref 10
EgoExo-WM: Unlocking Exo Video for Ego World Models · cs.CV · 2026-05-14 · unreviewed · ref 55
