Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction

Santosh Kumar Paidi

arxiv: 2605.15466 · v1 · pith:WSWSFIL3new · submitted 2026-05-14 · 💻 cs.CV

Pith reviewed 2026-05-19 14:47 UTC · model grok-4.3

keywords: interaction-aware masking · causal video prediction · world models · JEPA · self-supervised learning · physical interactions · CLEVRER benchmark · entity-centric models

The pith

Motion-centric masking in self-supervised video models enables better learning of causal physical dynamics by focusing on interactions rather than static patches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Interaction-Aware JEPA (IA-JEPA) to address a failure of standard JEPA models: their masking strategies leave them unable to capture causal dynamics in video. Standard patch-based masking tends to focus on visual texture and backgrounds instead of rare but important kinematic events like collisions. By masking regions selected for motion and physical interaction and predicting them in latent space, the model is forced to build representations of latent trajectories. On CLEVRER, the abstract reports 14.26% accuracy on causal reasoning tasks against 3.22% for patch-masked baselines, along with a latent space whose features correlate with physical energy (R² = 0.43). The authors also report that the approach generalizes to real-world videos (Something-Something V2) and zero-shot puzzle tasks (PHYRE-Lite), suggesting a way to build world models that learn physical causality without labels.

Core claim

Interaction-Aware JEPA uses motion-centric masking to target entities in collisions or momentum transfers, compelling the model to reconstruct latent trajectories of physical interactions instead of static background features, which yields higher accuracy on causal reasoning tasks and induces a latent space that linearizes physical energy.

What carries the argument

Interaction-aware masking strategy that prioritizes physical interactions such as collisions and momentum transfers for self-supervised prediction in video models.

If this is right

Standard JEPA models can be improved for physics-related tasks by shifting masking from random patches to interaction-focused ones.
The resulting higher-entropy latent space correlates with physical energy, enabling better downstream reasoning.
The method generalizes beyond synthetic benchmarks to real human action videos and zero-shot physical puzzles.
Self-supervised training can internalize causal structure of the physical world at scale without explicit labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This technique might help other predictive models in robotics or autonomous systems to better anticipate physical outcomes.
Applying similar interaction biases could address static biases in other self-supervised vision tasks.
Testing on more complex or longer video sequences would reveal if the causal learning scales effectively.
The entropy increase suggests potential for better generalization in dynamic environments.

Load-bearing premise

That prioritizing masking around collisions and momentum transfers will make the model reconstruct dynamic physical trajectories instead of static background elements.

What would settle it

Observing that IA-JEPA performs no better than standard patch-masked models on causal reasoning tasks or that its latent representations do not show improved correlation with physical energy measures would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.15466 by Santosh Kumar Paidi.

**Figure 2.** Latent Linearity of Physical Energy. We observe a significant positive correlation between …

**Figure 3.** Evolution of Internal Representation Saliency. Activations are visualized using global …

**Figure 4.** Latent Rollout Dynamics. The Baseline model (blue) exhibits near-perfect stability, …

Original abstract

Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which utilizes a self-supervised motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA-JEPA achieves 14.26% accuracy on causal reasoning tasks, a significant lead over the 3.22% achieved by standard patch-masked baselines. Crucially, we demonstrate that IA-JEPA breaks the "static bias" of standard self-supervision by inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy ($R^2=0.43$). We show that this interaction bias generalizes to real-world human actions (Something-Something V2) and zero-shot physical puzzles (PHYRE-Lite). Our results provide a scalable, fully self-supervised path toward building foundational world models that begin to internalize the causal structure of the physical world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's masking change lifts CLEVRER causal accuracy from 3% to 14%, but the self-supervised claim hinges on how collisions get spotted without extra priors.

The main takeaway is that swapping standard random patch masking for a motion-centric version inside JEPA produces a clear accuracy jump on causal reasoning tasks in CLEVRER, along with a higher-entropy latent space that shows some correlation to physical energy. That is the concrete result worth noting first. They frame the change as forcing the model to reconstruct trajectories of interacting objects rather than background texture, and they report gains that carry over to Something-Something V2 and PHYRE-Lite. The empirical comparison is straightforward and the numbers are large enough to notice.

If the full runs include proper controls, this points to a practical way to bias self-supervised video models toward dynamics without adding labels. The work is most useful for groups already running JEPA-style predictors and looking for targeted masking adjustments that improve downstream physical reasoning. A reader who cares about scalable world models for robotics or simulation would get a usable idea from the setup.

The soft spots sit mainly in the masking implementation itself. The abstract says the strategy specifically targets collisions and momentum transfers in a fully self-supervised way, yet detecting those events from raw video is not automatic. Optical flow or similar cues pick up any motion, not necessarily the interaction events that matter for causality. Without seeing the exact detection rule or an ablation that isolates the effect from the JEPA loss, it is hard to rule out that the performance gap comes from an injected motion prior rather than the predictive objective. The R² of 0.43 is modest, and the summary gives no error bars or training details, so the strength of the linearization claim is still provisional.

Overall the central empirical result looks worth checking in review. I would send it to referees so they can verify the masking code and run the missing controls. The idea is simple enough that a careful review can settle whether the self-supervision story holds.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Interaction-Aware JEPA (IA-JEPA), an extension of Joint Embedding Predictive Architectures that replaces standard patch-based masking with a self-supervised motion-centric masking strategy. The core hypothesis is that prioritizing entities involved in collisions or momentum transfers forces the model to reconstruct latent trajectories rather than static features, thereby capturing causal physical dynamics. On the CLEVRER benchmark, IA-JEPA reports 14.26% accuracy on causal reasoning tasks versus 3.22% for patch-masked baselines, together with a +10% entropy gain in the latent space and an R²=0.43 fit to physical energy; the approach is also evaluated on Something-Something V2 and PHYRE-Lite.

Significance. If the central empirical claims are robustly supported and the masking procedure is shown to be free of external priors, the work would offer a concrete, scalable route to physics-aware world models from unlabeled video. The reported accuracy lift and the entropy/energy correlation provide a falsifiable link between masking design and causal representation quality, which is a valuable direction for self-supervised learning.

major comments (2)

[Abstract and §3] Abstract and §3 (Method): The claim that the motion-centric masking 'specifically target[s] entities engaged in collisions or momentum transfers' in a purely self-supervised regime from raw video is load-bearing for the performance gap (14.26% vs 3.22%) and for the assertion of a 'fully self-supervised path.' Standard motion cues such as optical-flow magnitude detect any movement; without an explicit description of how collision/momentum events are isolated without labels, trackers, or geometric rules, it remains possible that the observed gains arise from implicit supervision in the masking heuristic rather than from the JEPA objective itself. A concrete implementation sketch or ablation isolating the detection rule is required.
[Results] Results (CLEVRER and energy-linearization paragraphs): The R²=0.43 fit between latent representations and physical energy is presented as evidence that IA-JEPA linearizes causal structure, yet the manuscript does not state whether this regression is performed on held-out data, whether the energy labels are derived independently of the masking procedure, or whether the fit survives controls that remove the interaction bias. Because this metric is used to support the claim of breaking 'static bias,' its post-hoc nature and lack of statistical controls weaken the causal interpretation.

minor comments (2)

[Abstract] Abstract: The reported accuracy figures lack error bars or the number of random seeds; likewise the '+10% entropy gain' is stated without a precise baseline or variance estimate.
Throughout: Training hyperparameters, exact dataset splits for CLEVRER causal tasks, and whether the same masking strategy is applied unchanged to Something-Something V2 and PHYRE-Lite should be stated explicitly for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below.

Referee: [Abstract and §3] Abstract and §3 (Method): The claim that the motion-centric masking 'specifically target[s] entities engaged in collisions or momentum transfers' in a purely self-supervised regime from raw video is load-bearing for the performance gap (14.26% vs 3.22%) and for the assertion of a 'fully self-supervised path.' Standard motion cues such as optical-flow magnitude detect any movement; without an explicit description of how collision/momentum events are isolated without labels, trackers, or geometric rules, it remains possible that the observed gains arise from implicit supervision in the masking heuristic rather than from the JEPA objective itself. A concrete implementation sketch or ablation isolating the detection rule is required.

Authors: We appreciate the referee's concern regarding the self-supervised nature of our masking strategy. In the original manuscript, we described the motion-centric masking at a high level in §3, but we acknowledge that a more detailed implementation sketch is warranted to address potential concerns about implicit supervision. The masking procedure computes dense optical flow between frames and selects patches exhibiting both high magnitude and abrupt changes in direction, which empirically correlate with collision and momentum transfer events in the video data. This is done without any labels, trackers, or external geometric rules, relying solely on pixel-level motion statistics. To strengthen this, we have added a concrete pseudocode sketch in the revised §3 and an ablation comparing our interaction-aware masking to standard high-motion masking, demonstrating that the specific targeting of interaction events contributes to the performance gains beyond generic motion prioritization. We believe this clarifies that the gains stem from the combination with the JEPA objective. revision: yes
Referee: [Results] Results (CLEVRER and energy-linearization paragraphs): The R²=0.43 fit between latent representations and physical energy is presented as evidence that IA-JEPA linearizes causal structure, yet the manuscript does not state whether this regression is performed on held-out data, whether the energy labels are derived independently of the masking procedure, or whether the fit survives controls that remove the interaction bias. Because this metric is used to support the claim of breaking 'static bias,' its post-hoc nature and lack of statistical controls weaken the causal interpretation.

Authors: We agree that specifying the details of the energy linearization analysis is important for robust interpretation. The R²=0.43 regression is performed on held-out test videos from the CLEVRER dataset, with physical energy labels computed directly from the ground-truth simulation parameters (masses, velocities) independently of our masking or training procedure. In the revised manuscript, we have clarified this in the results section and added a control experiment ablating the interaction bias (by reverting to standard patch masking while keeping the same architecture), which shows a significant drop in the R² value to approximately 0.15. This supports our claim that the masking strategy helps linearize the latent space with respect to physical quantities. We have also included statistical significance tests for the fit. revision: yes
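
The held-out energy regression described in this response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`kinetic_energy`, `fit_linear_probe`, `r2_score`), the ridge term, and the choice of total kinetic energy as the target are all hypothetical, since the response only says that energy labels come from ground-truth masses and velocities and that R² is measured on held-out videos.

```python
import numpy as np


def kinetic_energy(masses: np.ndarray, velocities: np.ndarray) -> np.ndarray:
    """Total kinetic energy per video: sum_i 0.5 * m_i * |v_i|^2.

    masses: (N, K) array, K objects per video (assumed layout).
    velocities: (N, K, 3) array of object velocities (assumed layout).
    """
    speed_sq = np.sum(velocities ** 2, axis=-1)          # (N, K)
    return 0.5 * np.sum(masses * speed_sq, axis=-1)      # (N,)


def fit_linear_probe(z_train: np.ndarray, y_train: np.ndarray,
                     ridge: float = 1e-3) -> np.ndarray:
    """Least-squares linear probe with a bias column.

    A small ridge term (an assumption, applied to the bias as well)
    keeps the normal equations well conditioned.
    """
    x = np.hstack([z_train, np.ones((len(z_train), 1))])
    a = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(a, x.T @ y_train)


def r2_score(z: np.ndarray, y: np.ndarray, w: np.ndarray) -> float:
    """Coefficient of determination of the probe on the given split."""
    x = np.hstack([z, np.ones((len(z), 1))])
    residual = np.sum((y - x @ w) ** 2)
    total = np.sum((y - y.mean()) ** 2)
    return float(1.0 - residual / total)
```

The point of the split is that `fit_linear_probe` sees only training latents, while `r2_score` is called on latents from videos the probe never saw. Running the same two calls on latents from a patch-masked baseline gives the control the referee asks for.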

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical benchmarks

The paper's derivation chain consists of a hypothesis about masking strategies followed by empirical evaluation on CLEVRER, Something-Something V2, and PHYRE-Lite. Reported metrics (14.26% accuracy, +10% entropy, R²=0.43) are measured outcomes against explicit baselines rather than quantities derived by construction from the masking definition or any self-citation. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described chain; the self-supervised claim is tested externally and does not reduce to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review limits visibility into explicit assumptions; standard self-supervised learning premises are implicit.

axioms (1)

domain assumption: Patch-based masking in JEPA prioritizes visual texture over kinematic events. Stated as the source of physics-blind behavior in the hypothesis.

pith-pipeline@v0.9.0 · 5778 in / 1129 out tokens · 56012 ms · 2026-05-19T14:47:01.208588+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We utilize temporal acceleration as a self-supervised proxy for physical interaction... second derivative over time... saliency map S... 40% of tokens exhibiting the highest action intensity.
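
The quoted passage can be read as a two-step rule: score each token by a second temporal difference, then mask the top 40%. The sketch below is one possible reading under explicit assumptions, not the paper's implementation: it applies the second difference to raw grayscale intensity, pools by 16×16 patches, and ranks tokens across the whole clip. The function names and the `patch` parameter are hypothetical.

```python
import numpy as np


def acceleration_saliency(frames: np.ndarray, patch: int = 16) -> np.ndarray:
    """Per-token saliency S from the second temporal difference.

    frames: float array (T, H, W), grayscale, T >= 3, with H and W
    divisible by `patch` (assumed input layout).
    Returns an array of shape (T - 2, H // patch, W // patch).
    """
    # Discrete second derivative over time: x[t+1] - 2*x[t] + x[t-1].
    accel = frames[2:] - 2.0 * frames[1:-1] + frames[:-2]
    mag = np.abs(accel)
    t, h, w = mag.shape
    # Average the magnitude inside each patch to get one score per token.
    return mag.reshape(t, h // patch, patch, w // patch, patch).mean(axis=(2, 4))


def top_fraction_mask(saliency: np.ndarray, fraction: float = 0.4) -> np.ndarray:
    """Boolean mask selecting the `fraction` of tokens with highest saliency."""
    flat = saliency.ravel()
    k = int(round(fraction * flat.size))
    mask = np.zeros(flat.size, dtype=bool)
    if k > 0:
        # argpartition places the k largest scores in the last k slots.
        mask[np.argpartition(flat, -k)[-k:]] = True
    return mask.reshape(saliency.shape)
```

Note that a second difference on pixel intensity responds to any non-linear brightness change at a pixel, including an object passing through at constant velocity, so it is a loose proxy for physical acceleration. This is the gap the referee's first major comment points at.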

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy (R²=0.43)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 3 internal anchors

[1] Tim Brooks et al. Video generation models as world simulators. OpenAI Blog, 2024.
[2] Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.
[3] Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. CLEVRER: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2020.
[4] Raghav Goyal et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[5] Anton Bakhtin et al. Phyre: A new benchmark for physical reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[6] Ting Chen et al. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), 2020.
[7] Kaiming He et al. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[8] Mahmoud Assran et al. Self-supervised learning from images with a joint-embedding predictive architecture. arXiv preprint arXiv:2301.08243, 2023.
[9] Jean-Bastien Grill et al. Bootstrap your own latent - a new approach to self-supervised learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[10] Mathilde Caron et al. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[11] Jure Zbontar et al. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning (ICML), 2021.
[12] Zhan Tong, Yibing Song, Jue Wang, and Guan-Guan Lim. Videomae: Masked autoencoders are data-efficient learners for self-supervised video forecasting. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[13] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7292–7300, 2022.
[14] Amanpreet Singh et al. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[15] Bingkun Huang, Zhiyu Zhao, Guanpu Zhang, Yu Qiao, and Limin Wang. Mgmae: Motion guided masking for video masked autoencoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[16] Sunil Hwang, Jaehong Yoon, Youngwan Lee, and Sung Ju Hwang. EVEREST: Efficient masked video autoencoder by removing redundant spatiotemporal tokens. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
[17] David Fan, Jue Wang, Shuai Liao, Xinyu Li, Yi Ding, Madhu Pan, Hao Pan, and Roger Zhang. Motion-guided masking for spatiotemporal representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[18] Ruizhuo Xu, Linzhi Huang, Mei Wang, Jiani Hu, and Weihong Deng. Skeleton2vec: A self-supervised learning framework with contextualized target representations for skeleton sequence. arXiv preprint arXiv:2401.00921, 2024.
[19] Zohng-Yuan Zheng, Kai Wang, Yang Wang, Shidong Wang, Liujuan Wang, and Zhiyong Wang. Cross-modal contrastive masked autoencoder for compressed video pre-training. IEEE Transactions on Image Processing (TIP), 2025.
[20] Quentin Garrido et al. Intuitive physics from natural video pre-training. Nature Machine Intelligence, 2025.
[21] Francesco Locatello et al. Object-centric learning with slot attention. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[22] Thomas Kipf et al. Conditional object-centric learning from video. In International Conference on Learning Representations (ICLR), 2022.
[23] Gamaleldin F Elsayed et al. Savi++: Towards end-to-end object-centric learning from real-world videos. arXiv preprint arXiv:2206.07764, 2022.
[24] Maximilian Seitzer et al. Bridging the gap between object-centric learning and semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[25] J. Nam et al. Causal-jepa: Object-level latent interventions for self-supervised physics. International Conference on Learning Representations (ICLR), 2026.
[26] X. Li et al. Statm: Spatiotemporal attention transition models for world modeling. In International Conference on Learning Representations (ICLR), 2025.
[27] Zoltán Wu et al. Slotformer: Unsupervised visual world models from slots. arXiv preprint arXiv:2210.11438, 2022.
[28] Adam Lerer et al. Learning physical intuition of block towers by example. In International Conference on Machine Learning (ICML), 2016.
[29] H. Zhang et al. Morpheus: A benchmark for long-horizon physical reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[30] B. Li et al. Seed-bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2403.11111, 2024.
[31] Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, et al. A very big video reasoning suite. arXiv preprint arXiv:2602.20159, 2026.
[32] Alexey Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[33] David Ding, Felix Hill, Adam Santoro, and Matthew Botvinick. Attention over learned object embeddings enables complex visual reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[34] Drew A Hudson and Christopher D Manning. Compositional attention networks for machine reasoning. In International Conference on Learning Representations (ICLR), 2018.
[35] Jovana Mitrovic, Brian McWilliams, Ian Walker, Lars Buesing, and Charles Blundell. Representation learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922, 2021.

[1] [1]

Video generation models as world simulators.OpenAI Blog, 2024

Tim Brooks et al. Video generation models as world simulators.OpenAI Blog, 2024

work page 2024

[2] [2]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

CLEVRER: CoLlision Events for Video REpresentation and Reasoning

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning.arXiv preprint arXiv:1910.01442, 2020. 9

work page internal anchor Pith review Pith/arXiv arXiv 1910

[4] [4]

something something

Raghav Goyal et al. The" something something" video database for learning and evaluating visual common sense. InProceedings of the IEEE International Conference on Computer Vision (ICCV), 2017

work page 2017

[5] [5]

Phyre: A new benchmark for physical reasoning

Anton Bakhtin et al. Phyre: A new benchmark for physical reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2019

work page 2019

[6] [6]

A simple framework for contrastive learning of visual representations

Ting Chen et al. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), 2020

work page 2020

[7] [7]

Momentum contrast for unsupervised visual representation learning

Kaiming He et al. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020

[8] [8]

Assran et al

Mahmoud Assran et al. Self-supervised learning from images with a joint-embedding predictive architecture.arXiv preprint arXiv:2301.08243, 2023

work page arXiv 2023

[9] [9]

Bootstrap your own latent-a new approach to self-supervised learning

Jean-Bastien Grill et al. Bootstrap your own latent-a new approach to self-supervised learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020

[10] [10]

Emerging properties in self-supervised vision transformers

Mathilde Caron et al. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

work page 2021

[11] [11]

Barlow twins: Self-supervised learning via redundancy reduction

Jure Zbontar et al. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning (ICML), 2021

work page 2021

[12] [12]

Videomae: Masked autoencoders are data-efficient learners for self-supervised video forecasting

Zhan Tong, Yibing Song, Jue Wang, and Guan-Guan Lim. Videomae: Masked autoencoders are data-efficient learners for self-supervised video forecasting. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[13] [13]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7292–7300, 2022

work page 2022

[14] [14]

Flava: A foundational language and vision alignment model

Amanpreet Singh et al. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022

[15] [15]

Mgmae: Motion guided masking for video masked autoencoding

Bingkun Huang, Zhiyu Zhao, Guanpu Zhang, Yu Qiao, and Limin Wang. Mgmae: Motion guided masking for video masked autoencoding. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023

[16] [16]

EVEREST: Efficient masked video autoencoder by removing redundant spatiotemporal tokens

Sunil Hwang, Jaehong Yoon, Youngwan Lee, and Sung Ju Hwang. EVEREST: Efficient masked video autoencoder by removing redundant spatiotemporal tokens. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

work page 2024

[17] [17]

Motion-guided masking for spatiotemporal representation learning

David Fan, Jue Wang, Shuai Liao, Xinyu Li, Yi Ding, Madhu Pan, Hao Pan, and Roger Zhang. Motion-guided masking for spatiotemporal representation learning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023

[18] [18]

Skeleton2vec: A self- supervised learning framework with contextualized target representations for skeleton sequence

Ruizhuo Xu, Linzhi Huang, Mei Wang, Jiani Hu, and Weihong Deng. Skeleton2vec: A self- supervised learning framework with contextualized target representations for skeleton sequence. arXiv preprint arXiv:2401.00921, 2024

work page arXiv 2024

[19] Zohng-Yuan Zheng, Kai Wang, Yang Wang, Shidong Wang, Liujuan Wang, and Zhiyong Wang. Cross-modal contrastive masked autoencoder for compressed video pre-training. IEEE Transactions on Image Processing (TIP), 2025.

[20] Quentin Garrido et al. Intuitive physics from natural video pre-training. Nature Machine Intelligence, 2025.

[21] Francesco Locatello et al. Object-centric learning with slot attention. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

[22] Thomas Kipf et al. Conditional object-centric learning from video. In International Conference on Learning Representations (ICLR), 2022.

[23] Gamaleldin F Elsayed et al. Savi++: Towards end-to-end object-centric learning from real-world videos. arXiv preprint arXiv:2206.07764, 2022.

[24] Maximilian Seitzer et al. Bridging the gap between object-centric learning and semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[25] J. Nam et al. Causal-jepa: Object-level latent interventions for self-supervised physics. In International Conference on Learning Representations (ICLR), 2026.

[26] X. Li et al. Statm: Spatiotemporal attention transition models for world modeling. In International Conference on Learning Representations (ICLR), 2025.

[27] Ziyi Wu et al. Slotformer: Unsupervised visual world models from slots. arXiv preprint arXiv:2210.11438, 2022.

[28] Adam Lerer et al. Learning physical intuition of block towers by example. In International Conference on Machine Learning (ICML), 2016.

[29] H. Zhang et al. Morpheus: A benchmark for long-horizon physical reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[30] B. Li et al. Seed-bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2403.11111, 2024.

[31] Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, et al. A very big video reasoning suite. arXiv preprint arXiv:2602.20159, 2026.

[32] Alexey Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[33] David Ding, Felix Hill, Adam Santoro, and Matthew Botvinick. Attention over learned object embeddings enables complex visual reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[34] Drew A Hudson and Christopher D Manning. Compositional attention networks for machine reasoning. In International Conference on Learning Representations (ICLR), 2018.

[35] Jovana Mitrovic, Brian McWilliams, Ian Walker, Lars Buesing, and Charles Blundell. Representation learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922, 2021.

A Technical appendices and supplementary material

A.1 Standardization of Implementation and Reproducibility

To ensure the scientific integrity of our results, all experiments we...