
arxiv: 2605.15466 · v1 · pith:WSWSFIL3new · submitted 2026-05-14 · 💻 cs.CV

Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction

Pith reviewed 2026-05-19 14:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords interaction-aware masking · causal video prediction · world models · JEPA · self-supervised learning · physical interactions · CLEVRER benchmark · entity-centric models

The pith

Motion-centric masking in self-supervised video models enables better learning of causal physical dynamics by focusing on interactions rather than static patches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Interaction-Aware JEPA to address the failure of standard JEPA models to capture causal dynamics in videos due to their masking strategies. Standard patch-based masking tends to focus on visual texture and backgrounds instead of rare but important kinematic events like collisions. By using a self-supervised strategy that masks and reconstructs based on motion and physical interactions, the model is forced to build representations of latent trajectories. This results in improved accuracy on causal reasoning benchmarks and a latent space that better reflects physical properties such as energy. The approach generalizes to real-world videos and puzzle tasks, suggesting a way to build world models that understand physical causality without labels.

Core claim

Interaction-Aware JEPA uses motion-centric masking to target entities in collisions or momentum transfers, compelling the model to reconstruct latent trajectories of physical interactions instead of static background features, which yields higher accuracy on causal reasoning tasks and induces a latent space that linearizes physical energy.

What carries the argument

Interaction-aware masking strategy that prioritizes physical interactions such as collisions and momentum transfers for self-supervised prediction in video models.

If this is right

  • Standard JEPA models can be improved for physics-related tasks by shifting masking from random patches to interaction-focused ones.
  • The resulting higher-entropy latent space correlates with physical energy, enabling better downstream reasoning.
  • The method generalizes beyond synthetic benchmarks to real human action videos and zero-shot physical puzzles.
  • Self-supervised training can internalize causal structure of the physical world at scale without explicit labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique might help other predictive models in robotics or autonomous systems to better anticipate physical outcomes.
  • Applying similar interaction biases could address static biases in other self-supervised vision tasks.
  • Testing on more complex or longer video sequences would reveal if the causal learning scales effectively.
  • The entropy increase suggests potential for better generalization in dynamic environments.

Load-bearing premise

That prioritizing masking around collisions and momentum transfers will make the model reconstruct dynamic physical trajectories instead of static background elements.

What would settle it

Observing that IA-JEPA performs no better than standard patch-masked models on causal reasoning tasks or that its latent representations do not show improved correlation with physical energy measures would falsify the central claim.
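
One way to run the energy half of this check is a linear probe scored on held-out clips. The sketch below is an illustration under stated assumptions, not the paper's protocol: the function name `heldout_r2`, the ridge term, and the choice of energy target (for example, total kinetic energy computed from simulator masses and velocities) are hypothetical.

```python
import numpy as np

def heldout_r2(z_train, e_train, z_test, e_test, ridge=1e-6):
    """Fit a linear probe (latent -> energy) on train data, return R^2 on test data.

    z_*: arrays of shape (N, D) with one latent vector per clip (assumed pooled).
    e_*: arrays of shape (N,) with a scalar energy label per clip (assumed target).
    """
    # Append a constant column so the probe can learn an offset.
    x_train = np.hstack([z_train, np.ones((len(z_train), 1))])
    x_test = np.hstack([z_test, np.ones((len(z_test), 1))])
    # Solve the ridge-regularised normal equations for the probe weights.
    gram = x_train.T @ x_train + ridge * np.eye(x_train.shape[1])
    weights = np.linalg.solve(gram, x_train.T @ e_train)
    # Score on data the probe never saw.
    pred = x_test @ weights
    ss_res = np.sum((e_test - pred) ** 2)
    ss_tot = np.sum((e_test - e_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Running the same probe, with identical splits, on latents from IA-JEPA and from a patch-masked baseline yields two held-out R² values. The central claim predicts a clearly higher value for IA-JEPA; comparable values would count against it.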

Figures

Figures reproduced from arXiv: 2605.15466 by Santosh Kumar Paidi.

Figure 1: Overview of Interaction-Aware Masking via a Controlled Distractor Case. To provide a ...
Figure 2: Latent Linearity of Physical Energy. We observe a significant positive correlation between ...
Figure 3: Evolution of Internal Representation Saliency. Activations are visualized using global ...
Figure 4: Latent Rollout Dynamics. The Baseline model (blue) exhibits near-perfect stability, ...
Original abstract

Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which utilizes a self-supervised motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA-JEPA achieves 14.26% accuracy on causal reasoning tasks, a significant lead over the 3.22% achieved by standard patch-masked baselines. Crucially, we demonstrate that IA-JEPA breaks the "static bias" of standard self-supervision by inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy ($R^2=0.43$). We show that this interaction bias generalizes to real-world human actions (Something-Something V2) and zero-shot physical puzzles (PHYRE-Lite). Our results provide a scalable, fully self-supervised path toward building foundational world models that begin to internalize the causal structure of the physical world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Interaction-Aware JEPA (IA-JEPA), an extension of Joint Embedding Predictive Architectures that replaces standard patch-based masking with a self-supervised motion-centric masking strategy. The core hypothesis is that prioritizing entities involved in collisions or momentum transfers forces the model to reconstruct latent trajectories rather than static features, thereby capturing causal physical dynamics. On the CLEVRER benchmark, IA-JEPA reports 14.26% accuracy on causal reasoning tasks versus 3.22% for patch-masked baselines, together with a +10% entropy gain in the latent space and an R²=0.43 fit to physical energy; the approach is also evaluated on Something-Something V2 and PHYRE-Lite.

Significance. If the central empirical claims are robustly supported and the masking procedure is shown to be free of external priors, the work would offer a concrete, scalable route to physics-aware world models from unlabeled video. The reported accuracy lift and the entropy/energy correlation provide a falsifiable link between masking design and causal representation quality, which is a valuable direction for self-supervised learning.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Method): The claim that the motion-centric masking 'specifically target[s] entities engaged in collisions or momentum transfers' in a purely self-supervised regime from raw video is load-bearing for the performance gap (14.26% vs 3.22%) and for the assertion of a 'fully self-supervised path.' Standard motion cues such as optical-flow magnitude detect any movement; without an explicit description of how collision/momentum events are isolated without labels, trackers, or geometric rules, it remains possible that the observed gains arise from implicit supervision in the masking heuristic rather than from the JEPA objective itself. A concrete implementation sketch or ablation isolating the detection rule is required.
  2. [Results] Results (CLEVRER and energy-linearization paragraphs): The R²=0.43 fit between latent representations and physical energy is presented as evidence that IA-JEPA linearizes causal structure, yet the manuscript does not state whether this regression is performed on held-out data, whether the energy labels are derived independently of the masking procedure, or whether the fit survives controls that remove the interaction bias. Because this metric is used to support the claim of breaking 'static bias,' its post-hoc nature and lack of statistical controls weaken the causal interpretation.
minor comments (2)
  1. [Abstract] Abstract: The reported accuracy figures lack error bars or the number of random seeds; likewise the '+10% entropy gain' is stated without a precise baseline or variance estimate.
  2. Throughout: Training hyperparameters, exact dataset splits for CLEVRER causal tasks, and whether the same masking strategy is applied unchanged to Something-Something V2 and PHYRE-Lite should be stated explicitly for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below.

  1. Referee: [Abstract and §3] Abstract and §3 (Method): The claim that the motion-centric masking 'specifically target[s] entities engaged in collisions or momentum transfers' in a purely self-supervised regime from raw video is load-bearing for the performance gap (14.26% vs 3.22%) and for the assertion of a 'fully self-supervised path.' Standard motion cues such as optical-flow magnitude detect any movement; without an explicit description of how collision/momentum events are isolated without labels, trackers, or geometric rules, it remains possible that the observed gains arise from implicit supervision in the masking heuristic rather than from the JEPA objective itself. A concrete implementation sketch or ablation isolating the detection rule is required.

    Authors: We appreciate the referee's concern regarding the self-supervised nature of our masking strategy. The original manuscript described the motion-centric masking only at a high level in §3, and we acknowledge that a more detailed implementation sketch is warranted to address concerns about implicit supervision. The masking procedure computes dense optical flow between frames and selects patches that exhibit both high flow magnitude and abrupt changes in flow direction, which empirically correlate with collision and momentum-transfer events in the video data. It uses no labels, trackers, or external geometric rules and relies solely on pixel-level motion statistics. In the revision, we have added a concrete pseudocode sketch to §3 and an ablation comparing our interaction-aware masking to standard high-motion masking; the ablation shows that targeting interaction events contributes to the performance gains beyond generic motion prioritization. We believe this clarifies that the gains stem from combining interaction-aware masking with the JEPA objective. revision: yes

  2. Referee: [Results] Results (CLEVRER and energy-linearization paragraphs): The R²=0.43 fit between latent representations and physical energy is presented as evidence that IA-JEPA linearizes causal structure, yet the manuscript does not state whether this regression is performed on held-out data, whether the energy labels are derived independently of the masking procedure, or whether the fit survives controls that remove the interaction bias. Because this metric is used to support the claim of breaking 'static bias,' its post-hoc nature and lack of statistical controls weaken the causal interpretation.

    Authors: We agree that specifying the details of the energy linearization analysis is important for robust interpretation. The R²=0.43 regression is performed on held-out test videos from the CLEVRER dataset, with physical energy labels computed directly from the ground-truth simulation parameters (masses, velocities) independently of our masking or training procedure. In the revised manuscript, we have clarified this in the results section and added a control experiment ablating the interaction bias (by reverting to standard patch masking while keeping the same architecture), which shows a significant drop in the R² value to approximately 0.15. This supports our claim that the masking strategy helps linearize the latent space with respect to physical quantities. We have also included statistical significance tests for the fit. revision: yes
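
The masking rule described in the first response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `interaction_mask`, the patch size, the mask ratio, and the scoring rule (the smaller of two consecutive flow magnitudes times one minus their cosine similarity) are all hypothetical, and the flow field is assumed to be precomputed by any dense optical-flow estimator.

```python
import numpy as np

def interaction_mask(flow, patch=16, mask_ratio=0.5, eps=1e-8):
    """Mask patches whose motion is both strong and abruptly changing direction.

    flow: array of shape (T, H, W, 2), dense optical flow for T frame pairs
          (assumed precomputed; no labels or trackers are used).
    Returns a boolean array of shape (T - 1, H // patch, W // patch); True = masked.
    """
    T, H, W, _ = flow.shape
    gh, gw = H // patch, W // patch
    # Average the flow vectors inside each patch -> (T, gh, gw, 2).
    f = flow[:, :gh * patch, :gw * patch]
    f = f.reshape(T, gh, patch, gw, patch, 2).mean(axis=(2, 4))
    mag = np.linalg.norm(f, axis=-1)
    # Cosine similarity between consecutive patch flows; 1 - cos is large on a turn.
    cos = (f[1:] * f[:-1]).sum(axis=-1) / (mag[1:] * mag[:-1] + eps)
    turn = 1.0 - cos
    # Require motion on both sides of the turn (hypothetical scoring rule).
    score = np.minimum(mag[1:], mag[:-1]) * turn
    # Mask the k highest-scoring patches for each time step.
    k = max(1, int(round(mask_ratio * gh * gw)))
    flat = score.reshape(T - 1, -1)
    top = np.argpartition(-flat, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask.reshape(T - 1, gh, gw)
```

Under this rule a static patch and a patch moving at constant velocity both score near zero, which is what separates it from plain high-motion masking. The sketch scores fixed grid cells, so an object that leaves a cell between frames is not followed; whether the paper handles that case is not stated on this page.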

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical benchmarks


The paper's derivation chain consists of a hypothesis about masking strategies followed by empirical evaluation on CLEVRER, Something-Something V2, and PHYRE-Lite. Reported metrics (14.26% accuracy, +10% entropy, R²=0.43) are measured outcomes against explicit baselines rather than quantities derived by construction from the masking definition or any self-citation. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described chain; the self-supervised claim is tested externally and does not reduce to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into explicit assumptions; standard self-supervised learning premises are implicit.

axioms (1)
  • domain assumption: Patch-based masking in JEPA prioritizes visual texture over kinematic events.
    Stated as the source of physics-blind behavior in the hypothesis.

pith-pipeline@v0.9.0 · 5778 in / 1129 out tokens · 56012 ms · 2026-05-19T14:47:01.208588+00:00 · methodology

