Video Swin Transformer
2 Pith papers cite this work.
Fields: cs.CV (2)
Years: 2026 (2)
Representative citing papers:
- Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives
  Empirical tests show that a factorized world model with hard-region-weighted latent dynamics improves ImageNet-100 by 5.92 points and SSv2 by 3.21 points over the baseline in mixed-dataset pretraining, while staying within 0.3 points of the baseline on Diving-48.
- TrajTok: Learning Trajectory Tokens enables better Video Understanding