An image is worth 16x16 words: Transformers for image recognition at scale
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives
Empirical tests show that a factorized world model with hard-region-weighted latent dynamics improves ImageNet-100 by 5.92 points and SSv2 by 3.21 points over the baseline in mixed-dataset pretraining, while staying within 0.3 points of the baseline on Diving-48.