InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
Masked autoencoders as spatiotemporal learners.Advances in neural information processing systems, 35:35946–35958,
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.
In moderate-sized fine-grained bioacoustics, pretraining scale of masked autoencoders on diverse general audio dominates over domain-specific objectives or data curation for transfer performance.
citing papers explorer
-
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
-
Recurrent Video Masked Autoencoders
RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.
-
Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study
In moderate-sized fine-grained bioacoustics, pretraining scale of masked autoencoders on diverse general audio dominates over domain-specific objectives or data curation for transfer performance.