VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training
9 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: 9 unverdicted.
citing papers explorer
- SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition
  SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames and modalities.
- Mask World Model: Predicting What Matters for Robust Robot Policy Learning
  Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.
- SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection
  SpecSem-Net integrates Fourier-based spectral filtering with semantic-guided gated merging to detect AI-generated videos, reporting 87.25% accuracy on a new benchmark of five commercial generators and 95.59% on public datasets.
- Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis
  Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
- Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography
  LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts and enable ICD-10 code prediction on MIMIC-IV-ECHO.
- Zero-shot World Models Are Developmentally Efficient Learners
  A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
- LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
  LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
- FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis
  A hybrid motion estimation framework combines optimal stopping theory with foundation model semantic scores to reduce computation while maintaining accuracy and semantic coverage in video analysis.
- Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer
  The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.