Videomae: Masked au- toencoders are data-efficient learners for self-supervised video pre-training

· 2022 · arXiv 2203.12602

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

method 2 background 1

citation-polarity summary

use method 2 background 1

representative citing papers

SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames and modalities.

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

cs.RO · 2026-04-21 · unverdicted · novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.

SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

SpecSem-Net integrates Fourier-based spectral filtering with semantic-guided gated merging to detect AI-generated videos, reporting 87.25% accuracy on a new benchmark of five commercial generators and 95.59% on public datasets.

Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis

eess.SP · 2026-05-16 · unverdicted · novelty 6.0

Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.

Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts and enable ICD-10 code prediction on MIMIC-IV-ECHO.

Zero-shot World Models Are Developmentally Efficient Learners

cs.AI · 2026-04-11 · unverdicted · novelty 6.0

A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

cs.CV · 2023-10-03 · unverdicted · novelty 6.0

LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.

FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

A hybrid motion estimation framework combines optimal stopping theory with foundation model semantic scores to reduce computation while maintaining accuracy and semantic coverage in video analysis.

Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

cs.CV · 2026-04-08 · unverdicted · novelty 5.0

The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.

citing papers explorer

Showing 9 of 9 citing papers.

SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition cs.CV · 2026-05-03 · unverdicted · none · ref 23
SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames and modalities.
Mask World Model: Predicting What Matters for Robust Robot Policy Learning cs.RO · 2026-04-21 · unverdicted · none · ref 34
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.
SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection cs.CV · 2026-05-17 · unverdicted · none · ref 22
SpecSem-Net integrates Fourier-based spectral filtering with semantic-guided gated merging to detect AI-generated videos, reporting 87.25% accuracy on a new benchmark of five commercial generators and 95.59% on public datasets.
Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis eess.SP · 2026-05-16 · unverdicted · none · ref 28
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography cs.CV · 2026-04-16 · unverdicted · none · ref 25
LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts and enable ICD-10 code prediction on MIMIC-IV-ECHO.
Zero-shot World Models Are Developmentally Efficient Learners cs.AI · 2026-04-11 · unverdicted · none · ref 22
A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment cs.CV · 2023-10-03 · unverdicted · none · ref 124
LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis cs.CV · 2026-05-22 · unverdicted · none · ref 21
A hybrid motion estimation framework combines optimal stopping theory with foundation model semantic scores to reduce computation while maintaining accuracy and semantic coverage in video analysis.
Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer cs.CV · 2026-04-08 · unverdicted · none · ref 75
The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.

Videomae: Masked au- toencoders are data-efficient learners for self-supervised video pre-training

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer