Prototype learning for micro-gesture classification
4 Pith papers cite this work.
fields: cs.CV (4)
years: 2026 (4)
verdicts: UNVERDICTED (4)
citing papers explorer
- iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning
  iMiGUE-3K is the largest in-the-wild micro-gesture video dataset with 3.4K clips and 37M frames from real interviews, supporting self-supervised foundation models and benchmarks that show micro-gestures improve emotion understanding.
- Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition
  A decoupled adapter with independent spatial-temporal branches via depthwise convolutions and a dynamic augmentation strategy for long-tail data achieves first place with F1 0.43808 in a micro-gesture recognition challenge.
- Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition
  An ensemble of a self-supervised RGB model and supervised models achieves a new SOTA of 74.419% on the iMiGUE micro-gesture dataset.
- Rethinking the Role of Feature Engineering and Learning Strategies in Few-Shot Hidden Emotion Recognition
  A competition-winning multi-modal model for hidden emotion recognition integrates static and dynamic pose features via cross-attention and MIL pooling while noting representation collapse in vision foundation models on micro-dynamic tasks.