Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

2026 · cs.CV · arXiv 2606.09261

1 Pith paper cites this work.


abstract

In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.
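The abstract does not specify how the branch predictions are combined, so the following is only a minimal sketch of one common choice, weighted score-level (late) fusion. It assumes each branch outputs per-class logits for the same clip. The function names (`softmax`, `fuse_scores`, `predict`), the branch weights, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(branch_logits, weights):
    """Weighted average of per-branch class probabilities.

    branch_logits: one list of logits per branch, all of equal length.
    weights: one non-negative weight per branch (hypothetical values;
             the paper's actual fusion rule may differ).
    """
    if len(branch_logits) != len(weights):
        raise ValueError("need exactly one weight per branch")
    total_w = sum(weights)
    if total_w <= 0:
        raise ValueError("weights must sum to a positive value")
    num_classes = len(branch_logits[0])
    fused = [0.0] * num_classes
    for logits, w in zip(branch_logits, weights):
        if len(logits) != num_classes:
            raise ValueError("all branches must score the same classes")
        for c, p in enumerate(softmax(logits)):
            fused[c] += (w / total_w) * p
    return fused


def predict(branch_logits, weights):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_scores(branch_logits, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy example: an RGB branch and a skeleton branch that disagree.
rgb_logits = [2.0, 0.0, 0.0]
skeleton_logits = [0.0, 1.0, 0.0]
equal_vote = predict([rgb_logits, skeleton_logits], [1.0, 1.0])
skeleton_heavy = predict([rgb_logits, skeleton_logits], [1.0, 3.0])
```

In this toy case the more confident RGB branch wins under equal weights, while tripling the skeleton weight flips the decision. In practice such weights would typically be tuned on a validation split.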

representative citing papers

Rethinking the Role of Feature Engineering and Learning Strategies in Few-Shot Hidden Emotion Recognition

cs.CV · 2026-06-30 · unverdicted · novelty 3.0

A competition-winning multi-modal model for hidden emotion recognition integrates static and dynamic pose features via cross-attention and MIL pooling while noting representation collapse in vision foundation models on micro-dynamic tasks.
