Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

Dan Guo; Fei Wang; Haixu Liu; Jihao Gu; Junjie Chen; Kun Li; Tingyi Liu; Zhiliang Wu

arxiv: 2606.09261 · v1 · pith:GWZZFGPNnew · submitted 2026-06-08 · 💻 cs.CV

Tingyi Liu, Kun Li, Fei Wang, Junjie Chen, Zhiliang Wu, Jihao Gu, Haixu Liu, Dan Guo

Pith reviewed 2026-06-27 17:10 UTC · model grok-4.3

classification 💻 cs.CV

keywords: micro-gesture recognition · self-supervised learning · masked video modeling · ensemble learning · iMiGUE dataset · video classification · RGB modality

The pith

A self-supervised RGB model pretrained on 120K unlabeled clips via masked video modeling raises ensemble top-1 accuracy to 74.419% on the iMiGUE micro-gesture test set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a simple self-supervised RGB baseline, pretrained with masked video modeling on in-domain unlabeled data and then fine-tuned, supplies complementary signals when added to existing supervised multi-stream models. On the iMiGUE benchmark this baseline alone reaches 69.224% top-1 accuracy; the resulting ensemble improves the previous best result by 1.206 points to 74.419%. The work therefore demonstrates that transferable representations learned without labels can be directly useful for fine-grained gesture classification.

Core claim

Pretraining an RGB model on 120K unlabeled clips via masked video modeling and fine-tuning it on iMiGUE produces a 69.224% top-1 accuracy that, when ensembled with prior supervised multi-stream models, yields 74.419% top-1 accuracy and sets a new state of the art 1.206 points above the previous record.

What carries the argument

The self-supervised RGB model pretrained via masked video modeling, used as an additional complementary branch in the multimodal ensemble.
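Adding a complementary branch to an ensemble can be sketched as weighted late fusion of per-class scores. This is a minimal illustration only: the paper's actual fusion rule and weights are not given on this page, and the names `fuse_scores` and `top1_accuracy`, the branch names, and all probabilities below are made-up assumptions.

```python
from typing import Dict, List

def fuse_scores(branch_probs: Dict[str, List[List[float]]],
                weights: Dict[str, float]) -> List[int]:
    """Weighted average of per-class probabilities; returns one class index per clip.

    Hypothetical helper: the paper's real ensemble rule may differ.
    """
    names = list(branch_probs)
    n_clips = len(branch_probs[names[0]])
    n_classes = len(branch_probs[names[0]][0])
    total_w = sum(weights[n] for n in names)
    preds = []
    for i in range(n_clips):
        fused = [
            sum(weights[n] * branch_probs[n][i][c] for n in names) / total_w
            for c in range(n_classes)
        ]
        # Ties are broken in favour of the lowest class index.
        preds.append(max(range(n_classes), key=fused.__getitem__))
    return preds

def top1_accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of clips whose predicted class equals the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy scores for three clips and three classes (illustrative, not from the paper).
labels = [0, 1, 2]
supervised = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
ssl_rgb = [[0.5, 0.4, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]]

supervised_only = fuse_scores({"supervised": supervised}, {"supervised": 1.0})
with_ssl = fuse_scores({"supervised": supervised, "ssl_rgb": ssl_rgb},
                       {"supervised": 1.0, "ssl_rgb": 1.0})
```

In this toy case the supervised branch misclassifies the second clip and the fused prediction corrects it, which is the kind of complementarity the argument depends on.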

If this is right

Self-supervised pretraining on unlabeled in-domain video can be added as a low-cost complementary stream without redesigning existing supervised pipelines.
The 1.206-point gain shows that masked video modeling captures gesture-relevant structure that supervised training on the smaller labeled set misses.
Ablation studies on ensemble weighting confirm that the self-supervised branch contributes measurably rather than acting as noise.
The same pretraining recipe can be applied to other fine-grained video tasks that have abundant unlabeled footage but limited labeled examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pretraining corpus were expanded beyond 120K clips or drawn from a broader distribution of gestures, the single-model accuracy and the final ensemble margin could increase further.
The same masked-video pretraining step might transfer to related tasks such as subtle facial-expression recognition or micro-expression detection that also suffer from label scarcity.
Because the method requires only standard video encoders and no new architecture, it can be inserted into any existing multi-stream gesture system with minimal engineering cost.

Load-bearing premise

The features obtained from masked video modeling on unlabeled in-domain clips remain complementary to the features already captured by the supervised multi-stream models.

What would settle it

An ablation that replaces the self-supervised branch with either a randomly initialized RGB model or one pretrained on unrelated video data and measures whether the ensemble accuracy falls back to or below the prior state of the art.

Figures

Figures reproduced from arXiv: 2606.09261 by Dan Guo, Fei Wang, Haixu Liu, Jihao Gu, Junjie Chen, Kun Li, Tingyi Liu, Zhiliang Wu.

**Figure 1.** Overview of the proposed multi-stream ensemble solution for micro-gesture recognition. (a) Single-stream supervised learning with Video Swin Transformer [29] using RGB and depth modalities as inputs. (b) Multi-stream supervised learning with PoseConv3D [4] to learn complementary representations from RGB and skeleton data. (c) Self-supervised learning with MG-FM-RGB [34], which is built upon SMILE [32]. For…

Original abstract

In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.
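For readers unfamiliar with masked video modeling, the sketch below shows VideoMAE-style tube masking, where one random spatial pattern of hidden patches is repeated at every temporal position and the model is trained to reconstruct the hidden content. It is an assumption that this resembles the pretraining used here: the abstract does not describe the masking strategy of the MG-FM-RGB/SMILE recipe, and the function name `tube_mask`, the 14×14 patch grid, and the 0.9 mask ratio are illustrative choices.

```python
import random
from typing import List

def tube_mask(num_frames: int, grid_h: int, grid_w: int,
              mask_ratio: float, seed: int = 0) -> List[List[bool]]:
    """Return mask[t][p], where True means patch p is hidden at temporal position t.

    The same spatial pattern is reused for every t, so a hidden patch
    cannot be recovered by copying it from a neighbouring frame.
    """
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    rng = random.Random(seed)
    hidden = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in hidden for p in range(num_patches)]
    return [list(frame_mask) for _ in range(num_frames)]

# Illustrative settings only; not taken from the paper.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9)
visible_per_frame = mask[0].count(False)
```

With these settings only 20 of the 196 patches per temporal position remain visible to the encoder.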

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gets a new SOTA on iMiGUE by adding a masked-video SSL RGB model to prior supervised streams for a 1.2-point gain, but the experiments do not isolate whether the SSL pretraining supplies the complementarity.

The one thing to know is that this challenge solution reaches 74.419% top-1 on the iMiGUE test set, 1.206 points above the previous best, by ensembling a self-supervised RGB model with existing supervised multi-stream models. The SSL model alone, after masked video modeling pretraining on 120K unlabeled clips, hits 69.224%.

The work does a couple of things cleanly. It produces a usable single-model baseline from in-domain unlabeled data, which is a concrete data point for this task. It also mentions ablation studies on the ensemble strategy, which is more than many competition reports provide.

The soft spot is the missing isolation of the SSL contribution. The abstract does not give the accuracy of the supervised models in ensemble without the new branch, nor any diversity metric between the SSL predictions and the others. Without those numbers the claim that the masked-video branch supplies complementary signals rests on assumption rather than direct evidence. The stress-test concern is on target.

This paper is mainly for teams working on the MiGA challenge or the iMiGUE dataset. Readers outside that narrow area will find little to take away, since the method itself is an application of known techniques rather than a new framework.

I would bring it to a reading group only if the group is surveying recent challenge winners. I would not cite it in my own work. It deserves peer review because it delivers a verifiable new benchmark result with enough experimental detail to be checked.

Referee Report

1 major / 0 minor

Summary. The paper presents XInsight Lab's winning entry to the micro-gesture classification track of the 4th MiGA Challenge. It describes a multimodal ensemble that adds a self-supervised RGB branch—pretrained via masked video modeling on 120K unlabeled in-domain clips and fine-tuned on iMiGUE—to prior supervised multi-stream models. The SSL-only RGB model reaches 69.224% top-1 accuracy; the full ensemble reaches 74.419%, exceeding the previous state of the art by 1.206 percentage points. Ablation studies on ensemble strategy are cited as validation.

Significance. If the reported gain is attributable to the self-supervised branch rather than to the addition of an extra RGB stream, the work supplies concrete evidence that masked-video pretraining on unlabeled in-domain data can improve ensemble performance on fine-grained micro-gesture tasks. The numbers are obtained on an independent challenge test set and the approach is simple enough to be reproducible.

major comments (1)

[Abstract] The headline claim that the SSL RGB model supplies complementary signals (and is therefore responsible for the +1.206 pp gain) is not supported by any reported number for the supervised-only ensemble accuracy or by any diversity metric (prediction disagreement, feature correlation) between the SSL branch and the other streams. The abstract states that ablation studies on ensemble strategy exist, yet the provided text contains none of the required quantities.
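The diversity statistics requested here are simple to compute from per-clip predictions. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the prediction lists are made up for illustration, and none of the numbers come from the paper.

```python
from typing import List

def disagreement_rate(preds_a: List[int], preds_b: List[int]) -> float:
    """Fraction of clips on which the two branches predict different classes."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def double_fault_rate(preds_a: List[int], preds_b: List[int],
                      labels: List[int]) -> float:
    """Fraction of clips that both branches misclassify."""
    return sum(a != y and b != y
               for a, b, y in zip(preds_a, preds_b, labels)) / len(labels)

def oracle_accuracy(preds_a: List[int], preds_b: List[int],
                    labels: List[int]) -> float:
    """Fraction of clips where at least one branch is correct.

    Upper bound for any rule that picks one of the two branches'
    predictions per clip.
    """
    return 1.0 - double_fault_rate(preds_a, preds_b, labels)

# Toy predictions for five clips (illustrative, not from the paper).
labels = [0, 1, 2, 1, 0]
ssl_preds = [0, 1, 2, 0, 1]
sup_preds = [0, 0, 2, 1, 1]

disagree = disagreement_rate(ssl_preds, sup_preds)
both_wrong = double_fault_rate(ssl_preds, sup_preds, labels)
oracle = oracle_accuracy(ssl_preds, sup_preds, labels)
```

A high disagreement rate combined with a low double-fault rate would indicate that the branches make different errors, which is what the complementarity claim requires. Reporting these alongside the supervised-only ensemble accuracy would address the comment directly.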

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the recommendation for major revision. We address the point below and will update the manuscript to strengthen the supporting evidence for our claims.

Referee: [Abstract] The headline claim that the SSL RGB model supplies complementary signals (and is therefore responsible for the +1.206 pp gain) is not supported by any reported number for the supervised-only ensemble accuracy or by any diversity metric (prediction disagreement, feature correlation) between the SSL branch and the other streams. The abstract states that ablation studies on ensemble strategy exist, yet the provided text contains none of the required quantities.

Authors: We agree that the abstract's headline claim would be more robust if accompanied by the accuracy of the supervised-only ensemble and explicit diversity metrics. The reported +1.206 pp improvement is measured against the previous state of the art (73.213%), which was achieved by supervised multi-stream models; our ensemble augments those models with the SSL RGB branch. The full paper contains ablation studies on ensemble strategy, but these do not yet include the exact supervised-only ensemble accuracy or diversity statistics. We will revise the abstract for precision and expand the experiments section with a table reporting ensemble performance both with and without the SSL branch, plus prediction-disagreement statistics between branches. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical accuracies measured on independent challenge test set

The paper reports top-1 accuracies (69.224% for the SSL RGB model, 74.419% for the ensemble) directly on the iMiGUE test set from the MiGA Challenge. Pretraining uses 120K unlabeled clips and fine-tuning uses the training split; the final metric is an external benchmark score with no fitted parameters, equations, or self-citations that reduce the reported gain to a tautology by construction. Ablation studies on ensemble strategy are referenced as external validation. This is a standard empirical ML result with no load-bearing derivation that collapses to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are explicitly detailed in the abstract; the approach relies on standard self-supervised learning techniques from prior literature.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking the Role of Feature Engineering and Learning Strategies in Few-Shot Hidden Emotion Recognition
cs.CV 2026-06 unverdicted novelty 3.0

A competition-winning multi-modal model for hidden emotion recognition integrates static and dynamic pose features via cross-attention and MIL pooling while noting representation collapse in vision foundation models o...

Reference graph

Works this paper leans on

45 extracted references · 1 linked inside Pith · cited by 1 Pith paper

[1] Chen, G., Wang, F., Li, K., Wu, Z., Fan, H., Yang, Y., Wang, M., Guo, D.: Prototype learning for micro-gesture classification. arXiv preprint arXiv:2408.03097 (2024)
[2] Chen, H., Schuller, B.W., Adeli, E., Zhao, G.: The 3rd challenge on human behavior analysis for emotion understanding (miga) 2025: From recognition to emotion understanding (2025)
[3] Chen, H., Shi, H., Liu, X., Li, X., Zhao, G.: Smg: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis. International Journal of Computer Vision 131(6), 1346–1366 (2023)
[4] Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2969–2978 (2022)
[5] Gu, J., Li, K., Wang, F., Wei, Y., Wu, Z., Fan, H., Wang, M.: Motion matters: Motion-guided modulation network for skeleton-based micro-action recognition. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 5461–5470 (2025)
[6] Gu, J., Wang, F., Li, K., Wei, Y., Wu, Z., Guo, D.: Mm-gesture: towards precise micro-gesture recognition through multimodal fusion. arXiv preprint arXiv:2507.08344 (2025)
[7] Guo, D., Li, K., Hu, B., Zhang, Y., Wang, M.: Benchmarking micro-action recognition: Dataset, methods, and applications. IEEE Transactions on Circuits and Systems for Video Technology 34(7), 6238–6252 (2024)
[8] Guo, D., Li, X., Li, K., Chen, H., Hu, J., Zhao, G., Yang, Y., Wang, M.: Mac 2024: Micro-action analysis grand challenge. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 11304–11305 (2024)
[9] Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., Ding, R.: Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In: Proceedings of the AAAI conference on artificial intelligence. vol. 36, pp. 762–770 (2022)
[10] Hao, Y., Wang, S., Cao, P., Gao, X., Xu, T., Wu, J., He, X.: Attention in attention: Modeling context correlation for efficient video classification. IEEE Transactions on Circuits and Systems for Video Technology 32(10), 7120–7132 (2022)
[11] Hao, Y., Zhang, H., Ngo, C.W., He, X.: Group contextualization for video recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 928–938 (2022)
[12] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)
[13] Hu, X., Pu, C., Li, Y., Xu, Y., Xie, K., Miao, Q.: Enhancing micro-gesture classification via global-aware importance estimation in vision transformer. MiGA@ IJCAI (2025)
[14] Huang, H., Guo, X., Peng, W., Xia, Z.: Micro-gesture classification based on ensemble hypergraph-convolution transformer. In: MiGA@ IJCAI (2023)
[15] Huang, H., Wang, Y., Linghu, K., Xia, Z.: Multi-modal micro-gesture classification via multi-scale heterogeneous ensemble network. In: MiGA@ IJCAI (2024)
[16] Li, K., Gu, J., Wang, F., Wu, Z., Fan, H., Guo, D.: Ma-bench: Towards fine-grained micro-action understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20118–20128 (June 2026)
[17] Li, K., Guo, D., Chen, G., Fan, C., Xu, J., Wu, Z., Fan, H., Wang, M.: Prototypical calibrating ambiguous samples for micro-action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4815–4823 (2025)
[18] Li, K., Guo, D., Chen, G., Liu, F., Wang, M.: Data augmentation for human behavior analysis in multi-person conversations. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 9516–9520 (2023)
[19] Li, K., Guo, D., Chen, G., Peng, X., Wang, M.: Joint skeletal and semantic embedding loss for micro-gesture classification. arXiv preprint arXiv:2307.10624 (2023)
[20] Li, K., Guo, D., Li, X., Chen, H., Liu, P., Wang, F., Hu, J., Zhao, G., Wang, M.: Mac 2025: The 2nd micro-action analysis grand challenge. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 14216–14221 (2025)
[21] Li, K., Liu, P., Guo, D., Wang, F., Wu, Z., Fan, H., Wang, M.: Mmad: Multi-label micro-action detection in videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13225–13236 (2025)
[22] Li, L., Wang, M., Ni, B., Wang, H., Yang, J., Zhang, W.: 3d human action representation learning via cross-view consistency pursuit. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4741–4750 (2021)
[23] Li, Q., Huang, X., Wan, Z., Hu, L., Wu, S., Zhang, J., Shan, S., Wang, Z.: Data-efficient masked video modeling for self-supervised action recognition. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 2723–2733 (2023)
[24] Lin, L., Song, S., Yang, W., Liu, J.: Ms2l: Multi-task self-supervised learning for skeleton based action recognition. In: Proceedings of the 28th ACM international conference on multimedia. pp. 2490–2498 (2020)
[25] Liu, P., Li, K., Wang, F., Wei, Y., She, J., Guo, D.: Online micro-gesture recognition using data augmentation and spatial-temporal attention. arXiv preprint arXiv:2507.09512 (2025)
[26] Liu, P., Wang, F., Li, K., Chen, G., Wei, Y., Tang, S., Wu, Z., Guo, D.: Micro-gesture online recognition using learnable query points. arXiv preprint arXiv:2407.04490 (2024)
[27] Liu, X., Shi, H., Chen, H., Yu, Z., Li, X., Zhao, G.: imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10631–10642 (2021)
[28] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)
[29] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3202–3211 (2022)
[30] Mao, Y., Deng, J., Zhou, W., Fang, Y., Ouyang, W., Li, H.: Masked motion predictors are strong 3d action representation learners. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10181–10191 (2023)
[31] Shang, T., Hao, Y., Pei, M., Li, K., Ben, H., Wang, S.: Cross-modal feature enhancement and contrastive alignment for micro-gesture recognition. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 203–217. Springer (2025)
[32] Thoker, F.M., Jiang, L., Zhao, C., Ghanem, B.: Smile: Infusing spatial and motion semantics in masked video learning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8438–8449 (2025)
[33] Tong, Z., Song, Y., Wang, J., Wang, L.: Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems 35, 10078–10093 (2022)
[34] Wang, C., Chen, H., Wei, H., Yang, Y., Chen, Y., Zhao, G.: imigue-3k: A large-scale benchmark for micro-gesture analysis with self-supervised learning. arXiv preprint arXiv:2605.17179 (2026)
[35] Wang, F., Guo, D., Li, K., Wang, M.: Eulermormer: Robust eulerian motion magnification via dynamic filtering within transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 5345–5353 (2024)
[36] Wang, F., Guo, D., Li, K., Zhong, Z., Wang, M.: Frequency decoupling for motion magnification via multi-level isomorphic architecture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18984–18994 (2024)
[37] Wang, F., Li, K., Nie, Y., Duan, Z., Zou, P., Wu, Z., Wang, Y., Wei, Y.: Exploiting ensemble learning for cross-view isolated sign language recognition. In: Companion Proceedings of the ACM on Web Conference 2025. pp. 2453–2457 (2025)
[38] Wang, F., Yang, J., Chen, J., Liu, Y., Li, K., Wei, Y., Guo, D., Wang, M.: Xinsight: Integrative stage-consistent psychological counseling support agents for digital well-being. In: Proceedings of the ACM Web Conference 2026. pp. 9297–9308 (2026)
[39] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: Videomae v2: Scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14549–14560 (2023)
[40] Wang, R., Li, K., Tong, A., Xu, J., Guo, D., Wang, M.: Gait emotion recognition via uncertainty-oriented class discriminative learning. IEEE Transactions on Affective Computing pp. 1–14 (2026)
[41] Wang, T., Lin, X., Xu, Y., Ye, Q., Guo, D., Escalera, S., Khoriba, G., Yu, Z.: Micro-gesture recognition: A comprehensive survey of datasets, methods, and challenges. Machine Intelligence Research 23(2), 308–330 (2026)
[42] Wang, Y., Dong, Z., Li, P., Liu, Y.: A multimodal micro-gesture classification model based on clip. In: MiGA@ IJCAI (2024)
[43] Wang, Y., Liu, H., Xu, T., Shi, C., Xing, H.: Weak to strong: Vlm-based pseudo-labeling as a weakly supervised training strategy in multimodal video-based hidden emotion understanding tasks. arXiv preprint arXiv:2602.08057 (2026)
[44] Xia, Z., Huang, H., Chen, H., Feng, X., Zhao, G.: Hybrid-supervised hypergraph-enhanced transformer for micro-gesture based emotion recognition. IEEE Transactions on Affective Computing (2025)
[45] Xu, H., Cheng, L., Wang, Y., Tang, S., Zhong, Z.: Towards fine-grained emotion understanding via skeleton-based micro-gesture recognition. arXiv preprint arXiv:2506.12848 (2025)

[1] [1]

arXiv preprint arXiv:2408.03097 (2024)

Chen, G., Wang, F., Li, K., Wu, Z., Fan, H., Yang, Y., Wang, M., Guo, D.: Pro- totype learning for micro-gesture classification. arXiv preprint arXiv:2408.03097 (2024)

arXiv 2024

[2] [2]

Chen, H., Schuller, B.W., Adeli, E., Zhao, G.: The 3rd challenge on human behav- ior analysis for emotion understanding (miga) 2025: From recognition to emotion understanding (2025) Self-supervised Learning Matters 9

2025

[3] [3]

International Journal of Computer Vision131(6), 1346–1366 (2023)

Chen, H., Shi, H., Liu, X., Li, X., Zhao, G.: Smg: A micro-gesture dataset to- wards spontaneous body gestures for emotional stress state analysis. International Journal of Computer Vision131(6), 1346–1366 (2023)

2023

[4] [4]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2969–2978 (2022)

2022

[5] [5]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Gu, J., Li, K., Wang, F., Wei, Y., Wu, Z., Fan, H., Wang, M.: Motion matters: Motion-guided modulation network for skeleton-based micro-action recognition. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 5461–5470 (2025)

2025

[6] [6]

arXiv preprint arXiv:2507.08344 (2025)

Gu, J., Wang, F., Li, K., Wei, Y., Wu, Z., Guo, D.: Mm-gesture: towards precise micro-gesture recognition through multimodal fusion. arXiv preprint arXiv:2507.08344 (2025)

arXiv 2025

[7] [7]

IEEE Transactions on Circuits and Systems for Video Technology34(7), 6238–6252 (2024)

Guo, D., Li, K., Hu, B., Zhang, Y., Wang, M.: Benchmarking micro-action recog- nition: Dataset, methods, and applications. IEEE Transactions on Circuits and Systems for Video Technology34(7), 6238–6252 (2024)

2024

[8] [8]

In: Proceedings of the 32nd ACM International Conference on Multimedia

Guo, D., Li, X., Li, K., Chen, H., Hu, J., Zhao, G., Yang, Y., Wang, M.: Mac 2024: Micro-action analysis grand challenge. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 11304–11305 (2024)

2024

[9] [9]

In: Proceedings of the AAAI conference on artificial intelligence

Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., Ding, R.: Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In: Proceedings of the AAAI conference on artificial intelligence. vol. 36, pp. 762–770 (2022)

2022

[10] [10]

IEEE Transactions on Circuits and Systems for Video Technology32(10), 7120–7132 (2022)

Hao, Y., Wang, S., Cao, P., Gao, X., Xu, T., Wu, J., He, X.: Attention in attention: Modeling context correlation for efficient video classification. IEEE Transactions on Circuits and Systems for Video Technology32(10), 7120–7132 (2022)

2022

[11] [11]

In: Proceedings of the ieee/cvf conference on computer vision and pattern recognition

Hao, Y., Zhang, H., Ngo, C.W., He, X.: Group contextualization for video recog- nition. In: Proceedings of the ieee/cvf conference on computer vision and pattern recognition. pp. 928–938 (2022)

2022

[12] [12]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)

2022

[13] [13]

MiGA@ IJCAI (2025)

Hu, X., Pu, C., Li, Y., Xu, Y., Xie, K., Miao, Q.: Enhancing micro-gesture clas- sification via global-aware importance estimation in vision transformer. MiGA@ IJCAI (2025)

2025

[14] [14]

In: MiGA@ IJCAI (2023)

Huang, H., Guo, X., Peng, W., Xia, Z.: Micro-gesture classification based on en- semble hypergraph-convolution transformer. In: MiGA@ IJCAI (2023)

2023

[15] [15]

In: MiGA@ IJCAI (2024)

Huang, H., Wang, Y., Linghu, K., Xia, Z.: Multi-modal micro-gesture classification via multi-scale heterogeneous ensemble network. In: MiGA@ IJCAI (2024)

2024

[16] [16]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, K., Gu, J., Wang, F., Wu, Z., Fan, H., Guo, D.: Ma-bench: Towards fine- grained micro-action understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20118–20128 (June 2026)

2026

[17] [17]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Li, K., Guo, D., Chen, G., Fan, C., Xu, J., Wu, Z., Fan, H., Wang, M.: Prototypical calibrating ambiguous samples for micro-action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4815–4823 (2025)

2025

[18] [18]

In: Proceedings of the 31st ACM International Conference on Multimedia

Li, K., Guo, D., Chen, G., Liu, F., Wang, M.: Data augmentation for human behavior analysis in multi-person conversations. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 9516–9520 (2023)

2023

[19] [19]

arXiv preprint arXiv:2307.10624 (2023) 10 Tingyi Liu et al

Li, K., Guo, D., Chen, G., Peng, X., Wang, M.: Joint skeletal and semantic embed- ding loss for micro-gesture classification. arXiv preprint arXiv:2307.10624 (2023) 10 Tingyi Liu et al

arXiv 2023

[20] [20]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Li, K., Guo, D., Li, X., Chen, H., Liu, P., Wang, F., Hu, J., Zhao, G., Wang, M.: Mac 2025: The 2nd micro-action analysis grand challenge. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 14216–14221 (2025)

2025

[21] [21]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Li, K., Liu, P., Guo, D., Wang, F., Wu, Z., Fan, H., Wang, M.: Mmad: Multi-label micro-action detection in videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13225–13236 (2025)

2025

[22] [22]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Li, L., Wang, M., Ni, B., Wang, H., Yang, J., Zhang, W.: 3d human action rep- resentation learning via cross-view consistency pursuit. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4741–4750 (2021)

2021

[23] [23]

In: Proceed- ings of the 31st ACM International Conference on Multimedia

Li, Q., Huang, X., Wan, Z., Hu, L., Wu, S., Zhang, J., Shan, S., Wang, Z.: Data- efficient masked video modeling for self-supervised action recognition. In: Proceed- ings of the 31st ACM International Conference on Multimedia. pp. 2723–2733 (2023)

2023

[24] Lin, L., Song, S., Yang, W., Liu, J.: Ms2l: Multi-task self-supervised learning for skeleton based action recognition. In: Proceedings of the 28th ACM international conference on multimedia. pp. 2490–2498 (2020)

[25] Liu, P., Li, K., Wang, F., Wei, Y., She, J., Guo, D.: Online micro-gesture recognition using data augmentation and spatial-temporal attention. arXiv preprint arXiv:2507.09512 (2025)

[26] Liu, P., Wang, F., Li, K., Chen, G., Wei, Y., Tang, S., Wu, Z., Guo, D.: Micro-gesture online recognition using learnable query points. arXiv preprint arXiv:2407.04490 (2024)

[27] Liu, X., Shi, H., Chen, H., Yu, Z., Li, X., Zhao, G.: imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10631–10642 (2021)

[28] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)

[29] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3202–3211 (2022)

[30] Mao, Y., Deng, J., Zhou, W., Fang, Y., Ouyang, W., Li, H.: Masked motion predictors are strong 3d action representation learners. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10181–10191 (2023)

[31] Shang, T., Hao, Y., Pei, M., Li, K., Ben, H., Wang, S.: Cross-modal feature enhancement and contrastive alignment for micro-gesture recognition. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 203–217. Springer (2025)

[32] Thoker, F.M., Jiang, L., Zhao, C., Ghanem, B.: Smile: Infusing spatial and motion semantics in masked video learning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8438–8449 (2025)

[33] Tong, Z., Song, Y., Wang, J., Wang, L.: Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems 35, 10078–10093 (2022)

[34] Wang, C., Chen, H., Wei, H., Yang, Y., Chen, Y., Zhao, G.: imigue-3k: A large-scale benchmark for micro-gesture analysis with self-supervised learning. arXiv preprint arXiv:2605.17179 (2026)

[35] Wang, F., Guo, D., Li, K., Wang, M.: Eulermormer: Robust eulerian motion magnification via dynamic filtering within transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 5345–5353 (2024)

[36] Wang, F., Guo, D., Li, K., Zhong, Z., Wang, M.: Frequency decoupling for motion magnification via multi-level isomorphic architecture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18984–18994 (2024)

[37] Wang, F., Li, K., Nie, Y., Duan, Z., Zou, P., Wu, Z., Wang, Y., Wei, Y.: Exploiting ensemble learning for cross-view isolated sign language recognition. In: Companion Proceedings of the ACM on Web Conference 2025. pp. 2453–2457 (2025)

[38] Wang, F., Yang, J., Chen, J., Liu, Y., Li, K., Wei, Y., Guo, D., Wang, M.: Xinsight: Integrative stage-consistent psychological counseling support agents for digital well-being. In: Proceedings of the ACM Web Conference 2026. pp. 9297–9308 (2026)

[39] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: Videomae v2: Scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14549–14560 (2023)

[40] Wang, R., Li, K., Tong, A., Xu, J., Guo, D., Wang, M.: Gait emotion recognition via uncertainty-oriented class discriminative learning. IEEE Transactions on Affective Computing pp. 1–14 (2026)

[41] Wang, T., Lin, X., Xu, Y., Ye, Q., Guo, D., Escalera, S., Khoriba, G., Yu, Z.: Micro-gesture recognition: A comprehensive survey of datasets, methods, and challenges. Machine Intelligence Research 23(2), 308–330 (2026)

[42] Wang, Y., Dong, Z., Li, P., Liu, Y.: A multimodal micro-gesture classification model based on clip. In: MiGA@IJCAI (2024)

[43] Wang, Y., Liu, H., Xu, T., Shi, C., Xing, H.: Weak to strong: Vlm-based pseudo-labeling as a weakly supervised training strategy in multimodal video-based hidden emotion understanding tasks. arXiv preprint arXiv:2602.08057 (2026)

[44] Xia, Z., Huang, H., Chen, H., Feng, X., Zhao, G.: Hybrid-supervised hypergraph-enhanced transformer for micro-gesture based emotion recognition. IEEE Transactions on Affective Computing (2025)

[45] Xu, H., Cheng, L., Wang, Y., Tang, S., Zhong, Z.: Towards fine-grained emotion understanding via skeleton-based micro-gesture recognition. arXiv preprint arXiv:2506.12848 (2025)