arxiv: 2605.17179 · v1 · pith:7JNVWKRQnew · submitted 2026-05-16 · 💻 cs.CV

iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning

Pith reviewed 2026-05-20 14:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords micro-gesture · emotion understanding · video dataset · benchmark · foundation model · affective computing · body language · self-supervised learning

The pith

A dataset of micro-gestures from tennis interviews shows that body language cues substantially improve automated emotion understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents iMiGUE-3K, the largest video dataset for micro-gesture analysis, built from more than 3,400 clips and 37 million frames of professional tennis players in public interviews. It annotates 32 classes of subtle, unintentional movements using a model-assisted collection process and introduces MG-FMs as a foundation model for learning gesture representations. Evaluations across recognition, retrieval, and emotion tasks indicate that micro-gesture signals add measurable value beyond facial or speech cues alone. A sympathetic reader would care because current affective systems largely ignore these subconscious body signals that reflect inner emotional states.

Core claim

By releasing iMiGUE-3K and MG-FMs, the work establishes that micro-gesture analysis significantly improves emotion understanding, as shown through systematic testing of representative methods on five tasks including unsupervised, semi-supervised, and supervised micro-gesture recognition plus retrieval and emotion recognition.

What carries the argument

The iMiGUE-3K dataset of 32 annotated micro-gesture classes, collected via model-based crowd-sourcing from in-the-wild tennis interview videos, together with MG-FMs, a discriminative foundation model for transferable gesture representation learning.

If this is right

  • Micro-gesture recognition becomes feasible in unsupervised, semi-supervised, and supervised regimes on a shared benchmark.
  • The foundation model supports transferable representations for gesture retrieval and related downstream tasks.
  • Emotion recognition systems gain a new modality that captures subconscious cues not visible in faces or speech.
  • The benchmark enables progress toward applications in psychological diagnostics and human-computer interaction.
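The retrieval task mentioned in the list above can be illustrated with a minimal sketch. It assumes that gesture clips have already been mapped to fixed-length embedding vectors and that retrieval quality is scored with recall@k under cosine similarity; the paper summary does not state which similarity measure or metric the authors use, so both are assumptions. The function names, the toy embeddings, and the labels `class_a` and `class_b` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query, gallery, k):
    """Return indices of the k gallery embeddings most similar to the query."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: cosine(query, gallery[i]),
                    reverse=True)
    return ranked[:k]

def recall_at_k(queries, query_labels, gallery, gallery_labels, k):
    """Fraction of queries with at least one same-label item in the top k."""
    hits = 0
    for q, label in zip(queries, query_labels):
        top = retrieve(q, gallery, k)
        if any(gallery_labels[i] == label for i in top):
            hits += 1
    return hits / len(queries)

# Toy 2-D embeddings (hypothetical; real embeddings would come from a model).
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gallery_labels = ["class_a", "class_a", "class_b", "class_b"]
queries = [[0.8, 0.2], [0.2, 0.8]]
query_labels = ["class_a", "class_b"]

print(retrieve(queries[0], gallery, 2))
print(recall_at_k(queries, query_labels, gallery, gallery_labels, 1))
```

A brute-force ranking like this is only practical for small galleries; at the scale described for iMiGUE-3K an approximate nearest-neighbour index would likely be needed.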

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same collection approach could scale to other high-stakes public settings to gather more diverse natural micro-gestures.
  • Combining micro-gesture signals with existing modalities may increase robustness of emotion models in noisy real-world environments.
  • Self-supervised pre-training on this volume of data could yield representations useful for broader body-language understanding tasks.

Load-bearing premise

The model-based crowd-sourcing strategy produces accurate, unbiased fine-grained annotations for the 32 micro-gesture classes that generalize beyond tennis press interviews.

What would settle it

An independent test set of everyday human interactions where models trained on iMiGUE-3K show no accuracy gain in emotion recognition compared with face-and-speech baselines.

Figures

Figures reproduced from arXiv: 2605.17179 by Chengyan Wang, Guoying Zhao, Haoyu Chen, Hui Wei, Yueyi Yang, Yunquan Chen.

Figure 1: We introduce iMiGUE-3K, an in-the-wild dataset of micro-gesture …
Figure 2: Overview of the annotation schema and category distribution in iMiGUE-3K. (a) The 32 annotation labels are organized into five body-part …
Figure 3: Statistics of the iMiGUE-3K dataset. (a) Video-Level Distribution by Subject Region in iMiGUE-3K. (b) Duration distribution of the videos …
Figure 4: Comparison between duration distributions of iMiGUE [45] and …

Original abstract

Emotion understanding is a fundamental challenge in affective computing and artificial intelligence. While existing approaches predominantly focus on facial expressions and speech, they often overlook the rich emotional cues conveyed through body language. Recently, micro-gestures (MGs), unintentional, subconscious movements driven by inner feelings, have attracted increasing attention as an alternative to other cues. However, there are no existing large-scale datasets supporting the pre-training of the MG foundation model. To advance MG research, we present a new benchmark for micro-gesture-based emotion understanding, featuring key contributions with a novel dataset (iMiGUE-3K) and a series of foundation models for different tasks. Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K, the largest MG dataset to date. It comprises video recordings from 332 distinct professional tennis players' public press interviews over the past seven years, totaling more than 3.4K long video clips and 37 million frames. The dataset includes 32 micro-gesture classes with rich descriptive annotations, making it the first large-scale, in-the-wild, video dataset for fine-grained gesture-based emotion analysis. Built on iMiGUE-3K, we propose MG-FMs, a discriminative foundation model for transferable gesture presentation learning. Based on the foundation model, we establish five comprehensive evaluation tasks: MG recognition (unsupervised, semi-supervised, supervised), MG retrieval, and MG emotion recognition. Our systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding. We hope this work can provide comprehensive tools for MG analysis and set a solid foundation for future research in psychological diagnostics, affective computing, and advanced human-computer interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces iMiGUE-3K, the largest in-the-wild video dataset for micro-gesture analysis, constructed from 3.4K clips of 332 professional tennis players' press interviews using a model-based crowd-sourcing strategy to annotate 32 micro-gesture classes. It proposes MG-FMs, a discriminative foundation model for transferable gesture representation learning via self-supervised methods, and evaluates it across five tasks: unsupervised/semi-supervised/supervised MG recognition, MG retrieval, and MG emotion recognition, claiming that micro-gesture analysis significantly improves emotion understanding over existing approaches focused on facial expressions and speech.

Significance. If the annotations prove reliable and the reported gains generalize, this benchmark and the associated foundation models would address a clear gap in affective computing by enabling scalable study of subconscious body-language cues for emotion recognition. The scale (37M frames), multi-task evaluation protocol, and self-supervised pre-training approach represent concrete strengths that could support reproducible progress in psychological diagnostics and HCI.

major comments (2)
  1. [Dataset Construction] Dataset Construction section: The model-based crowd-sourcing procedure for producing the 32-class fine-grained micro-gesture annotations is presented without any reported validation metrics (inter-annotator agreement, human verification of model outputs, or cross-domain checks). This directly undermines the central claim that MG features improve emotion recognition, because systematic label errors or interview-specific biases would render the five evaluation tasks unreliable.
  2. [Evaluation] Evaluation section: The abstract and results claim that 'systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding,' yet no quantitative numbers, error bars, data-split details, or baseline comparisons appear in the provided summary or abstract. Without these, the magnitude and robustness of the claimed gains cannot be assessed.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta on the MG emotion recognition task) to support the improvement claim.
  2. A comparison table of iMiGUE-3K statistics against prior micro-gesture or gesture datasets is missing and would help readers gauge the scale advantage.
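
The inter-annotator agreement requested in major comment 1 is commonly reported as Cohen's kappa. The sketch below shows how such a figure could be computed for two annotators labelling the same clips. It is an illustration only: the paper summary does not say which agreement statistic the authors would use, and the label lists are made-up integers, not iMiGUE-3K annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels for eight clips (class ids are placeholders).
annotator_a = [0, 0, 1, 1, 2, 2, 0, 1]
annotator_b = [0, 0, 1, 2, 2, 2, 0, 1]

print(cohens_kappa(annotator_a, annotator_b))
```

With more than two annotators, or with 32 imbalanced classes, a statistic such as Fleiss' kappa or Krippendorff's alpha may be a better fit; the choice is left open here.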

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, with honest indications of where revisions will be made to strengthen the paper.

  1. Referee: [Dataset Construction] Dataset Construction section: The model-based crowd-sourcing procedure for producing the 32-class fine-grained micro-gesture annotations is presented without any reported validation metrics (inter-annotator agreement, human verification of model outputs, or cross-domain checks). This directly undermines the central claim that MG features improve emotion recognition, because systematic label errors or interview-specific biases would render the five evaluation tasks unreliable.

    Authors: We agree that explicit validation metrics are essential to substantiate the annotation quality and support downstream claims. The current manuscript describes the model-based crowd-sourcing strategy but does not report inter-annotator agreement, human verification rates, or cross-domain checks. In the revised version we will add a new paragraph in the Dataset Construction section that includes these metrics, computed on a held-out verification subset, along with details on how model-proposed labels were cross-checked by multiple human annotators. This addition will directly address concerns about label reliability and bias. revision: yes

  2. Referee: [Evaluation] Evaluation section: The abstract and results claim that 'systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding,' yet no quantitative numbers, error bars, data-split details, or baseline comparisons appear in the provided summary or abstract. Without these, the magnitude and robustness of the claimed gains cannot be assessed.

    Authors: The full Evaluation section of the manuscript reports quantitative results for all five tasks, including accuracy metrics, baseline comparisons, and data-split protocols. However, these specifics are not summarized with numbers in the abstract. We will revise the abstract to include key quantitative findings (e.g., the absolute and relative improvements in emotion recognition accuracy when incorporating micro-gesture features) and will ensure that error bars and split details are explicitly referenced or tabulated in the results for immediate assessment of robustness. revision: yes
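
One way to attach error bars to an accuracy improvement of the kind discussed in the second response is a paired bootstrap over the test items. The sketch below is a generic illustration, not the authors' protocol: the function name is hypothetical and the correctness vectors are synthetic placeholders, not results from the paper.

```python
import random

def paired_bootstrap_delta(correct_base, correct_new, n_resamples=2000, seed=0):
    """Point estimate and 95% percentile interval for the accuracy difference
    (new minus baseline) when both models are scored on the same test items."""
    if len(correct_base) != len(correct_new) or not correct_base:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_base)
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_base = sum(correct_base[i] for i in idx) / n
        acc_new = sum(correct_new[i] for i in idx) / n
        deltas.append(acc_new - acc_base)
    deltas.sort()
    lo = deltas[int(0.025 * n_resamples)]
    hi = deltas[int(0.975 * n_resamples) - 1]
    point = (sum(correct_new) - sum(correct_base)) / n
    return point, lo, hi

# Synthetic per-item correctness (1 = correct), NOT results from the paper.
correct_base = [1] * 60 + [0] * 40   # e.g. a face-and-speech baseline
correct_new = [1] * 70 + [0] * 30    # e.g. the same model plus gesture features

point, lo, hi = paired_bootstrap_delta(correct_base, correct_new)
print(point, lo, hi)
```

An interval that excludes zero would support the claim of improvement on that split; it says nothing about generalization beyond the interview domain.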

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces a new large-scale dataset iMiGUE-3K via model-based crowd-sourcing from tennis press interviews and applies standard self-supervised learning to train MG-FMs. It then reports empirical results on five evaluation tasks (MG recognition in unsupervised/semi-supervised/supervised settings, retrieval, and emotion recognition). No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or self-citation chains within the paper; the performance gains are demonstrated through external evaluations on the newly collected benchmark rather than definitional equivalences.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard computer-vision assumptions about annotation quality and representation transferability; no explicit free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5863 in / 1189 out tokens · 61037 ms · 2026-05-20T14:06:44.404394+00:00 · methodology

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · 2 internal anchors

  1. [1]

    Multi-pie,

    R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010

  2. [2]

    Comprehensive database for facial expression analysis,

    T. Kanade, J. F. Cohn, and Y. Tian, “Comprehensive database for facial expression analysis,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2000, pp. 46–53

  3. [3]

    Web-based database for facial expression analysis,

    M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in Proceedings of the IEEE International Conference on Multimedia and Expo. IEEE, 2005, pp. 317–321

  4. [4]

    Induced disgust, happiness and surprise: an addition to the mmi facial expression database,

    M. Valstar and M. Pantic, “Induced disgust, happiness and surprise: an addition to the mmi facial expression database,” in Proceedings of the International Conference on Language Resources and Evaluation, Workshop EMOTION. Paris, France, 2010, pp. 65–70

  5. [5]

    A 3d facial expression database for facial behavior research,

    L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3d facial expression database for facial behavior research,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2006, pp. 211–216

  6. [6]

    A high-resolution spontaneous 3d dynamic facial expression database,

    X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, and P. Liu, “A high-resolution spontaneous 3d dynamic facial expression database,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2013, pp. 1–6

  7. [7]

    The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal emopain dataset,

    M. S. Aung, S. Kaltwang, B. Romera-Paredes, B. Martinez, A. Singh, M. Cella, M. Valstar, H. Meng, A. Kemp, and M. Shafizadeh, “The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal emopain dataset,” IEEE Trans. Affect. Comput., vol. 7, no. 4, pp. 435–451, 2015

  8. [8]

    Fully automatic facial action recognition in spontaneous behavior,

    M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, “Fully automatic facial action recognition in spontaneous behavior,” in Proc. IEEE Int. Conf. Auto. Face Gesture Recognit. IEEE, 2006, pp. 223–230

  9. [9]

    From individual to group-level emotion recognition: Emotiw 5.0,

    A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey, and T. Gedeon, “From individual to group-level emotion recognition: Emotiw 5.0,” in Proc. ACM Int. Conf. Multimodal Interaction, 2017, pp. 524–528

  10. [10]

    Painful data: The unbc-mcmaster shoulder pain expression archive database,

    P. Lucey, J. F. Cohn, K. M. Prkachin, P. E. Solomon, and I. Matthews, “Painful data: The unbc-mcmaster shoulder pain expression archive database,” Proc. IEEE Int. Conf. Auto. Face Gesture Recognit., pp. 57–64, 2011

  11. [11]

    Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond,

    D. Kollias, P. Tzirakis, M. A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, and S. Zafeiriou, “Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond,” Int. J. Comput. Vision, vol. 127, no. 6-7, pp. 907–929, 2019

  12. [12]

    The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent,

    G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder, “The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent,” IEEE Trans. Affect. Comput., vol. 3, no. 1, pp. 5–17, 2011

  13. [13]

    Casme ii: An improved spontaneous micro-expression database and the baseline evaluation,

    W.-J. Yan, X. Li, S.-J. Wang, G. Zhao, Y.-J. Liu, Y.-H. Chen, and X. Fu, “Casme ii: An improved spontaneous micro-expression database and the baseline evaluation,” PloS one, vol. 9, no. 1, p. e86041, 2014

  14. [14]

    A spontaneous micro-expression database: Inducement, collection and baseline,

    X. Li, T. Pfister, X. Huang, G. Zhao, and M. Pietikäinen, “A spontaneous micro-expression database: Inducement, collection and baseline,” in 2013 10th IEEE International Conference and Workshops on Automatic face and gesture recognition (fg). IEEE, 2013, pp. 1–6

  15. [15]

    Samm: A spontaneous micro-facial movement dataset,

    A. K. Davison, C. Lansley, N. Costen, K. Tan, and M. H. Yap, “Samm: A spontaneous micro-facial movement dataset,” IEEE transactions on affective computing, vol. 9, no. 1, pp. 116–129, 2016

  16. [16]

    Avec 2011–the first international audio/visual emotion challenge,

    B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, and M. Pantic, “Avec 2011–the first international audio/visual emotion challenge,” in Int. Conf. Affect. Comput. Intell. Interact. Springer, 2011, pp. 415–424

  17. [17]

    Avec 2012: The continuous audio/visual emotion challenge,

    B. Schuller, M. Valstar, F. Eyben, R. Cowie, and M. Pantic, “Avec 2012: The continuous audio/visual emotion challenge,” in Proc. ACM Int. Conf. Multimodal Interact., 2012, pp. 449–456

  18. [18]

    Introducing the recola multimodal corpus of remote collaborative and affective interactions,

    F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introducing the recola multimodal corpus of remote collaborative and affective interactions,” in Proc. IEEE Int. Conf. Auto. Face Gesture Recognit. IEEE, 2013, pp. 1–8

  19. [19]

    Panoptic studio: A massively multi-view system for social motion capture,

    H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh, “Panoptic studio: A massively multi-view system for social motion capture,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3334–3342

  20. [20]

    Towards social artificial intelligence: Nonverbal social signal prediction in a triadic interaction,

    H. Joo, T. Simon, M. Cikara, and Y. Sheikh, “Towards social artificial intelligence: Nonverbal social signal prediction in a triadic interaction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 873–10 883

  21. [21]

    A multimodal database for affect recognition and implicit tagging,

    M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic, “A multimodal database for affect recognition and implicit tagging,” IEEE transactions on affective computing, vol. 3, no. 1, pp. 42–55, 2011

  22. [22]

    Deap: A database for emotion analysis; using physiological signals,

    K. Sander, M. Christian, M. Soleymani, J.-S. Lee, Y. Ashkan, E. Touradj, P. Thierry, N. Anton, and P. Ioannis, “Deap: A database for emotion analysis; using physiological signals,” IEEE Trans. Affect. Comput., vol. 3, no. 1, pp. 18–31, 2011

  23. [23]

    Telling lies: Clues to deceit in the marketplace, politics, and marriage (revised edition)

    P. Ekman, Telling lies: Clues to deceit in the marketplace, politics, and marriage (revised edition). WW Norton & Company, 2009

  24. [24]

    Body cues, not facial expressions, discriminate between intense positive and negative emotions,

    H. Aviezer, Y. Trope, and A. Todorov, “Body cues, not facial expressions, discriminate between intense positive and negative emotions,” Science, vol. 338, no. 6111, pp. 1225–1229, 2012

  25. [25]

    R. E. Axtell, Gestures: the do’s and taboos of body language around the world, 1991

  26. [26]

    J. K. Burgoon, D. B. Buller, and W. G. Woodall, Nonverbal communication: The unspoken dialogue. Harpercollins College Division, 1989

  27. [27]

    Hand gesture recognition: a literature review,

    R. Z. Khan and N. A. Ibraheem, “Hand gesture recognition: a literature review,” International Journal of Artificial Intelligence & Applications, vol. 3, no. 4, p. 161, 2012

  28. [28]

    Survey on emotional body gesture recognition,

    F. Noroozi, D. Kaminska, C. Corneanu, T. Sapinski, S. Escalera, and G. Anbarjafari, “Survey on emotional body gesture recognition,” IEEE Transactions on Affective Computing, 2018

  29. [29]

    Recognizing emotions expressed by body pose: A biologically inspired neural model,

    K. Schindler, L. Van Gool, and B. De Gelder, “Recognizing emotions expressed by body pose: A biologically inspired neural model,” Neural networks, vol. 21, no. 9, pp. 1238–1246, 2008

  30. [30]

    A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior,

    H. Gunes and M. Piccardi, “A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior,” in 18th International conference on pattern recognition (ICPR’06), vol. 1. IEEE, 2006, pp. 1148–1153

  31. [31]

    Recognising human emotions from body movement and gesture dynamics,

    G. Castellano, S. D. Villalba, and A. Camurri, “Recognising human emotions from body movement and gesture dynamics,” in International conference on affective computing and intelligent interaction. Springer, 2007, pp. 71–82

  32. [32]

    The humaine database: Addressing the collection and annotation of naturalistic and induced emotional data,

    E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. Mcrorie, J.-C. Martin, L. Devillers, S. Abrilian, A. Batliner et al., “The humaine database: Addressing the collection and annotation of naturalistic and induced emotional data,” in International conference on affective computing and intelligent interaction. Springer, 2007, pp. 488–500

  33. [33]

    Technique for automatic emotion recognition by body gesture analysis,

    D. Glowinski, A. Camurri, G. Volpe, N. Dael, and K. Scherer, “Technique for automatic emotion recognition by body gesture analysis,” in 2008 IEEE Computer society conference on computer vision and pattern recognition workshops. IEEE, 2008, pp. 1–6

  34. [34]

    Liris-accede: A video database for affective content analysis,

    Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, “Liris-accede: A video database for affective content analysis,” IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 43–55, 2015

  35. [35]

    A study on emotion recognition from body gestures using kinect sensor,

    S. Saha, S. Datta, A. Konar, and R. Janarthanan, “A study on emotion recognition from body gestures using kinect sensor,” in 2014 international conference on communication and signal processing. IEEE, 2014, pp. 056–060

  36. [36]

    Multimodal emotion recognition using deep learning architectures,

    H. Ranganathan, S. Chakraborty, and S. Panchanathan, “Multimodal emotion recognition using deep learning architectures,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2016, pp. 1–9

  37. [37]

    Emilya: Emotional body expression in daily actions database

    N. Fourati and C. Pelachaud, “Emilya: Emotional body expression in daily actions database.” in LREC, 2014, pp. 3486–3493

  38. [38]

    Gesture and emotion: Can basic gestural form features discriminate emotions?

    M. Kipp and J.-C. Martin, “Gesture and emotion: Can basic gestural form features discriminate emotions?” in 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops. IEEE, 2009, pp. 1–8

  39. [39]

    Arbee: Towards automated recognition of bodily expression of emotion in the wild,

    Y. Luo, J. Ye, R. B. Adams, J. Li, M. G. Newman, and J. Z. Wang, “Arbee: Towards automated recognition of bodily expression of emotion in the wild,” International Journal of Computer Vision, vol. 128, no. 1, pp. 1–25, 2020

  40. [40]

    Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning,

    H. Chen, X. Liu, X. Li, H. Shi, and G. Zhao, “Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning,” in 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE, 2019, pp. 1–8

  41. [41]

    Face and body gesture recognition for a vision-based multimodal analyser,

    H. Gunes, M. Piccardi, and T. Jan, “Face and body gesture recognition for a vision-based multimodal analyser,” in Pan-Sydney Area Workshop on Visual Information Processing. ACS, 2004

  42. [42]

    Affect recognition from face and body: early fusion vs. late fusion,

    H. Gunes and M. Piccardi, “Affect recognition from face and body: early fusion vs. late fusion,” in 2005 IEEE International Conference on Systems, Man and Cybernetics, vol. 4. IEEE, 2005, pp. 3437–3443

  43. [43]

    Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages,

    A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency, “Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages,” IEEE Intelligent Systems, vol. 31, no. 6, pp. 82–88, 2016

  44. [44]

    Sentiment analysis and topic recognition in video transcriptions,

    L. Stappen, A. Baird, E. Cambria, and B. W. Schuller, “Sentiment analysis and topic recognition in video transcriptions,” IEEE Intelligent Systems, vol. 36, no. 2, pp. 88–95, 2021

  45. [45]

    imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis,

    X. Liu, H. Shi, H. Chen, Z. Yu, X. Li, and G. Zhao, “imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 631–10 642

  46. [46]

    3d convolutional neural networks for human action recognition,

    S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221–231, 2012

  47. [47]

    Hierarchical recurrent neural network for skeleton based action recognition,

    Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in CVPR, 2015

  48. [48]

    Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?

    K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555

  49. [49]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

  50. [50]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift,

    S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning. PMLR, 2015, pp. 448–456

  51. [51]

    Quo vadis, action recognition? a new model and the kinetics dataset,

    J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308

  52. [52]

    Slowfast networks for video recognition,

    C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in Proceedings of the IEEE international conference on computer vision, 2019, pp. 6202–6211

  53. [53]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

  54. [54]

    Vivit: A video vision transformer,

    A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “Vivit: A video vision transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836–6846

  55. [55]

    Video swin transformer,

    Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3202–3211

  56. [56]

    Foundation models defining a new era in vision: a survey and outlook,

    M. Awais, M. Naseer, S. Khan, R. M. Anwer, H. Cholakkal, M. Shah, M.-H. Yang, and F. S. Khan, “Foundation models defining a new era in vision: a survey and outlook,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2245–2264, 2025

  57. [57]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022

  58. [58]

    Momentum contrast for unsupervised visual representation learning,

    K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738

  59. [59]

    A simple framework for contrastive learning of visual representations,

    T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607

  60. [60]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009

  61. [61]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

  62. [62]

    Scaling up visual and vision-language representation learning with noisy text supervision,

    C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 4904–4916

  63. [63]

    Image as a foreign language: Beit pretraining for vision and vision-language tasks,

    W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som et al., “Image as a foreign language: Beit pretraining for vision and vision-language tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 175–19 186

  64. [64]

    Sigmoid loss for language image pre-training,

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11 975–11 986

  65. [65]

    Florence-2: Advancing a unified representation for a variety of vision tasks,

    B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan, “Florence-2: Advancing a unified representation for a variety of vision tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818–4829

  66. [66]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 24 185–24 198

  67. [67]

    Grounded language-image pre-training,

    L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 965–10 975

  68. [68]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

    S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” in European Conference on Computer Vision. Springer, 2024, pp. 38–55

  69. [69]

    Sam 2: Segment anything in images and videos,

    N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson et al., “Sam 2: Segment anything in images and videos,” in International Conference on Learning Representations, vol. 2025, 2025, pp. 28 085–28 128

  70. [70]

    SAM 3: Segment Anything with Concepts

    N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang et al., “Sam 3: Segment anything with concepts,” arXiv preprint arXiv:2511.16719, 2025

  71. [71]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020

  72. [72]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 684–10 695

  73. [73]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training,

    Z. Tong, Y. Song, J. Wang, and L. Wang, “Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training,” Advances in Neural Information Processing Systems, vol. 35, pp. 10 078–10 093, 2022

  74. [74]

    Videomae v2: Scaling video masked autoencoders with dual masking,

    L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao, “Videomae v2: Scaling video masked autoencoders with dual masking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 549–14 560

  75. [75]

    Internvideo2: Scaling foundation models for multimodal video understanding,

    Y. Wang, K. Li, X. Li, J. Yu, Y. He, G. Chen, B. Pei, R. Zheng, Z. Wang, Y. Shi et al., “Internvideo2: Scaling foundation models for multimodal video understanding,” in European Conference on Computer Vision. Springer, 2024, pp. 396–416

  76. [76]

    Motionbert: A unified perspective on learning human motion representations,

    W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, and Y. Wang, “Motionbert: A unified perspective on learning human motion representations,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 085–15 099

  77. [77]

    Skeletonmae: graph-based masked autoencoder for skeleton sequence pre-training,

    H. Yan, Y. Liu, Y. Wei, Z. Li, G. Li, and L. Lin, “Skeletonmae: graph-based masked autoencoder for skeleton sequence pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5606–5618

  78. [78]

    Masked motion predictors are strong 3d action representation learners,

    Y. Mao, J. Deng, W. Zhou, Y. Fang, W. Ouyang, and H. Li, “Masked motion predictors are strong 3d action representation learners,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 181–10 191

  79. [79]

    Unified multi-modal unsupervised representation learning for skeleton-based action understanding,

    S. Sun, D. Liu, J. Dong, X. Qu, J. Gao, X. Yang, X. Wang, and M. Wang, “Unified multi-modal unsupervised representation learning for skeleton-based action understanding,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2973–2984

  80. [80]

    Skeleton-in-context: Unified skeleton sequence modeling with in-context learning,

    X. Wang, Z. Fang, X. Li, X. Li, C. Chen, and M. Liu, “Skeleton-in-context: Unified skeleton sequence modeling with in-context learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2436–2446

Showing first 80 references.