iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning

Chengyan Wang; Guoying Zhao; Haoyu Chen; Hui Wei; Yueyi Yang; Yunquan Chen

arxiv: 2605.17179 · v1 · pith:7JNVWKRQnew · submitted 2026-05-16 · 💻 cs.CV

Pith reviewed 2026-05-20 14:06 UTC · model grok-4.3

keywords: micro-gesture · emotion understanding · video dataset · benchmark · foundation model · affective computing · body language · self-supervised learning

The pith

A dataset of micro-gestures from tennis interviews shows that body language cues substantially improve automated emotion understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents iMiGUE-3K, the largest video dataset for micro-gesture analysis, built from more than 3,400 clips and 37 million frames of professional tennis players in public interviews. It annotates 32 classes of subtle, unintentional movements using a model-assisted collection process and introduces MG-FMs as a foundation model for learning gesture representations. Evaluations across recognition, retrieval, and emotion tasks indicate that micro-gesture signals add measurable value beyond facial or speech cues alone. A sympathetic reader would care because current affective systems largely ignore these subconscious body signals that reflect inner emotional states.

Core claim

By releasing iMiGUE-3K and MG-FMs, the work establishes that micro-gesture analysis significantly improves emotion understanding, as shown through systematic testing of representative methods on five tasks including unsupervised, semi-supervised, and supervised micro-gesture recognition plus retrieval and emotion recognition.

What carries the argument

The iMiGUE-3K dataset of 32 annotated micro-gesture classes, collected via model-based crowd-sourcing from in-the-wild tennis interview videos, together with MG-FMs, a discriminative foundation model for transferable gesture representation learning.

If this is right

Micro-gesture recognition becomes feasible in unsupervised, semi-supervised, and supervised regimes on a shared benchmark.
The foundation model supports transferable representations for gesture retrieval and related downstream tasks.
Emotion recognition systems gain a new modality that captures subconscious cues not visible in faces or speech.
The benchmark enables progress toward applications in psychological diagnostics and human-computer interaction.
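
As a rough illustration of how a retrieval task on such a benchmark could be scored, the sketch below ranks gallery embeddings by cosine similarity and reports Recall@k. The embeddings, the label names, and the `recall_at_k` helper are hypothetical, and the paper's actual retrieval protocol and metric are not stated in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(queries, gallery, k):
    """Fraction of queries whose top-k gallery neighbours contain a same-label item.

    `queries` and `gallery` are lists of (embedding, label) pairs.
    """
    hits = 0
    for q_vec, q_label in queries:
        ranked = sorted(gallery, key=lambda item: cosine(q_vec, item[0]), reverse=True)
        if any(label == q_label for _, label in ranked[:k]):
            hits += 1
    return hits / len(queries)


# Toy 2-D embeddings with made-up gesture labels (illustrative only).
gallery = [
    ([1.0, 0.0], "shrug"),
    ([0.9, 0.1], "shrug"),
    ([0.0, 1.0], "fold_arms"),
    ([0.1, 0.9], "fold_arms"),
]
queries = [
    ([0.8, 0.2], "shrug"),
    ([0.2, 0.8], "fold_arms"),
    ([0.7, 0.3], "fold_arms"),  # embedded near the wrong class on purpose
]

r1 = recall_at_k(queries, gallery, k=1)
r3 = recall_at_k(queries, gallery, k=3)
print(f"Recall@1 = {r1:.3f}, Recall@3 = {r3:.3f}")
```

In practice the embeddings would come from a pre-trained encoder rather than hand-written vectors, and a vectorised nearest-neighbour search would replace the per-query sort.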

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same collection approach could scale to other high-stakes public settings to gather more diverse natural micro-gestures.
Combining micro-gesture signals with existing modalities may increase robustness of emotion models in noisy real-world environments.
Self-supervised pre-training on this volume of data could yield representations useful for broader body-language understanding tasks.

Load-bearing premise

The model-based crowd-sourcing strategy produces accurate, unbiased fine-grained annotations for the 32 micro-gesture classes that generalize beyond tennis press interviews.
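
One way this premise could be probed, assuming a subset of clips has both a model-proposed label and an independent human label, is a chance-corrected agreement score such as Cohen's kappa. The sketch below is a minimal pure-Python version. The label names and the toy data are invented for illustration and are not taken from the dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each source's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both sources used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verification subset: model proposals vs. human re-annotation.
model_labels = ["touch_face", "fold_arms", "touch_face", "shrug", "fold_arms", "shrug"]
human_labels = ["touch_face", "fold_arms", "shrug", "shrug", "fold_arms", "touch_face"]

kappa = cohens_kappa(model_labels, human_labels)
print(f"kappa = {kappa:.2f}")
```

With more than two annotators per clip, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the natural replacement.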

What would settle it

An independent test set of everyday human interactions where models trained on iMiGUE-3K show no accuracy gain in emotion recognition compared with face-and-speech baselines.
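
A test of this kind needs an uncertainty estimate on the accuracy difference, not just two point accuracies. The sketch below shows one common option, a paired bootstrap over test clips. The per-clip correctness flags are invented for illustration and are not results from the paper, and `paired_bootstrap_gain` is a hypothetical helper name.

```python
import random


def paired_bootstrap_gain(correct_baseline, correct_fused, n_resamples=2000, seed=0):
    """Return (gain, low, high): the accuracy gain and an approximate 95% bootstrap interval.

    Both inputs are per-clip correctness flags (1 = correct, 0 = wrong) on the same clips.
    """
    if len(correct_baseline) != len(correct_fused) or not correct_baseline:
        raise ValueError("need two non-empty flag sequences of equal length")
    rng = random.Random(seed)
    n = len(correct_baseline)
    # Per-clip difference keeps the pairing between the two systems.
    diffs = [f - b for b, f in zip(correct_baseline, correct_fused)]
    gain = sum(diffs) / n
    resampled = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    low = resampled[int(0.025 * n_resamples)]
    high = resampled[int(0.975 * n_resamples) - 1]
    return gain, low, high


# Hypothetical flags: face-and-speech baseline vs. the same model with micro-gesture features.
baseline = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0] * 10
fused = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1] * 10

gain, low, high = paired_bootstrap_gain(baseline, fused)
print(f"gain = {gain:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

If the interval on an out-of-domain test set comfortably includes zero, that would be the "no accuracy gain" outcome described above.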

Figures

Figures reproduced from arXiv: 2605.17179 by Chengyan Wang, Guoying Zhao, Haoyu Chen, Hui Wei, Yueyi Yang, Yunquan Chen.

**Figure 2.** Overview of the annotation schema and category distribution in iMiGUE-3K. (a) The 32 annotation labels are organized into five body-part [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

**Figure 3.** Statistics of the iMiGUE-3K dataset. (a) Video-Level Distribution by Subject Region in iMiGUE-3K. (b) Duration distribution of the videos [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** Comparison between duration distributions of iMiGUE [45] and [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

Original abstract

Emotion understanding is a fundamental challenge in affective computing and artificial intelligence. While existing approaches predominantly focus on facial expressions and speech, they often overlook the rich emotional cues conveyed through body language. Recently, micro-gestures (MGs), unintentional, subconscious movements driven by inner feelings, have attracted increasing attention as an alternative to other cues. However, there are no existing large-scale datasets supporting the pre-training of the MG foundation model. To advance MG research, we present a new benchmark for micro-gesture-based emotion understanding, featuring key contributions with a novel dataset (iMiGUE-3K) and a series of foundation models for different tasks. Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K, the largest MG dataset to date. It comprises video recordings from 332 distinct professional tennis players' public press interviews over the past seven years, totaling more than 3.4K long video clips and 37 million frames. The dataset includes 32 micro-gesture classes with rich descriptive annotations, making it the first large-scale, in-the-wild, video dataset for fine-grained gesture-based emotion analysis. Built on iMiGUE-3K, we propose MG-FMs, a discriminative foundation model for transferable gesture presentation learning. Based on the foundation model, we establish five comprehensive evaluation tasks: MG recognition (unsupervised, semi-supervised, supervised), MG retrieval, and MG emotion recognition. Our systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding. We hope this work can provide comprehensive tools for MG analysis and set a solid foundation for future research in psychological diagnostics, affective computing, and advanced human-computer interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's real contribution is a large new in-the-wild micro-gesture dataset from tennis interviews, but missing annotation validation leaves the emotion improvement claims on shaky ground.

Hi colleague, the one or two things to know are that this work releases iMiGUE-3K, a dataset of 3.4K video clips and 37 million frames from 332 tennis players' press interviews with 32 micro-gesture classes, and that it applies standard self-supervised learning to create MG-FMs for five benchmark tasks including emotion recognition. That scale and the model-based crowd-sourcing collection method are the concrete advances over prior smaller or more controlled MG resources.

The paper does a reasonable job framing the gap in affective computing and showing how body gestures could add to face and speech cues, with the tasks laid out in a straightforward way that lets others build on the data.

The soft spots sit mainly in the label quality. The model-based crowd-sourcing is efficient for volume, yet the description gives no inter-annotator agreement numbers, no human spot-checks on the automatic outputs, and no tests for whether the gestures generalize beyond the specific interview setting of professional athletes. Without those, the claim that micro-gesture features significantly improve emotion understanding rests on an unverified assumption and could reflect context-specific mannerisms instead. The quantitative results are mentioned but not detailed enough in the abstract to judge effect sizes or robustness.

This paper is for researchers in affective computing and body-language analysis who need more training data for gesture models or want to test new modalities in HCI or diagnostics. A reader planning to use the dataset for pre-training or to run their own experiments would get practical value from the size and the task definitions. It deserves a serious referee because the dataset itself is large enough and novel enough to matter for the field, even if the experiments need more scrutiny on validation and generalization. I would send it for peer review with a clear request for annotation reliability metrics and some cross-context checks.

Referee Report

2 major / 2 minor

Summary. The paper introduces iMiGUE-3K, the largest in-the-wild video dataset for micro-gesture analysis, constructed from 3.4K clips of 332 professional tennis players' press interviews using a model-based crowd-sourcing strategy to annotate 32 micro-gesture classes. It proposes MG-FMs, a discriminative foundation model for transferable gesture representation learning via self-supervised methods, and evaluates it across five tasks: unsupervised/semi-supervised/supervised MG recognition, MG retrieval, and MG emotion recognition, claiming that micro-gesture analysis significantly improves emotion understanding over existing approaches focused on facial expressions and speech.

Significance. If the annotations prove reliable and the reported gains generalize, this benchmark and the associated foundation models would address a clear gap in affective computing by enabling scalable study of subconscious body-language cues for emotion recognition. The scale (37M frames), multi-task evaluation protocol, and self-supervised pre-training approach represent concrete strengths that could support reproducible progress in psychological diagnostics and HCI.

major comments (2)

[Dataset Construction] Dataset Construction section: The model-based crowd-sourcing procedure for producing the 32-class fine-grained micro-gesture annotations is presented without any reported validation metrics (inter-annotator agreement, human verification of model outputs, or cross-domain checks). This directly undermines the central claim that MG features improve emotion recognition, because systematic label errors or interview-specific biases would render the five evaluation tasks unreliable.
[Evaluation] Evaluation section: The abstract and results claim that 'systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding,' yet no quantitative numbers, error bars, data-split details, or baseline comparisons appear in the provided summary or abstract. Without these, the magnitude and robustness of the claimed gains cannot be assessed.

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta on the MG emotion recognition task) to support the improvement claim.
A comparison table of iMiGUE-3K statistics against prior micro-gesture or gesture datasets is missing and would help readers gauge the scale advantage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, with honest indications of where revisions will be made to strengthen the paper.

Referee: [Dataset Construction] Dataset Construction section: The model-based crowd-sourcing procedure for producing the 32-class fine-grained micro-gesture annotations is presented without any reported validation metrics (inter-annotator agreement, human verification of model outputs, or cross-domain checks). This directly undermines the central claim that MG features improve emotion recognition, because systematic label errors or interview-specific biases would render the five evaluation tasks unreliable.

Authors: We agree that explicit validation metrics are essential to substantiate the annotation quality and support downstream claims. The current manuscript describes the model-based crowd-sourcing strategy but does not report inter-annotator agreement, human verification rates, or cross-domain checks. In the revised version we will add a new paragraph in the Dataset Construction section that includes these metrics, computed on a held-out verification subset, along with details on how model-proposed labels were cross-checked by multiple human annotators. This addition will directly address concerns about label reliability and bias.
revision: yes

Referee: [Evaluation] Evaluation section: The abstract and results claim that 'systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding,' yet no quantitative numbers, error bars, data-split details, or baseline comparisons appear in the provided summary or abstract. Without these, the magnitude and robustness of the claimed gains cannot be assessed.

Authors: The full Evaluation section of the manuscript reports quantitative results for all five tasks, including accuracy metrics, baseline comparisons, and data-split protocols. However, these specifics are not summarized with numbers in the abstract. We will revise the abstract to include key quantitative findings (e.g., the absolute and relative improvements in emotion recognition accuracy when incorporating micro-gesture features) and will ensure that error bars and split details are explicitly referenced or tabulated in the results for immediate assessment of robustness.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces a new large-scale dataset iMiGUE-3K via model-based crowd-sourcing from tennis press interviews and applies standard self-supervised learning to train MG-FMs. It then reports empirical results on five evaluation tasks (MG recognition in unsupervised/semi-supervised/supervised settings, retrieval, and emotion recognition). No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or self-citation chains within the paper; the performance gains are demonstrated through external evaluations on the newly collected benchmark rather than definitional equivalences.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard computer-vision assumptions about annotation quality and representation transferability; no explicit free parameters, axioms, or invented entities are described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Passage: "Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K... 32 micro-gesture classes... MG-FMs... five comprehensive evaluation tasks: MG recognition... MG emotion recognition."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Passage: "MG-FM-Skele... dual-stream Dense Spatio-Temporal Encoder... multi-grained feature decorrelation objective... $L_{\mathrm{fd}}(H_r) = \lambda_{\mathrm{sim}} L_{\mathrm{sim}} + \lambda_{\mathrm{vac}} L_{\mathrm{vac}} + \lambda_{\mathrm{xcorr}} L_{\mathrm{xcorr}}$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · 2 internal anchors

[1] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010
[2] T. Kanade, J. F. Cohn, and Y. Tian, “Comprehensive database for facial expression analysis,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2000, pp. 46–53
[3] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in Proceedings of the IEEE International Conference on Multimedia and Expo. IEEE, 2005, pp. 317–321
[4] M. Valstar and M. Pantic, “Induced disgust, happiness and surprise: an addition to the mmi facial expression database,” in Proceedings of the International Conference on Language Resources and Evaluation, Workshop EMOTION. Paris, France, 2010, pp. 65–70
[5] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3d facial expression database for facial behavior research,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2006, pp. 211–216
[6] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, and P. Liu, “A high-resolution spontaneous 3d dynamic facial expression database,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2013, pp. 1–6
[7] M. S. Aung, S. Kaltwang, B. Romera-Paredes, B. Martinez, A. Singh, M. Cella, M. Valstar, H. Meng, A. Kemp, and M. Shafizadeh, “The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal emopain dataset,” IEEE Trans. Affect. Comput., vol. 7, no. 4, pp. 435–451, 2015
[8] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, “Fully automatic facial action recognition in spontaneous behavior,” in Proc. IEEE Int. Conf. Auto. Face Gesture Recognit. IEEE, 2006, pp. 223–230
[9] A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey, and T. Gedeon, “From individual to group-level emotion recognition: Emotiw 5.0,” in Proc. ACM Int. Conf. Multimodal Interaction, 2017, pp. 524–528
[10] P. Lucey, J. F. Cohn, K. M. Prkachin, P. E. Solomon, and I. Matthews, “Painful data: The unbc-mcmaster shoulder pain expression archive database,” Proc. IEEE Int. Conf. Auto. Face Gesture Recognit., pp. 57–64, 2011
[11] D. Kollias, P. Tzirakis, M. A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, and S. Zafeiriou, “Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond,” Int. J. Comput. Vision, vol. 127, no. 6-7, pp. 907–929, 2019
[12] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder, “The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent,” IEEE Trans. Affect. Comput., vol. 3, no. 1, pp. 5–17, 2011
[13] W.-J. Yan, X. Li, S.-J. Wang, G. Zhao, Y.-J. Liu, Y.-H. Chen, and X. Fu, “Casme ii: An improved spontaneous micro-expression database and the baseline evaluation,” PloS one, vol. 9, no. 1, p. e86041, 2014
[14] X. Li, T. Pfister, X. Huang, G. Zhao, and M. Pietikäinen, “A spontaneous micro-expression database: Inducement, collection and baseline,” in 2013 10th IEEE International Conference and Workshops on Automatic face and gesture recognition (fg). IEEE, 2013, pp. 1–6
[15] A. K. Davison, C. Lansley, N. Costen, K. Tan, and M. H. Yap, “Samm: A spontaneous micro-facial movement dataset,” IEEE transactions on affective computing, vol. 9, no. 1, pp. 116–129, 2016
[16] B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, and M. Pantic, “Avec 2011–the first international audio/visual emotion challenge,” in Int. Conf. Affect. Comput. Intell. Interact. Springer, 2011, pp. 415–424
[17] B. Schuller, M. Valster, F. Eyben, R. Cowie, and M. Pantic, “Avec 2012: The continuous audio/visual emotion challenge,” in Proc. ACM Int. Conf. Multimodal Interact., 2012, pp. 449–456
[18] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introducing the recola multimodal corpus of remote collaborative and affective interactions,” in Proc. IEEE Int. Conf. Auto. Face Gesture Recognit. IEEE, 2013, pp. 1–8
[19] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh, “Panoptic studio: A massively multi-view system for social motion capture,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3334–3342
[20] H. Joo, T. Simon, M. Cikara, and Y. Sheikh, “Towards social artificial intelligence: Nonverbal social signal prediction in a triadic interaction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 873–10 883
[21] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic, “A multimodal database for affect recognition and implicit tagging,” IEEE transactions on affective computing, vol. 3, no. 1, pp. 42–55, 2011
[22] K. Sander, M. Christian, M. Soleymani, J.-S. Lee, Y. Ashkan, E. Touradj, P. Thierry, N. Anton, and P. Ioannis, “Deap: A database for emotion analysis; using physiological signals,” IEEE Trans. Affect. Comput., vol. 3, no. 1, pp. 18–31, 2011
[23] P. Ekman, Telling lies: Clues to deceit in the marketplace, politics, and marriage (revised edition). WW Norton & Company, 2009
[24] H. Aviezer, Y. Trope, and A. Todorov, “Body cues, not facial expressions, discriminate between intense positive and negative emotions,” Science, vol. 338, no. 6111, pp. 1225–1229, 2012
[25] R. E. Axtell, Gestures: the do’s and taboos of body language around the world, 1991
[26] J. K. Burgoon, D. B. Buller, and W. G. Woodall, Nonverbal communication: The unspoken dialogue. Harpercollins College Division, 1989
[27] R. Z. Khan and N. A. Ibraheem, “Hand gesture recognition: a literature review,” International Journal of Artificial Intelligence & Applications, vol. 3, no. 4, p. 161, 2012
[28] F. Noroozi, D. Kaminska, C. Corneanu, T. Sapinski, S. Escalera, and G. Anbarjafari, “Survey on emotional body gesture recognition,” IEEE Transactions on Affective Computing, 2018
[29] K. Schindler, L. Van Gool, and B. De Gelder, “Recognizing emotions expressed by body pose: A biologically inspired neural model,” Neural networks, vol. 21, no. 9, pp. 1238–1246, 2008
[30] H. Gunes and M. Piccardi, “A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior,” in 18th International conference on pattern recognition (ICPR’06), vol. 1. IEEE, 2006, pp. 1148–1153
[31] G. Castellano, S. D. Villalba, and A. Camurri, “Recognising human emotions from body movement and gesture dynamics,” in International conference on affective computing and intelligent interaction. Springer, 2007, pp. 71–82
[32] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. Mcrorie, J.-C. Martin, L. Devillers, S. Abrilian, A. Batliner et al., “The humaine database: Addressing the collection and annotation of naturalistic and induced emotional data,” in International conference on affective computing and intelligent interaction. Springer, 2007, pp. 488–500
[33] D. Glowinski, A. Camurri, G. Volpe, N. Dael, and K. Scherer, “Technique for automatic emotion recognition by body gesture analysis,” in 2008 IEEE Computer society conference on computer vision and pattern recognition workshops. IEEE, 2008, pp. 1–6
[34] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, “Liris-accede: A video database for affective content analysis,” IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 43–55, 2015
[35] S. Saha, S. Datta, A. Konar, and R. Janarthanan, “A study on emotion recognition from body gestures using kinect sensor,” in 2014 international conference on communication and signal processing. IEEE, 2014, pp. 056–060
[36] H. Ranganathan, S. Chakraborty, and S. Panchanathan, “Multimodal emotion recognition using deep learning architectures,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2016, pp. 1–9
[37] N. Fourati and C. Pelachaud, “Emilya: Emotional body expression in daily actions database.” in LREC, 2014, pp. 3486–3493
[38] M. Kipp and J.-C. Martin, “Gesture and emotion: Can basic gestural form features discriminate emotions?” in 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops. IEEE, 2009, pp. 1–8
[39] Y. Luo, J. Ye, R. B. Adams, J. Li, M. G. Newman, and J. Z. Wang, “Arbee: Towards automated recognition of bodily expression of emotion in the wild,” International Journal of Computer Vision, vol. 128, no. 1, pp. 1–25, 2020
[40] H. Chen, X. Liu, X. Li, H. Shi, and G. Zhao, “Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning,” in 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE, 2019, pp. 1–8
[41] H. Gunes, M. Piccardi, and T. Jan, “Face and body gesture recognition for a vision-based multimodal analyser,” in Pan-Sydney Area Workshop on Visual Information Processing. ACS, 2004
[42] H. Gunes and M. Piccardi, “Affect recognition from face and body: early fusion vs. late fusion,” in 2005 IEEE International Conference on Systems, Man and Cybernetics, vol. 4. IEEE, 2005, pp. 3437–3443
[43] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency, “Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages,” IEEE Intelligent Systems, vol. 31, no. 6, pp. 82–88, 2016
[44] L. Stappen, A. Baird, E. Cambria, and B. W. Schuller, “Sentiment analysis and topic recognition in video transcriptions,” IEEE Intelligent Systems, vol. 36, no. 2, pp. 88–95, 2021
[45] X. Liu, H. Shi, H. Chen, Z. Yu, X. Li, and G. Zhao, “imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 631–10 642
[46] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221–231, 2012
[47] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in CVPR, 2015
[48] K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555
[49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778
[50] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning. PMLR, 2015, pp. 448–456
[51]

J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[52]

C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6202–6211.
[53]

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[54]

A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “Vivit: A video vision transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836–6846.
[55]

Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3202–3211.
[56]

M. Awais, M. Naseer, S. Khan, R. M. Anwer, H. Cholakkal, M. Shah, M.-H. Yang, and F. S. Khan, “Foundation models defining a new era in vision: a survey and outlook,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2245–2264, 2025.
[57]

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
[58]

K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[59]

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International Conference on Machine Learning. PMLR, 2020, pp. 1597–1607.
[60]

K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 000–16 009.
[61]

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
[62]

C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 4904–4916.
[63]

W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Som et al., “Image as a foreign language: Beit pretraining for vision and vision-language tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 175–19 186.
[64]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11 975–11 986.
[65]

B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan, “Florence-2: Advancing a unified representation for a variety of vision tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818–4829.
[66]

Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 24 185–24 198.
[67]

L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 965–10 975.
[68]

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” in European Conference on Computer Vision. Springer, 2024, pp. 38–55.
[69]

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson et al., “Sam 2: Segment anything in images and videos,” in International Conference on Learning Representations, vol. 2025, 2025, pp. 28 085–28 128.
[70]

N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang et al., “Sam 3: Segment anything with concepts,” arXiv preprint arXiv:2511.16719, 2025.
[71]

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[72]

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 684–10 695.
[73]

Z. Tong, Y. Song, J. Wang, and L. Wang, “Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training,” Advances in Neural Information Processing Systems, vol. 35, pp. 10 078–10 093, 2022.
[74]

L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao, “Videomae v2: Scaling video masked autoencoders with dual masking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 549–14 560.
[75]

Y. Wang, K. Li, X. Li, J. Yu, Y. He, G. Chen, B. Pei, R. Zheng, Z. Wang, Y. Shi et al., “Internvideo2: Scaling foundation models for multimodal video understanding,” in European Conference on Computer Vision. Springer, 2024, pp. 396–416.
[76]

W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, and Y. Wang, “Motionbert: A unified perspective on learning human motion representations,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 085–15 099.
[77]

H. Yan, Y. Liu, Y. Wei, Z. Li, G. Li, and L. Lin, “Skeletonmae: graph-based masked autoencoder for skeleton sequence pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5606–5618.
[78]

Y. Mao, J. Deng, W. Zhou, Y. Fang, W. Ouyang, and H. Li, “Masked motion predictors are strong 3d action representation learners,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 181–10 191.
[79]

S. Sun, D. Liu, J. Dong, X. Qu, J. Gao, X. Yang, X. Wang, and M. Wang, “Unified multi-modal unsupervised representation learning for skeleton-based action understanding,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2973–2984.
[80]

X. Wang, Z. Fang, X. Li, X. Li, C. Chen, and M. Liu, “Skeleton-in-context: Unified skeleton sequence modeling with in-context learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2436–2446.


[1]

R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010.

[2]

T. Kanade, J. F. Cohn, and Y. Tian, “Comprehensive database for facial expression analysis,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2000, pp. 46–53.

[3]

M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in Proceedings of the IEEE International Conference on Multimedia and Expo. IEEE, 2005, pp. 317–321.

[4]

M. Valstar and M. Pantic, “Induced disgust, happiness and surprise: an addition to the mmi facial expression database,” in Proceedings of the International Conference on Language Resources and Evaluation, Workshop EMOTION. Paris, France, 2010, pp. 65–70.

[5]

L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3d facial expression database for facial behavior research,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2006, pp. 211–216.

[6]

X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, and P. Liu, “A high-resolution spontaneous 3d dynamic facial expression database,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2013, pp. 1–6.

[7]

M. S. Aung, S. Kaltwang, B. Romera-Paredes, B. Martinez, A. Singh, M. Cella, M. Valstar, H. Meng, A. Kemp, and M. Shafizadeh, “The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal emopain dataset,” IEEE Trans. Affect. Comput., vol. 7, no. 4, pp. 435–451, 2015.

[8]

M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, “Fully automatic facial action recognition in spontaneous behavior,” in Proc. IEEE Int. Conf. Auto. Face Gesture Recognit. IEEE, 2006, pp. 223–230.

[9]

A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey, and T. Gedeon, “From individual to group-level emotion recognition: Emotiw 5.0,” in Proc. ACM Int. Conf. Multimodal Interaction, 2017, pp. 524–528.

[10]

P. Lucey, J. F. Cohn, K. M. Prkachin, P. E. Solomon, and I. Matthews, “Painful data: The unbc-mcmaster shoulder pain expression archive database,” Proc. IEEE Int. Conf. Auto. Face Gesture Recognit., pp. 57–64, 2011.

[11]

D. Kollias, P. Tzirakis, M. A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, and S. Zafeiriou, “Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond,” Int. J. Comput. Vision, vol. 127, no. 6-7, pp. 907–929, 2019.

[12]

G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder, “The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent,” IEEE Trans. Affect. Comput., vol. 3, no. 1, pp. 5–17, 2011.

[13]

W.-J. Yan, X. Li, S.-J. Wang, G. Zhao, Y.-J. Liu, Y.-H. Chen, and X. Fu, “Casme ii: An improved spontaneous micro-expression database and the baseline evaluation,” PloS one, vol. 9, no. 1, p. e86041, 2014.

[14]

X. Li, T. Pfister, X. Huang, G. Zhao, and M. Pietikäinen, “A spontaneous micro-expression database: Inducement, collection and baseline,” in 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). IEEE, 2013, pp. 1–6.

[15]

A. K. Davison, C. Lansley, N. Costen, K. Tan, and M. H. Yap, “Samm: A spontaneous micro-facial movement dataset,” IEEE Transactions on Affective Computing, vol. 9, no. 1, pp. 116–129, 2016.

[16]

B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, and M. Pantic, “Avec 2011–the first international audio/visual emotion challenge,” in Int. Conf. Affect. Comput. Intell. Interact. Springer, 2011, pp. 415–424.

[17]

B. Schuller, M. Valstar, F. Eyben, R. Cowie, and M. Pantic, “Avec 2012: The continuous audio/visual emotion challenge,” in Proc. ACM Int. Conf. Multimodal Interact., 2012, pp. 449–456.

[18]

F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introducing the recola multimodal corpus of remote collaborative and affective interactions,” in Proc. IEEE Int. Conf. Auto. Face Gesture Recognit. IEEE, 2013, pp. 1–8.

[19]

H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh, “Panoptic studio: A massively multiview system for social motion capture,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3334–3342.

[20]

H. Joo, T. Simon, M. Cikara, and Y. Sheikh, “Towards social artificial intelligence: Nonverbal social signal prediction in a triadic interaction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 873–10 883.

[21]

M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic, “A multimodal database for affect recognition and implicit tagging,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 42–55, 2011.

[22]

S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras, “Deap: A database for emotion analysis; using physiological signals,” IEEE Trans. Affect. Comput., vol. 3, no. 1, pp. 18–31, 2011.

[23]

P. Ekman, Telling lies: Clues to deceit in the marketplace, politics, and marriage (revised edition). WW Norton & Company, 2009.

[24]

H. Aviezer, Y. Trope, and A. Todorov, “Body cues, not facial expressions, discriminate between intense positive and negative emotions,” Science, vol. 338, no. 6111, pp. 1225–1229, 2012.

[25]

R. E. Axtell, Gestures: the do’s and taboos of body language around the world, 1991.

[26]

J. K. Burgoon, D. B. Buller, and W. G. Woodall, Nonverbal communication: The unspoken dialogue. Harpercollins College Division, 1989.

[27]

R. Z. Khan and N. A. Ibraheem, “Hand gesture recognition: a literature review,” International Journal of Artificial Intelligence & Applications, vol. 3, no. 4, p. 161, 2012.

[28]

F. Noroozi, D. Kaminska, C. Corneanu, T. Sapinski, S. Escalera, and G. Anbarjafari, “Survey on emotional body gesture recognition,” IEEE Transactions on Affective Computing, 2018.

[29]

K. Schindler, L. Van Gool, and B. De Gelder, “Recognizing emotions expressed by body pose: A biologically inspired neural model,” Neural Networks, vol. 21, no. 9, pp. 1238–1246, 2008.

[30]

H. Gunes and M. Piccardi, “A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior,” in 18th International Conference on Pattern Recognition (ICPR’06), vol. 1. IEEE, 2006, pp. 1148–1153.

[31]

G. Castellano, S. D. Villalba, and A. Camurri, “Recognising human emotions from body movement and gesture dynamics,” in International Conference on Affective Computing and Intelligent Interaction. Springer, 2007, pp. 71–82.

[32]

E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. Mcrorie, J.-C. Martin, L. Devillers, S. Abrilian, A. Batliner et al., “The humaine database: Addressing the collection and annotation of naturalistic and induced emotional data,” in International Conference on Affective Computing and Intelligent Interaction. Springer, 2007, pp. 488–500.

[33]

D. Glowinski, A. Camurri, G. Volpe, N. Dael, and K. Scherer, “Technique for automatic emotion recognition by body gesture analysis,” in 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2008, pp. 1–6.

[34]

Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, “Liris-accede: A video database for affective content analysis,” IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 43–55, 2015.

[35]

S. Saha, S. Datta, A. Konar, and R. Janarthanan, “A study on emotion recognition from body gestures using kinect sensor,” in 2014 International Conference on Communication and Signal Processing. IEEE, 2014, pp. 056–060.

[36]

H. Ranganathan, S. Chakraborty, and S. Panchanathan, “Multimodal emotion recognition using deep learning architectures,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2016, pp. 1–9.

[37]

N. Fourati and C. Pelachaud, “Emilya: Emotional body expression in daily actions database,” in LREC, 2014, pp. 3486–3493.

[38]

M. Kipp and J.-C. Martin, “Gesture and emotion: Can basic gestural form features discriminate emotions?” in 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops. IEEE, 2009, pp. 1–8.

[39]

Y. Luo, J. Ye, R. B. Adams, J. Li, M. G. Newman, and J. Z. Wang, “Arbee: Towards automated recognition of bodily expression of emotion in the wild,” International Journal of Computer Vision, vol. 128, no. 1, pp. 1–25, 2020.

[40]

H. Chen, X. Liu, X. Li, H. Shi, and G. Zhao, “Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning,” in 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE, 2019, pp. 1–8.

[41]

H. Gunes, M. Piccardi, and T. Jan, “Face and body gesture recognition for a vision-based multimodal analyser,” in Pan-Sydney Area Workshop on Visual Information Processing. ACS, 2004.

[42]

H. Gunes and M. Piccardi, “Affect recognition from face and body: early fusion vs. late fusion,” in 2005 IEEE International Conference on Systems, Man and Cybernetics, vol. 4. IEEE, 2005, pp. 3437–3443.

[43]

A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency, “Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages,” IEEE Intelligent Systems, vol. 31, no. 6, pp. 82–88, 2016.

[44]

L. Stappen, A. Baird, E. Cambria, and B. W. Schuller, “Sentiment analysis and topic recognition in video transcriptions,” IEEE Intelligent Systems, vol. 36, no. 2, pp. 88–95, 2021.

[45] [45]

imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis,

X. Liu, H. Shi, H. Chen, Z. Yu, X. Li, and G. Zhao, “imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 631–10 642

work page 2021

[46] [46]

3d convolutional neural networks for human action recognition,

S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,”IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221–231, 2012

work page 2012

[47] [47]

Hierarchical recurrent neural network for skeleton based action recognition,

Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” inCVPR, 2015

work page 2015

[48] [48]

Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?

K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?” inProceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555

work page 2018

[49] [49]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

work page 2016

[50] [50]

Batch normalization: Accelerating deep network training by reducing internal covariate shift,

S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” inInterna- tional conference on machine learning. PMLR, 2015, pp. 448–456

work page 2015

[51] [51]

Quo vadis, action recognition? a new model and the kinetics dataset,

J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308

work page 2017

[52] [52]

Slowfast networks for video recognition,

C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” inProceedings of the IEEE international conference on computer vision, 2019, pp. 6202–6211

work page 2019

[53] [53]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[54] [54]

Vivit: A video vision transformer,

A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lu ˇci´c, and C. Schmid, “Vivit: A video vision transformer,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836–6846

work page 2021

[55] [55]

Video swin transformer,

Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3202–3211

work page 2022

[56] [56]

Foundation models defin- ing a new era in vision: a survey and outlook,

M. Awais, M. Naseer, S. Khan, R. M. Anwer, H. Cholakkal, M. Shah, M.-H. Yang, and F. S. Khan, “Foundation models defin- ing a new era in vision: a survey and outlook,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2245–2264, 2025

work page 2025

[57] [57]

Swin transformer: Hierarchical vision transformer using shifted windows,

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022

work page 2021

[58] [58]

Momentum contrast for unsupervised visual representation learning,

K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738

work page 2020

[59] [59]

A simple framework for contrastive learning of visual representations,

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” inInternational conference on machine learning. PMLR, 2020, pp. 1597–1607

work page 2020

[60] [60]

Masked autoencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y. Li, P . Doll ´ar, and R. Girshick, “Masked autoencoders are scalable vision learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009

work page 2022

[61] [61]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar- wal, G. Sastry, A. Askell, P . Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

work page 2021

[62] [62]

Scaling up visual and vision- language representation learning with noisy text supervision,

C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision- language representation learning with noisy text supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 4904–4916

work page 2021

[63] [63]

Image as a foreign 13 language: Beit pretraining for vision and vision-language tasks,

W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, O. K. Mohammed, S. Singhal, S. Somet al., “Image as a foreign 13 language: Beit pretraining for vision and vision-language tasks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 175–19 186

work page 2023

[64] [64]

Sigmoid loss for language image pre-training,

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 11 975–11 986

work page 2023

[65] [65]

Florence-2: Advancing a unified representation for a variety of vision tasks,

B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan, “Florence-2: Advancing a unified representation for a variety of vision tasks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818–4829

work page 2024

[66] [66]

Internvl: Scaling up vision foun- dation models and aligning for generic visual-linguistic tasks,

Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Luet al., “Internvl: Scaling up vision foun- dation models and aligning for generic visual-linguistic tasks,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 24 185–24 198

work page 2024

[67] [67]

Grounded language- image pre-training,

L. H. Li, P . Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwanget al., “Grounded language- image pre-training,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 965–10 975

work page 2022

[68] [68]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Suet al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” inEuropean conference on computer vision. Springer, 2024, pp. 38–55

work page 2024

[69] [69]

Sam 2: Segment anything in images and videos,

N. Ravi, V . Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafsonet al., “Sam 2: Segment anything in images and videos,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 28 085–28 128

work page 2025

[70] [70]

SAM 3: Segment Anything with Concepts

N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huanget al., “Sam 3: Segment anything with concepts,”arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[71] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.

[72] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 684–10 695.

[73] Z. Tong, Y. Song, J. Wang, and L. Wang, “Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training,” Advances in Neural Information Processing Systems, vol. 35, pp. 10 078–10 093, 2022.

[74] L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao, “Videomae v2: Scaling video masked autoencoders with dual masking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 549–14 560.

[75] Y. Wang, K. Li, X. Li, J. Yu, Y. He, G. Chen, B. Pei, R. Zheng, Z. Wang, Y. Shi et al., “Internvideo2: Scaling foundation models for multimodal video understanding,” in European Conference on Computer Vision. Springer, 2024, pp. 396–416.

[76] W. Zhu, X. Ma, Z. Liu, L. Liu, W. Wu, and Y. Wang, “Motionbert: A unified perspective on learning human motion representations,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 085–15 099.

[77] H. Yan, Y. Liu, Y. Wei, Z. Li, G. Li, and L. Lin, “Skeletonmae: Graph-based masked autoencoder for skeleton sequence pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5606–5618.

[78] Y. Mao, J. Deng, W. Zhou, Y. Fang, W. Ouyang, and H. Li, “Masked motion predictors are strong 3d action representation learners,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 181–10 191.

[79] S. Sun, D. Liu, J. Dong, X. Qu, J. Gao, X. Yang, X. Wang, and M. Wang, “Unified multi-modal unsupervised representation learning for skeleton-based action understanding,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2973–2984.

[80] X. Wang, Z. Fang, X. Li, X. Li, C. Chen, and M. Liu, “Skeleton-in-context: Unified skeleton sequence modeling with in-context learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2436–2446.