iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning
Pith reviewed 2026-05-20 14:06 UTC · model grok-4.3
The pith
A dataset of micro-gestures from tennis interviews shows that body language cues substantially improve automated emotion understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By releasing iMiGUE-3K and MG-FMs, the work establishes that micro-gesture analysis significantly improves emotion understanding, as shown through systematic testing of representative methods on five tasks including unsupervised, semi-supervised, and supervised micro-gesture recognition plus retrieval and emotion recognition.
What carries the argument
The iMiGUE-3K dataset of 32 annotated micro-gesture classes collected via model-based crowd-sourcing from in-the-wild tennis interview videos, together with the MG-FMs discriminative foundation model for transferable gesture representation learning.
If this is right
- Micro-gesture recognition becomes feasible in unsupervised, semi-supervised, and supervised regimes on a shared benchmark.
- The foundation model supports transferable representations for gesture retrieval and related downstream tasks.
- Emotion recognition systems gain a new modality that captures subconscious cues not visible in faces or speech.
- The benchmark enables progress toward applications in psychological diagnostics and human-computer interaction.
Where Pith is reading between the lines
- The same collection approach could scale to other high-stakes public settings to gather more diverse natural micro-gestures.
- Combining micro-gesture signals with existing modalities may increase robustness of emotion models in noisy real-world environments.
- Self-supervised pre-training on this volume of data could yield representations useful for broader body-language understanding tasks.
Load-bearing premise
The model-based crowd-sourcing strategy produces accurate, unbiased fine-grained annotations for the 32 micro-gesture classes that generalize beyond tennis press interviews.
What would settle it
An independent test set of everyday human interactions where models trained on iMiGUE-3K show no accuracy gain in emotion recognition compared with face-and-speech baselines.
Original abstract
Emotion understanding is a fundamental challenge in affective computing and artificial intelligence. While existing approaches predominantly focus on facial expressions and speech, they often overlook the rich emotional cues conveyed through body language. Recently, micro-gestures (MGs), unintentional, subconscious movements driven by inner feelings, have attracted increasing attention as an alternative to other cues. However, there are no existing large-scale datasets supporting the pre-training of the MG foundation model. To advance MG research, we present a new benchmark for micro-gesture-based emotion understanding, featuring key contributions with a novel dataset (iMiGUE-3K) and a series of foundation models for different tasks. Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K, the largest MG dataset to date. It comprises video recordings from 332 distinct professional tennis players' public press interviews over the past seven years, totaling more than 3.4K long video clips and 37 million frames. The dataset includes 32 micro-gesture classes with rich descriptive annotations, making it the first large-scale, in-the-wild, video dataset for fine-grained gesture-based emotion analysis. Built on iMiGUE-3K, we propose MG-FMs, a discriminative foundation model for transferable gesture presentation learning. Based on the foundation model, we establish five comprehensive evaluation tasks: MG recognition (unsupervised, semi-supervised, supervised), MG retrieval, and MG emotion recognition. Our systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding. We hope this work can provide comprehensive tools for MG analysis and set a solid foundation for future research in psychological diagnostics, affective computing, and advanced human-computer interaction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces iMiGUE-3K, the largest in-the-wild video dataset for micro-gesture analysis, constructed from 3.4K clips of 332 professional tennis players' press interviews using a model-based crowd-sourcing strategy to annotate 32 micro-gesture classes. It proposes MG-FMs, a discriminative foundation model for transferable gesture representation learning via self-supervised methods, and evaluates it across five tasks: unsupervised/semi-supervised/supervised MG recognition, MG retrieval, and MG emotion recognition, claiming that micro-gesture analysis significantly improves emotion understanding over existing approaches focused on facial expressions and speech.
Significance. If the annotations prove reliable and the reported gains generalize, this benchmark and the associated foundation models would address a clear gap in affective computing by enabling scalable study of subconscious body-language cues for emotion recognition. The scale (37M frames), multi-task evaluation protocol, and self-supervised pre-training approach represent concrete strengths that could support reproducible progress in psychological diagnostics and HCI.
major comments (2)
- [Dataset Construction] Dataset Construction section: The model-based crowd-sourcing procedure for producing the 32-class fine-grained micro-gesture annotations is presented without any reported validation metrics (inter-annotator agreement, human verification of model outputs, or cross-domain checks). This directly undermines the central claim that MG features improve emotion recognition, because systematic label errors or interview-specific biases would render the five evaluation tasks unreliable.
- [Evaluation] Evaluation section: The abstract and results claim that 'systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding,' yet no quantitative numbers, error bars, data-split details, or baseline comparisons appear in the provided summary or abstract. Without these, the magnitude and robustness of the claimed gains cannot be assessed.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta on the MG emotion recognition task) to support the improvement claim.
- A comparison table of iMiGUE-3K statistics against prior micro-gesture or gesture datasets is missing and would help readers gauge the scale advantage.
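The first major comment asks for inter-annotator agreement on the 32-class labels. As a minimal sketch of one common way to report it, the snippet below computes Cohen's kappa for two annotators. The choice of Cohen's kappa, the function name `cohen_kappa`, and the gesture labels are illustrative assumptions; the paper does not state which agreement statistic, if any, was used.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same clips."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Fraction of clips on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's class frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical class throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels, not taken from iMiGUE-3K.
annotator_1 = ["touch_face", "fold_arms", "touch_face", "shrug", "fold_arms", "shrug"]
annotator_2 = ["touch_face", "fold_arms", "shrug", "shrug", "fold_arms", "touch_face"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on four of six clips and chance agreement is one third, so kappa is 0.5. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.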
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, with honest indications of where revisions will be made to strengthen the paper.
- Referee: [Dataset Construction] Dataset Construction section: The model-based crowd-sourcing procedure for producing the 32-class fine-grained micro-gesture annotations is presented without any reported validation metrics (inter-annotator agreement, human verification of model outputs, or cross-domain checks). This directly undermines the central claim that MG features improve emotion recognition, because systematic label errors or interview-specific biases would render the five evaluation tasks unreliable.
  Authors: We agree that explicit validation metrics are essential to substantiate the annotation quality and support downstream claims. The current manuscript describes the model-based crowd-sourcing strategy but does not report inter-annotator agreement, human verification rates, or cross-domain checks. In the revised version we will add a new paragraph in the Dataset Construction section that includes these metrics, computed on a held-out verification subset, along with details on how model-proposed labels were cross-checked by multiple human annotators. This addition will directly address concerns about label reliability and bias.
  Revision: yes
- Referee: [Evaluation] Evaluation section: The abstract and results claim that 'systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding,' yet no quantitative numbers, error bars, data-split details, or baseline comparisons appear in the provided summary or abstract. Without these, the magnitude and robustness of the claimed gains cannot be assessed.
  Authors: The full Evaluation section of the manuscript reports quantitative results for all five tasks, including accuracy metrics, baseline comparisons, and data-split protocols. However, these specifics are not summarized with numbers in the abstract. We will revise the abstract to include key quantitative findings (e.g., the absolute and relative improvements in emotion recognition accuracy when incorporating micro-gesture features) and will ensure that error bars and split details are explicitly referenced or tabulated in the results for immediate assessment of robustness.
  Revision: yes
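The exchange above turns on error bars for the claimed accuracy gain. One hedged way to produce them is a paired percentile bootstrap over test clips, sketched below. The function name `paired_bootstrap_delta`, the 95% level, the resample count, and the per-clip correctness vectors are all hypothetical; nothing here reflects the paper's actual protocol or results.

```python
import random


def paired_bootstrap_delta(correct_base, correct_mg, n_resamples=2000, seed=0):
    """Accuracy gain of the MG-augmented model over a baseline, with a
    95% percentile bootstrap interval obtained by resampling clips jointly."""
    if len(correct_base) != len(correct_mg) or not correct_base:
        raise ValueError("need two equally long, non-empty 0/1 lists")
    n = len(correct_base)
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    # Per-clip difference in correctness: +1, 0, or -1.
    diffs = [m - b for b, m in zip(correct_base, correct_mg)]
    point = sum(diffs) / n
    samples = []
    for _ in range(n_resamples):
        # Resample clip indices with replacement and recompute the gain.
        samples.append(sum(diffs[rng.randrange(n)] for _ in range(n)) / n)
    samples.sort()
    lower = samples[int(0.025 * n_resamples)]
    upper = samples[int(0.975 * n_resamples) - 1]
    return point, lower, upper


# Hypothetical per-clip outcomes (1 = emotion predicted correctly).
baseline_correct = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
with_mg_correct = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
delta, lo, hi = paired_bootstrap_delta(baseline_correct, with_mg_correct)
```

Resampling clips jointly keeps each clip's two predictions paired, which matters because both models are scored on the same test set. An interval that excludes zero would support the claim of a gain on that split; it says nothing about generalization beyond tennis interviews.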
Circularity Check
No significant circularity in derivation chain
The paper introduces a new large-scale dataset iMiGUE-3K via model-based crowd-sourcing from tennis press interviews and applies standard self-supervised learning to train MG-FMs. It then reports empirical results on five evaluation tasks (MG recognition in unsupervised/semi-supervised/supervised settings, retrieval, and emotion recognition). No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or self-citation chains within the paper; the performance gains are demonstrated through external evaluations on the newly collected benchmark rather than definitional equivalences.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K... 32 micro-gesture classes... MG-FMs... five comprehensive evaluation tasks: MG recognition... MG emotion recognition.
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: MG-FM-Skele... dual-stream Dense Spatio-Temporal Encoder... multi-grained feature decorrelation objective... $L_{\mathrm{fd}}(H_r) = \lambda_{\mathrm{sim}} L_{\mathrm{sim}} + \lambda_{\mathrm{vac}} L_{\mathrm{vac}} + \lambda_{\mathrm{xcorr}} L_{\mathrm{xcorr}}$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.