Rethinking the Role of Feature Engineering and Learning Strategies in Few-Shot Hidden Emotion Recognition

Dan Guo; Fei Wang; Haixu Liu; Jihao Gu; Kun Li; Qi Wang; Xiaochuan Guo; Yufei Wang; Yuxin Liu

arxiv: 2606.31249 · v1 · pith:O5ALRKUCnew · submitted 2026-06-30 · 💻 cs.CV

Rethinking the Role of Feature Engineering and Learning Strategies in Few-Shot Hidden Emotion Recognition

Xiaochuan Guo , Jihao Gu , Haixu Liu , Yuxin Liu , Qi Wang , Yufei Wang , Fei Wang , Kun Li

show 1 more author

Dan Guo

This is my paper

Pith reviewed 2026-07-01 06:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords hidden emotion recognitionmicro-dynamic tasksrepresentation collapsecross-attentionmulti-modal video featuresmultiple instance learningfew-shot video analysis

0 comments

The pith

Cross-attention with static base and dynamic offset features tops a hidden emotion recognition challenge at 0.76923 accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a multi-modal framework that secured first place in Track 3 of the 4th EI-MIGA-IJCAI Challenge for recognizing weak and sparse emotions in long videos. It combines 2D/3D skeletons, facial Blendshapes, DINOv2/v3, X-CLIP, and Gemini features. The central mechanism feeds static pose features as Query into cross-attention while using micro-motion differentials as Key and Value, then applies Multiple Instance Learning adaptive pooling to isolate momentary signals amid sequence noise. The work also documents representation collapse in general vision foundation models on micro-dynamic tasks and traces it to shortcut learning and leaderboard-driven memorization.

Core claim

The authors' solution achieved a test accuracy of 0.76923 by integrating multi-source features into a compact temporal modeling framework. This framework uses a cross-attention mechanism with static Base features as Query and dynamic Offset features as Key/Value to eliminate identity biases, paired with MIL-based adaptive pooling to extract instantaneous emotions from noisy long sequences. The work additionally reveals representation collapse in vision foundation models for micro-dynamic tasks and attributes pseudo-generalization to shortcut learning and rote memorization driven by public leaderboards.

What carries the argument

Cross-attention mechanism that takes static pose features (Base) as Query and dynamic micro-motion differential features (Offset) as Key and Value, together with Multiple Instance Learning adaptive pooling.

If this is right

Static biases from body shape and identity are removed by treating poses as fixed queries.
Background noise is suppressed while momentary emotional signals are retained across long sequences.
General vision foundation models enter representation collapse on micro-dynamic tasks.
Networks shift to shortcut learning and rote memorization when trained against public leaderboards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Base-Offset split could be tested on other subtle temporal signals such as micro-expressions or gait anomalies.
Leaderboard-driven training may systematically favor memorization over genuine feature learning in any sparse-signal domain.
Evaluating the same models on private or synthetic micro-dynamic datasets would isolate whether collapse is leaderboard-induced.

Load-bearing premise

The cross-attention with Base as Query and Offset as Key/Value plus MIL adaptive pooling successfully extracts instantaneous emotions while suppressing background noise in long sequences.

What would settle it

An ablation on the challenge test set showing that removing the cross-attention or the MIL pooling leaves accuracy unchanged, or a direct measurement showing vision foundation models maintain distinct representations on micro-dynamic emotion data.

Figures

Figures reproduced from arXiv: 2606.31249 by Dan Guo, Fei Wang, Haixu Liu, Jihao Gu, Kun Li, Qi Wang, Xiaochuan Guo, Yufei Wang, Yuxin Liu.

**Figure 1.** Figure 1: Comparison of keypoint detection between MediaPipe and OpenPose. behaviors. To address this, we introduce a Vision-Language Large Model (Gemini) to perform high-level semantic cognitive reasoning on video frame sequences. By combining advanced prompting strategies of “Chain-of-Thought and Reflection” (CoT+Reflection), we compel the large model to objectively extract Facial Action Coding System Action Unit… view at source ↗

**Figure 2.** Figure 2: Base-Offset Cross-Attention Model Structure. surveillance datasets of concealed emotions, the “hard examples” in the model’s eyes are often absolute noise and defective footage caused by severe camera defocusing, physical occlusions, or unavoidable subjective human annotation errors. This loss function parameter configuration explicitly prevents the model from over-focusing on invalid, extreme hard example… view at source ↗

read the original abstract

In this paper, we present the solution developed by our team, XInsight Lab, which achieved first place in Track 3 of the 4th EI-MIGA-IJCAI Challenge with a test accuracy of 0.76923. To address the challenge of weak and sparse implicit emotion evidence in long videos, this paper extends the winning solution from the previous competition and proposes a compact multi-modal temporal modeling framework. The framework integrates and evaluates the effects of multi-source features, including 2D/3D skeletons, facial expression Blendshapes, DINOv2/v3 vision foundation models, X-CLIP video features, and Gemini semantic priors. Architecturally, we propose a cross-attention mechanism that utilizes static pose features, denoted as Base, as the Query and dynamic micro-motion differential features, denoted as Offset, as the Key and Value. By capturing local relative velocities, this mechanism eliminates static biases related to individual body shape and identity. Concurrently, an adaptive pooling method based on Multiple Instance Learning is employed to extract instantaneous emotions while suppressing background noise in long sequences. Finally, the paper reveals the representation collapse phenomenon of general vision foundation models in micro-dynamic tasks, and analyzes the underlying mechanisms where networks fall into public-leaderboard-driven pseudo-generalization due to shortcut learning and rote memorization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a competition report that describes a practical multi-modal pipeline for hidden emotion recognition and reports a first-place result, but the claims about revealing representation collapse lack supporting measurements or ablations.

read the letter

The paper's core contribution is a concrete winning entry for the EI-MIGA challenge Track 3. It extends the team's prior solution by pulling in standard features—2D/3D skeletons, Blendshapes, DINOv2/v3, X-CLIP, and Gemini priors—then fuses them through a cross-attention block that treats static pose as the query and micro-motion offsets as key/value, plus MIL-based adaptive pooling to handle long sequences. That setup is a sensible engineering choice for suppressing identity bias and background noise while focusing on weak dynamic cues, and the reported test accuracy of 0.76923 is a verifiable outcome.

The work is useful for anyone replicating or extending affective computing pipelines on similar video data. It shows how off-the-shelf foundation model features can be combined without inventing new theory, and the emphasis on relative velocities rather than absolute poses is a clear practical step.

The main gap is the interpretive section on representation collapse in general vision models for micro-dynamic tasks. The text asserts shortcut learning and leaderboard-driven pseudo-generalization, yet supplies no feature-space diagnostics, similarity metrics, or ablation isolating rote memorization from genuine generalization. The cross-attention and pooling claims also rest on the final score without reported controls or validation details. These points are asserted after the architecture description rather than demonstrated through targeted experiments.

This paper is aimed at researchers tracking the specific challenge or building video-based emotion systems. A reader who needs a working recipe with known components will find value; someone looking for new mechanisms or falsifiable claims about foundation model behavior will not. It deserves peer review because the competition result and pipeline are concrete and reproducible enough to warrant referee scrutiny, even if the collapse analysis needs strengthening.

Referee Report

2 major / 2 minor

Summary. The paper presents the winning solution for Track 3 of the 4th EI-MIGA-IJCAI Challenge, achieving a test accuracy of 0.76923 using a multi-modal framework that combines 2D/3D skeletons, facial Blendshapes, DINOv2/v3, X-CLIP, and Gemini features. It introduces a cross-attention mechanism using static pose (Base) as Query and micro-motion differentials (Offset) as Key/Value to capture relative velocities, along with MIL-based adaptive pooling for extracting instantaneous emotions from long sequences. The work also claims to identify representation collapse in general vision foundation models for micro-dynamic tasks caused by shortcut learning and leaderboard-driven memorization.

Significance. If the accuracy is robust and the collapse diagnosis is backed by quantitative diagnostics, the work would offer a useful benchmark for few-shot micro-dynamic emotion recognition and highlight practical limitations of vision foundation models. The concrete leaderboard result and architectural choices for handling sparse signals in long videos provide empirical value, though the interpretive claims require additional evidence to elevate significance beyond a competition report.

major comments (2)

[Abstract] Abstract: The claim that the solution 'reveals the representation collapse phenomenon of general vision foundation models in micro-dynamic tasks' and analyzes mechanisms of 'public-leaderboard-driven pseudo-generalization due to shortcut learning and rote memorization' is asserted without any supporting quantitative evidence such as feature-space similarity scores, t-SNE analyses with metrics, or ablation experiments isolating shortcut vs. genuine generalization.
[Abstract] Abstract: The reported test accuracy of 0.76923 is given without information on validation splits, baseline comparisons, error bars, or ablation studies, leaving unclear whether the cross-attention (Base as Query, Offset as Key/Value) plus MIL pooling is responsible for the result or if performance reflects other factors.

minor comments (2)

[Abstract] Abstract: The description of extending the prior winning solution does not detail the specific modifications or their incremental impact.
[Abstract] Abstract: No dataset statistics (e.g., average sequence length, number of micro-emotion instances) are supplied to ground claims about long sequences and sparse evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive suggestions. We address the two major comments on the abstract below, agreeing that additional quantitative support and experimental details will strengthen the manuscript. We will perform a major revision incorporating these points.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that the solution 'reveals the representation collapse phenomenon of general vision foundation models in micro-dynamic tasks' and analyzes mechanisms of 'public-leaderboard-driven pseudo-generalization due to shortcut learning and rote memorization' is asserted without any supporting quantitative evidence such as feature-space similarity scores, t-SNE analyses with metrics, or ablation experiments isolating shortcut vs. genuine generalization.

Authors: We acknowledge that the abstract states the collapse observation without explicit quantitative diagnostics such as similarity scores or t-SNE metrics. The full manuscript grounds the claim in empirical observations from the competition (e.g., consistent failure modes of DINOv2/v3 and X-CLIP on micro-motion despite strong static performance), but we agree these are primarily qualitative. In revision we will add (i) pairwise cosine similarity matrices between static and dynamic feature embeddings, (ii) t-SNE visualizations with quantitative cluster-separation metrics, and (iii) controlled ablations that isolate shortcut learning by training on leaderboard-only vs. held-out splits. These additions will be placed in a new subsection of the experiments. revision: yes
Referee: [Abstract] Abstract: The reported test accuracy of 0.76923 is given without information on validation splits, baseline comparisons, error bars, or ablation studies, leaving unclear whether the cross-attention (Base as Query, Offset as Key/Value) plus MIL pooling is responsible for the result or if performance reflects other factors.

Authors: The 0.76923 figure is the official test-set score returned by the competition organizers. Our internal development used a 5-fold temporal split of the training videos (ensuring no subject overlap) and we tracked per-fold accuracy; however, these details and the corresponding baseline comparisons (single-modality, no cross-attention, mean pooling) were omitted from the abstract to meet length constraints. We will expand the abstract and add a dedicated “Ablation Studies” subsection that reports (a) validation accuracy with standard deviation across folds, (b) incremental gains from Base-Query/Offset-KeyValue cross-attention, and (c) MIL pooling versus fixed temporal pooling. Error bars from three independent runs with different random seeds will also be included. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical competition entry with no derivation chain

full rationale

The manuscript is an empirical competition solution that integrates multi-source features into a cross-attention + MIL framework and reports a leaderboard score of 0.76923. No equations, parameter-fitting steps, or formal derivations are present in the provided text. The claim of revealing representation collapse is an interpretive observation attached to the architectural choices and results, not a mathematical reduction that collapses to fitted inputs or prior self-citations by construction. Self-reference to a prior competition win is noted but does not bear any derivational load. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the framework relies on standard multi-modal fusion and attention techniques whose assumptions are not detailed here.

pith-pipeline@v0.9.1-grok · 5790 in / 1123 out tokens · 33620 ms · 2026-07-01T06:23:55.533021+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 14 canonical work pages · 8 internal anchors

[1]

In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition

Baltrušaitis, T., Zadeh, A., Lim, Y.C., Morency, L.P.: Openface 2.0: Facial behavior analysis toolkit. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition. pp. 59–66 (2018)

2018
[2]

IEEE Transactions on Pattern Analysis and Machine Intelligence43(1), 172–186 (2021) 16 X

Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: Openpose: Realtime multi- person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence43(1), 172–186 (2021) 16 X. Guo, J. Gu, H. Liu et al

2021
[3]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6299–6308 (2017)

2017
[4]

arXiv preprint arXiv:2408.03097 (2024)

Chen, G., Wang, F., Li, K., Wu, Z., Fan, H., Yang, Y., Wang, M., Guo, D.: Prototype learning for micro-gesture classification. arXiv preprint arXiv:2408.03097 (2024)

work page arXiv 2024
[5]

International Journal of Computer Vision131, 1346–1366 (2023)

Chen, H., Shi, H., Liu, X., Li, X., Zhao, G.: Smg: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis. International Journal of Computer Vision131, 1346–1366 (2023)

2023
[6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9268–9277 (2019)

2019
[7]

In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. pp. 4171–4186 (2019)

2019
[8]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2969–2978 (2022)

2022
[9]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

In: 2025 IEEE 6th International Conference on Pattern Recognition and Machine Learning (PRML)

Gu, J.: Performance analysis of traditional vqa models under limited computational resources. In: 2025 IEEE 6th International Conference on Pattern Recognition and Machine Learning (PRML). pp. 323–328. IEEE (2025)

2025
[11]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Gu, J., Li, K., Wang, F., Wei, Y., Wu, Z., Fan, H., Wang, M.: Motion matters: Motion-guided modulation network for skeleton-based micro-action recognition. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 5461–5470 (2025)

2025
[12]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops

Gu, J., Li, K., Wang, H., Akşit, K.: Text-guided fine-grained video anomaly under- standing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 1251–1261 (June 2026)

2026
[13]

arXiv preprint arXiv:2507.08344 (2025)

Gu, J., Wang, F., Li, K., Wei, Y., Wu, Z., Guo, D.: Mm-gesture: towards precise micro-gesture recognition through multimodal fusion. arXiv preprint arXiv:2507.08344 (2025)

work page arXiv 2025
[14]

IEEE Transactions on Circuits and Systems for Video Technology34(7), 6238–6252 (2024)

Guo, D., Li, K., Hu, B., Zhang, Y., Wang, M.: Benchmarking micro-action recog- nition: Dataset, methods, and applications. IEEE Transactions on Circuits and Systems for Video Technology34(7), 6238–6252 (2024)

2024
[15]

In: Proceedings of the 32nd ACM International Conference on Multimedia

Guo, D., Li, X., Li, K., Chen, H., Hu, J., Zhao, G., Yang, Y., Wang, M.: Mac 2024: Micro-action analysis grand challenge. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 11304–11305 (2024)

2024
[16]

In: MiGA@IJCAI (2024)

Huang, H., Wang, Y., Linghu, K., Xia, Z.: Multi-modal micro-gesture classification via multi-scale heterogeneous ensemble network. In: MiGA@IJCAI (2024)

2024
[17]

Jiang, T., Lu, P., Zhang, L., Ma, N., Han, R., Lyu, C., Li, Y., Chen, K.: Rtmpose: Real-time multi-person pose estimation based on mmpose (2023)

2023
[18]

ACM Transactions on Graphics42(4), 1–14 (2023)

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics42(4), 1–14 (2023)

2023
[19]

In: European Conference on Computer Vision (2024) ReFELS 17

Khirodkar, R., Bagautdinov, T., Martinez, J., Zhaoen, S., James, A., Selednik, P., Anderson, S., Saito, S.: Sapiens: Foundation for human vision models. In: European Conference on Computer Vision (2024) ReFELS 17

2024
[20]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, K., Gu, J., Wang, F., Wu, Z., Fan, H., Guo, D.: Ma-bench: Towards fine-grained micro-action understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20118–20128 (2026)

2026
[21]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Li, K., Guo, D., Chen, G., Fan, C., Xu, J., Wu, Z., Fan, H., Wang, M.: Prototypical calibrating ambiguous samples for micro-action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4815–4823 (2025)

2025
[22]

In: Proceedings of the 31st ACM International Conference on Multimedia

Li, K., Guo, D., Chen, G., Liu, F., Wang, M.: Data augmentation for human behavior analysis in multi-person conversations. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 9516–9520 (2023)

2023
[23]

arXiv preprint arXiv:2307.10624 (2023)

Li, K., Guo, D., Chen, G., Peng, X., Wang, M.: Joint skeletal and semantic embedding loss for micro-gesture classification. arXiv preprint arXiv:2307.10624 (2023)

work page arXiv 2023
[24]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Li, K., Guo, D., Li, X., Chen, H., Liu, P., Wang, F., Hu, J., Zhao, G., Wang, M.: Mac 2025: The 2nd micro-action analysis grand challenge. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 14216–14221 (2025)

2025
[25]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Li, K., Liu, P., Guo, D., Wang, F., Wu, Z., Fan, H., Wang, M.: Mmad: Multi-label micro-action detection in videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13225–13236 (2025)

2025
[26]

ACM Transactions on Graphics36(6), 1–17 (2017)

Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4d scans. ACM Transactions on Graphics36(6), 1–17 (2017)

2017
[27]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth any- thing 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Lin, J., Gan, C., Han, S.: Tsm: Temporal shift module for efficient video under- standing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7083–7093 (2019)

2019
[29]

In: Proceedings of the IEEE International Conference on Computer Vision

Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)

2017
[30]

arXiv preprint arXiv:2507.09512 (2025)

Liu, P., Li, K., Wang, F., Wei, Y., She, J., Guo, D.: Online micro-gesture recog- nition using data augmentation and spatial-temporal attention. arXiv preprint arXiv:2507.09512 (2025)

work page arXiv 2025
[31]

arXiv preprint arXiv:2407.04490 (2024)

Liu, P., Wang, F., Li, K., Chen, G., Wei, Y., Tang, S., Wu, Z., Guo, D.: Micro-gesture online recognition using learnable query points. arXiv preprint arXiv:2407.04490 (2024)

work page arXiv 2024
[32]

Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

Liu, T., Li, K., Wang, F., Chen, J., Wu, Z., Gu, J., Liu, H., Guo, D.: Self-supervised learning matters: A simple ensemble solution for micro-gesture recognition. arXiv preprint arXiv:2606.09261 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Liu, X., Shi, H., Chen, H., Yu, Z., Li, X., Zhao, G.: imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10631–10642 (2021)

2021
[34]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3202–3211 (2022)

2022
[35]

MediaPipe: A Framework for Building Perception Pipelines

Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M.G., Lee, J., et al.: Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1906
[36]

In: European Conference on Computer Vision

Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., Ling, H.: Expanding language-image pretrained models for general video recognition. In: European Conference on Computer Vision. pp. 1–18 (2022) 18 X. Guo, J. Gu, H. Liu et al

2022
[37]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

Shen, X., Li, K., Wang, F., Qian, W., Jiang, J., Guo, D.: Spatial-temporal decoupled adapter for micro-gesture online recognition. arXiv preprint arXiv:2606.07355 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

DINOv3

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

In: European Conference on Computer Vision

Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: European Conference on Computer Vision. pp. 402–419. Springer (2020)

2020
[41]

In: Advances in Neural Information Processing Systems (2017)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017)

2017
[42]

iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning

Wang, C., Chen, H., Wei, H., Yang, Y., Chen, Y., Zhao, G.: imigue-3k: A large-scale benchmark for micro-gesture analysis with self-supervised learning. arXiv preprint arXiv:2605.17179 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Wang, F., Guo, D., Li, K., Wang, M.: Eulermormer: Robust eulerian motion magnification via dynamic filtering within transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 5345–5353 (2024)

2024
[44]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wang, F., Guo, D., Li, K., Zhong, Z., Wang, M.: Frequency decoupling for mo- tion magnification via multi-level isomorphic architecture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18984– 18994 (2024)

2024
[45]

In: Companion Proceedings of the ACM on Web Conference 2025

Wang, F., Li, K., Nie, Y., Duan, Z., Zou, P., Wu, Z., Wang, Y., Wei, Y.: Exploiting ensemble learning for cross-view isolated sign language recognition. In: Companion Proceedings of the ACM on Web Conference 2025. pp. 2453–2457 (2025)

2025
[46]

In: Proceedings of the ACM Web Conference 2026

Wang, F., Yang, J., Chen, J., Liu, Y., Li, K., Wei, Y., Guo, D., Wang, M.: Xinsight: Integrative stage-consistent psychological counseling support agents for digital well-being. In: Proceedings of the ACM Web Conference 2026. pp. 9297–9308 (2026)

2026
[47]

IEEE Transactions on Multimedia (2026)

Wang, M., Liu, Z., Li, K., Wang, Y., Wang, Y., Wei, Y., Wang, F.: Task-generalized adaptive cross-domain learning for multimodal image fusion. IEEE Transactions on Multimedia (2026)

2026
[48]

IEEE Transactions on Affective Computing pp

Wang, R., Li, K., Tong, A., Xu, J., Guo, D., Wang, M.: Gait emotion recognition via uncertainty-oriented class discriminative learning. IEEE Transactions on Affective Computing pp. 1–14 (2026)

2026
[49]

Water Research284, 123994 (2025)

Wang, W., Wang, F., Wang, Z., Shi, X., Xu, C.: Robust s3former deep learning model for the direct diagnosis and prediction of natural organic matter (nom) from three-dimensional excitation-emission-matrix (3d-eem) data. Water Research284, 123994 (2025)

2025
[50]

arXiv preprint arXiv:2602.08057 (2026)

Wang, Y., Liu, H., Xu, T., Shi, C., Xing, H.: Weak to strong: Vlm-based pseudo- labeling as a weakly supervised training strategy in multimodal video-based hidden emotion understanding tasks. arXiv preprint arXiv:2602.08057 (2026)

work page arXiv 2026
[51]

In: Advances in Neural Information Processing Systems (2022)

Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose: Simple vision transformer baselines for human pose estimation. In: Advances in Neural Information Processing Systems (2022)

2022
[52]

In: Proceedings of the AAAI Conference on Artificial Intelligence (2018)

Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018)

2018
[53]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2023)

Yang, Z., Zeng, A., Yuan, C., Li, Y.: Effective whole-body pose estimation with two-stages distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2023)

2023

[1] [1]

In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition

Baltrušaitis, T., Zadeh, A., Lim, Y.C., Morency, L.P.: Openface 2.0: Facial behavior analysis toolkit. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition. pp. 59–66 (2018)

2018

[2] [2]

IEEE Transactions on Pattern Analysis and Machine Intelligence43(1), 172–186 (2021) 16 X

Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: Openpose: Realtime multi- person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence43(1), 172–186 (2021) 16 X. Guo, J. Gu, H. Liu et al

2021

[3] [3]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6299–6308 (2017)

2017

[4] [4]

arXiv preprint arXiv:2408.03097 (2024)

Chen, G., Wang, F., Li, K., Wu, Z., Fan, H., Yang, Y., Wang, M., Guo, D.: Prototype learning for micro-gesture classification. arXiv preprint arXiv:2408.03097 (2024)

work page arXiv 2024

[5] [5]

International Journal of Computer Vision131, 1346–1366 (2023)

Chen, H., Shi, H., Liu, X., Li, X., Zhao, G.: Smg: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis. International Journal of Computer Vision131, 1346–1366 (2023)

2023

[6] [6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9268–9277 (2019)

2019

[7] [7]

In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. pp. 4171–4186 (2019)

2019

[8] [8]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2969–2978 (2022)

2022

[9] [9]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

In: 2025 IEEE 6th International Conference on Pattern Recognition and Machine Learning (PRML)

Gu, J.: Performance analysis of traditional vqa models under limited computational resources. In: 2025 IEEE 6th International Conference on Pattern Recognition and Machine Learning (PRML). pp. 323–328. IEEE (2025)

2025

[11] [11]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Gu, J., Li, K., Wang, F., Wei, Y., Wu, Z., Fan, H., Wang, M.: Motion matters: Motion-guided modulation network for skeleton-based micro-action recognition. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 5461–5470 (2025)

2025

[12] [12]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops

Gu, J., Li, K., Wang, H., Akşit, K.: Text-guided fine-grained video anomaly under- standing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 1251–1261 (June 2026)

2026

[13] [13]

arXiv preprint arXiv:2507.08344 (2025)

Gu, J., Wang, F., Li, K., Wei, Y., Wu, Z., Guo, D.: Mm-gesture: towards precise micro-gesture recognition through multimodal fusion. arXiv preprint arXiv:2507.08344 (2025)

work page arXiv 2025

[14] [14]

IEEE Transactions on Circuits and Systems for Video Technology34(7), 6238–6252 (2024)

Guo, D., Li, K., Hu, B., Zhang, Y., Wang, M.: Benchmarking micro-action recog- nition: Dataset, methods, and applications. IEEE Transactions on Circuits and Systems for Video Technology34(7), 6238–6252 (2024)

2024

[15] [15]

In: Proceedings of the 32nd ACM International Conference on Multimedia

Guo, D., Li, X., Li, K., Chen, H., Hu, J., Zhao, G., Yang, Y., Wang, M.: Mac 2024: Micro-action analysis grand challenge. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 11304–11305 (2024)

2024

[16] [16]

In: MiGA@IJCAI (2024)

Huang, H., Wang, Y., Linghu, K., Xia, Z.: Multi-modal micro-gesture classification via multi-scale heterogeneous ensemble network. In: MiGA@IJCAI (2024)

2024

[17] [17]

Jiang, T., Lu, P., Zhang, L., Ma, N., Han, R., Lyu, C., Li, Y., Chen, K.: Rtmpose: Real-time multi-person pose estimation based on mmpose (2023)

2023

[18] [18]

ACM Transactions on Graphics42(4), 1–14 (2023)

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics42(4), 1–14 (2023)

2023

[19] [19]

In: European Conference on Computer Vision (2024) ReFELS 17

Khirodkar, R., Bagautdinov, T., Martinez, J., Zhaoen, S., James, A., Selednik, P., Anderson, S., Saito, S.: Sapiens: Foundation for human vision models. In: European Conference on Computer Vision (2024) ReFELS 17

2024

[20] [20]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, K., Gu, J., Wang, F., Wu, Z., Fan, H., Guo, D.: Ma-bench: Towards fine-grained micro-action understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20118–20128 (2026)

2026

[21] [21]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Li, K., Guo, D., Chen, G., Fan, C., Xu, J., Wu, Z., Fan, H., Wang, M.: Prototypical calibrating ambiguous samples for micro-action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4815–4823 (2025)

2025

[22] [22]

In: Proceedings of the 31st ACM International Conference on Multimedia

Li, K., Guo, D., Chen, G., Liu, F., Wang, M.: Data augmentation for human behavior analysis in multi-person conversations. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 9516–9520 (2023)

2023

[23] [23]

arXiv preprint arXiv:2307.10624 (2023)

Li, K., Guo, D., Chen, G., Peng, X., Wang, M.: Joint skeletal and semantic embedding loss for micro-gesture classification. arXiv preprint arXiv:2307.10624 (2023)

work page arXiv 2023

[24] [24]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Li, K., Guo, D., Li, X., Chen, H., Liu, P., Wang, F., Hu, J., Zhao, G., Wang, M.: Mac 2025: The 2nd micro-action analysis grand challenge. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 14216–14221 (2025)

2025

[25] [25]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Li, K., Liu, P., Guo, D., Wang, F., Wu, Z., Fan, H., Wang, M.: Mmad: Multi-label micro-action detection in videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13225–13236 (2025)

2025

[26] [26]

ACM Transactions on Graphics36(6), 1–17 (2017)

Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4d scans. ACM Transactions on Graphics36(6), 1–17 (2017)

2017

[27] [27]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth any- thing 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Lin, J., Gan, C., Han, S.: Tsm: Temporal shift module for efficient video under- standing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7083–7093 (2019)

2019

[29] [29]

In: Proceedings of the IEEE International Conference on Computer Vision

Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)

2017

[30] [30]

arXiv preprint arXiv:2507.09512 (2025)

Liu, P., Li, K., Wang, F., Wei, Y., She, J., Guo, D.: Online micro-gesture recog- nition using data augmentation and spatial-temporal attention. arXiv preprint arXiv:2507.09512 (2025)

work page arXiv 2025

[31] [31]

arXiv preprint arXiv:2407.04490 (2024)

Liu, P., Wang, F., Li, K., Chen, G., Wei, Y., Tang, S., Wu, Z., Guo, D.: Micro-gesture online recognition using learnable query points. arXiv preprint arXiv:2407.04490 (2024)

work page arXiv 2024

[32] [32]

Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

Liu, T., Li, K., Wang, F., Chen, J., Wu, Z., Gu, J., Liu, H., Guo, D.: Self-supervised learning matters: A simple ensemble solution for micro-gesture recognition. arXiv preprint arXiv:2606.09261 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[33] [33]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Liu, X., Shi, H., Chen, H., Yu, Z., Li, X., Zhao, G.: imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10631–10642 (2021)

2021

[34] [34]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3202–3211 (2022)

2022

[35] [35]

MediaPipe: A Framework for Building Perception Pipelines

Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M.G., Lee, J., et al.: Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1906

[36] [36]

In: European Conference on Computer Vision

Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., Ling, H.: Expanding language-image pretrained models for general video recognition. In: European Conference on Computer Vision. pp. 1–18 (2022) 18 X. Guo, J. Gu, H. Liu et al

2022

[37] [37]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[38] [38]

Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

Shen, X., Li, K., Wang, F., Qian, W., Jiang, J., Guo, D.: Spatial-temporal decoupled adapter for micro-gesture online recognition. arXiv preprint arXiv:2606.07355 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

DINOv3

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

In: European Conference on Computer Vision

Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: European Conference on Computer Vision. pp. 402–419. Springer (2020)

2020

[41] [41]

In: Advances in Neural Information Processing Systems (2017)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017)

2017

[42] [42]

iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning

Wang, C., Chen, H., Wei, H., Yang, Y., Chen, Y., Zhao, G.: imigue-3k: A large-scale benchmark for micro-gesture analysis with self-supervised learning. arXiv preprint arXiv:2605.17179 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[43] [43]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Wang, F., Guo, D., Li, K., Wang, M.: Eulermormer: Robust eulerian motion magnification via dynamic filtering within transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 5345–5353 (2024)

2024

[44] [44]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wang, F., Guo, D., Li, K., Zhong, Z., Wang, M.: Frequency decoupling for mo- tion magnification via multi-level isomorphic architecture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18984– 18994 (2024)

2024

[45] [45]

In: Companion Proceedings of the ACM on Web Conference 2025

Wang, F., Li, K., Nie, Y., Duan, Z., Zou, P., Wu, Z., Wang, Y., Wei, Y.: Exploiting ensemble learning for cross-view isolated sign language recognition. In: Companion Proceedings of the ACM on Web Conference 2025. pp. 2453–2457 (2025)

2025

[46] [46]

In: Proceedings of the ACM Web Conference 2026

Wang, F., Yang, J., Chen, J., Liu, Y., Li, K., Wei, Y., Guo, D., Wang, M.: Xinsight: Integrative stage-consistent psychological counseling support agents for digital well-being. In: Proceedings of the ACM Web Conference 2026. pp. 9297–9308 (2026)

2026

[47] [47]

IEEE Transactions on Multimedia (2026)

Wang, M., Liu, Z., Li, K., Wang, Y., Wang, Y., Wei, Y., Wang, F.: Task-generalized adaptive cross-domain learning for multimodal image fusion. IEEE Transactions on Multimedia (2026)

2026

[48] [48]

IEEE Transactions on Affective Computing pp

Wang, R., Li, K., Tong, A., Xu, J., Guo, D., Wang, M.: Gait emotion recognition via uncertainty-oriented class discriminative learning. IEEE Transactions on Affective Computing pp. 1–14 (2026)

2026

[49] [49]

Water Research284, 123994 (2025)

Wang, W., Wang, F., Wang, Z., Shi, X., Xu, C.: Robust s3former deep learning model for the direct diagnosis and prediction of natural organic matter (nom) from three-dimensional excitation-emission-matrix (3d-eem) data. Water Research284, 123994 (2025)

2025

[50] [50]

arXiv preprint arXiv:2602.08057 (2026)

Wang, Y., Liu, H., Xu, T., Shi, C., Xing, H.: Weak to strong: Vlm-based pseudo- labeling as a weakly supervised training strategy in multimodal video-based hidden emotion understanding tasks. arXiv preprint arXiv:2602.08057 (2026)

work page arXiv 2026

[51] [51]

In: Advances in Neural Information Processing Systems (2022)

Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose: Simple vision transformer baselines for human pose estimation. In: Advances in Neural Information Processing Systems (2022)

2022

[52] [52]

In: Proceedings of the AAAI Conference on Artificial Intelligence (2018)

Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018)

2018

[53] [53]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2023)

Yang, Z., Zeng, A., Yuan, C., Li, Y.: Effective whole-body pose estimation with two-stages distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2023)

2023