Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

Dan Guo; Fei Wang; Jin Jiang; Kun Li; Wei Qian; Xucheng Shen

arxiv: 2606.07355 · v1 · pith:H4T345HCnew · submitted 2026-06-05 · 💻 cs.CV

Xucheng Shen, Kun Li, Fei Wang, Wei Qian, Jin Jiang, Dan Guo

Pith reviewed 2026-06-27 22:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords micro-gesture recognition · online video recognition · spatial-temporal decoupling · parameter-efficient adapter · depthwise convolution · long-tail augmentation · video adaptation

The pith

Separating spatial and temporal branches in adapters improves recognition of subtle micro-gestures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to solve the problem of spotting and classifying very brief, low-amplitude gestures inside long untrimmed videos. It claims that ordinary single-branch adapters mix spatial appearance and temporal motion too tightly and therefore miss the fine details that define micro-gestures. The authors replace the single branch with two separate lightweight branches, one for time and one for space, each using depthwise convolutions. They also add a dynamic augmentation scheme that gives more help to rare or hard-to-learn gesture classes without any hand-set thresholds. The resulting system records the highest score on the official challenge benchmark.

Core claim

The Spatial-Temporal Decoupled Adapter decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, Adaptive Soft Balanced Augmentation dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.

What carries the argument

The Spatial-Temporal Decoupled Adapter, which splits adaptation into independent temporal and spatial branches that each apply lightweight depthwise convolutions to process their respective cues separately.
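
The two-branch idea can be illustrated with a minimal Python sketch. It assumes a feature tensor indexed as [channel][time][height][width], 3-tap temporal kernels, 3x3 spatial kernels, zero padding, and a residual sum as the merge step. None of these choices, nor the function names, come from the paper; they are illustrative assumptions only.

```python
def zeros_like(x):
    """Return a zero tensor with the same [C][T][H][W] shape as x."""
    return [[[[0.0 for _ in row] for row in frame] for frame in chan] for chan in x]


def temporal_branch(x, kt):
    """Depthwise 1D filter along time; kt[c] holds 3 taps for channel c.

    Correlation-style (no kernel flip), as is usual in deep-learning code.
    """
    out = zeros_like(x)
    for c, chan in enumerate(x):
        T = len(chan)
        for t in range(T):
            for dt in (-1, 0, 1):
                src = t + dt
                if 0 <= src < T:  # zero padding at clip boundaries
                    for h, row in enumerate(chan[src]):
                        for w, v in enumerate(row):
                            out[c][t][h][w] += kt[c][dt + 1] * v
    return out


def spatial_branch(x, ks):
    """Depthwise 2D filter inside each frame; ks[c] is a 3x3 kernel for channel c."""
    out = zeros_like(x)
    for c, chan in enumerate(x):
        for t, frame in enumerate(chan):
            H, W = len(frame), len(frame[0])
            for h in range(H):
                for w in range(W):
                    for dh in (-1, 0, 1):
                        for dw in (-1, 0, 1):
                            hh, ww = h + dh, w + dw
                            if 0 <= hh < H and 0 <= ww < W:  # zero padding
                                out[c][t][h][w] += ks[c][dh + 1][dw + 1] * frame[hh][ww]
    return out


def decoupled_adapter(x, kt, ks):
    """Add both branch outputs to the input (assumed residual merge rule)."""
    tb, sb = temporal_branch(x, kt), spatial_branch(x, ks)
    return [[[[x[c][t][h][w] + tb[c][t][h][w] + sb[c][t][h][w]
               for w in range(len(x[c][t][h]))]
              for h in range(len(x[c][t]))]
             for t in range(len(x[c]))]
            for c in range(len(x))]


# Toy input: one channel, three frames of 2x2 features.
x = [[[[1.0, 2.0], [3.0, 4.0]],
      [[5.0, 6.0], [7.0, 8.0]],
      [[9.0, 10.0], [11.0, 12.0]]]]
shift_kt = [[1.0, 0.0, 0.0]]                 # temporal kernel that copies the previous frame
zero_ks = [[[0.0] * 3 for _ in range(3)]]    # spatial branch switched off
y = decoupled_adapter(x, shift_kt, zero_ks)
```

With the spatial branch zeroed out and a temporal kernel that copies the previous frame, each output frame is the input frame plus its predecessor. The temporal branch therefore mixes information only across time at a fixed pixel, while the spatial branch mixes only within a frame.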

If this is right

Independent branches capture fine-grained spatiotemporal patterns that joint modeling misses.
Dynamic augmentation balances long-tail class distributions without requiring manual thresholds.
The method reaches an F1 score of 0.43808 on the official micro-gesture challenge test set.
The same decoupled structure can be plugged into other video backbones with only small added cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of space and time may help other fine-grained video tasks such as subtle action spotting or facial micro-expression recognition.
Because each branch uses only depthwise convolutions, the adapter adds very few parameters and could be attractive for on-device video models.
The dynamic augmentation rule depends only on observed class frequency and loss, so it could transfer to any imbalanced video dataset without task-specific tuning.
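
A threshold-free rule of this kind might look like the following Python sketch. The formula (an equal average of normalised rarity and normalised running loss), the function name, and the sample numbers are all hypothetical; the abstract does not state how Adaptive Soft Balanced Augmentation combines the two signals.

```python
def augmentation_intensity(class_counts, class_losses):
    """Map each class to an intensity in [0, 1] from rarity and difficulty.

    Hypothetical rule: the mean of normalised rarity and normalised loss.
    Assumes at least one class has a positive sample count.
    """
    max_count = max(class_counts.values())
    max_loss = max(class_losses.values())
    intensity = {}
    for name, count in class_counts.items():
        rarity = 1.0 - count / max_count  # 0 for the most frequent class
        difficulty = class_losses[name] / max_loss if max_loss > 0 else 0.0
        intensity[name] = 0.5 * (rarity + difficulty)
    return intensity


counts = {"head_class": 100, "tail_class": 10}    # made-up sample counts
losses = {"head_class": 0.2, "tail_class": 0.8}   # made-up running losses
weights = augmentation_intensity(counts, losses)
```

The output varies continuously with frequency and loss, so no class is switched between "augment" and "do not augment" by a fixed cut-off. In this toy case the rare, high-loss class receives a much larger intensity than the frequent, low-loss one.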

Load-bearing premise

Single-branch adapters lose fine-grained micro-gesture patterns because they model spatial and temporal cues jointly, while independent branches can capture those patterns without any cross-branch interaction.

What would settle it

An ablation experiment on the same benchmark dataset that merges the two branches into one while keeping total parameters fixed and shows whether the F1 score falls below 0.438.
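
Setting up such a parameter-matched ablation starts with counting weights. The sketch below counts depthwise kernel weights only (no biases or projection layers) and assumes kernel sizes of 3 in time and 3x3 in space, with a joint 3D depthwise kernel standing in for the single-branch baseline. These sizes, the 384-channel width, and the choice of baseline are assumptions, not details taken from the paper.

```python
def decoupled_params(channels, kt=3, ks=3):
    """Weights in one temporal (kt) plus one spatial (ks x ks) depthwise kernel per channel."""
    return channels * kt + channels * ks * ks


def joint_params(channels, kt=3, ks=3):
    """Weights in one joint depthwise 3D kernel (kt x ks x ks) per channel."""
    return channels * kt * ks * ks


channels = 384  # hypothetical adapter width
decoupled = decoupled_params(channels)
joint = joint_params(channels)
```

Under these assumptions the decoupled design has fewer kernel weights than the joint kernel, so a fair comparison would have to shrink or widen one side until the totals match.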

Original abstract

Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal representations remains highly challenging. Existing parameter-efficient adapters typically employ a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, we propose a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to address the long-tail distribution problem in the benchmark dataset, we introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The decoupled adapter and adaptive augmentation win the challenge with 0.438 F1, but the abstract gives almost no evidence that the split is what matters.

The paper's concrete result is first place in Track 2 of the EI-MiGA challenge, achieved with a spatial-temporal decoupled adapter and an adaptive augmentation rule. The adapter splits adaptation into separate depthwise-convolution branches for space and time instead of a single joint branch. The augmentation scales its intensity by class rarity and by per-class learning difficulty measured on the fly, without preset thresholds.

That combination is new for this narrow task. Micro-gesture online recognition has tight constraints on duration and amplitude, so the explicit separation makes sense as a targeted fix. The challenge win is a real external check on the full pipeline.

The main weakness is the missing support for the central design choice. The abstract never shows an ablation that isolates the decoupled branches from the augmentation, nor does it compare against a matched single-branch adapter. It also says nothing about cross-branch interaction, which the stress-test note flags. If micro-gestures need coordinated spatiotemporal signals, independent branches could be leaving performance on the table and the score could be driven mostly by the balancing rule. Without those checks the performance claim is hard to attribute.

The work is aimed at teams already competing on micro-gesture benchmarks or building adapters for fine-grained video tasks. A reader outside that niche will not find general lessons on adapter design or new theory.

It deserves peer review. The benchmark result is falsifiable and the method is described clearly enough for referees to request the missing experiments and decide whether the decoupling adds value.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Spatial-Temporal Decoupled Adapter (STDA) that decomposes video adaptation for micro-gesture online recognition into independent temporal and spatial branches using lightweight depthwise convolutions, addressing limitations of single-branch adapters that jointly model cues. It additionally introduces Adaptive Soft Balanced Augmentation (ASBA) that dynamically allocates augmentation intensity based on class rarity and learning difficulty without manual thresholds. The method reports an F1 score of 0.43808 and ranks first in Track 2 of the 4th EI-MiGA-IJCAI Challenge.

Significance. If the empirical result holds, the top challenge ranking provides concrete evidence that parameter-efficient adaptation can succeed on subtle, short-duration gestures in untrimmed videos. The decoupling approach and threshold-free augmentation strategy offer practical advances for long-tail video tasks. The work reports a falsifiable benchmark result on an external challenge dataset, which makes its central claim easier to assess.

major comments (2)

[Abstract] The central claim attributes improved fine-grained pattern capture to the independent spatial/temporal branches, yet provides no ablation studies, single-branch baselines, or error analysis to isolate this contribution from ASBA; the 0.43808 F1 could therefore stem primarily from the augmentation rule rather than the adapter design.
[Abstract] The adapter is described as using 'independent temporal and spatial branches via lightweight depthwise convolutions' with no mention of cross-branch fusion or interaction; this design assumes micro-gesture cues (short duration, low amplitude, ambiguous) do not require explicit spatiotemporal coordination, but the manuscript offers no test or justification for this assumption.

minor comments (1)

[Abstract] The high-level method description would benefit from one sentence summarizing the backbone network or input resolution used in the challenge submission.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each major point below and will revise the manuscript to strengthen the claims with additional analysis where feasible.

Referee: [Abstract] The central claim attributes improved fine-grained pattern capture to the independent spatial/temporal branches, yet provides no ablation studies, single-branch baselines, or error analysis to isolate this contribution from ASBA; the 0.43808 F1 could therefore stem primarily from the augmentation rule rather than the adapter design.

Authors: We agree that the current version does not include ablations or error analysis to disentangle the contributions of the decoupled adapter from ASBA. The reported F1 score reflects the full pipeline on the challenge test set. In revision we will add single-branch adapter baselines (with and without ASBA) and per-class error breakdowns to isolate the effect of the spatial-temporal decoupling.
Revision: yes

Referee: [Abstract] The adapter is described as using 'independent temporal and spatial branches via lightweight depthwise convolutions' with no mention of cross-branch fusion or interaction; this design assumes micro-gesture cues (short duration, low amplitude, ambiguous) do not require explicit spatiotemporal coordination, but the manuscript offers no test or justification for this assumption.

Authors: The design choice of fully independent branches is driven by the goal of parameter-efficient adaptation and the hypothesis that depthwise convolutions suffice to capture the subtle, short-duration cues separately. The manuscript does not provide an explicit test or justification for omitting cross-branch fusion. We will expand the method section with a rationale grounded in micro-gesture properties and, if compute permits, include a lightweight fusion variant for comparison.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical design evaluated on external challenge data

The paper introduces a Spatial-Temporal Decoupled Adapter (decomposing adaptation via depthwise convolutions) and Adaptive Soft Balanced Augmentation (dynamic intensity based on rarity and difficulty) as design choices. The reported F1 of 0.43808 is an empirical outcome on the EI-MiGA-IJCAI challenge dataset, not a quantity derived by construction from fitted parameters or self-citations. No equations reduce predictions to inputs, no uniqueness theorems are imported from prior self-work, and no ansatz is smuggled via citation. The derivation chain consists of motivated architectural decisions tested externally, remaining self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard deep-learning assumptions about the utility of depthwise convolutions for feature separation and on the existence of a long-tail distribution in the benchmark that the augmentation can address; no new physical entities or heavily fitted constants are introduced.

axioms (2)

Domain assumption: Depthwise convolutions suffice to model spatial and temporal features independently without loss of necessary cross-dimensional interactions for micro-gesture discrimination.
Invoked to justify the decoupled branches as a solution to the single-branch limitation.

Domain assumption: Learning difficulty can be reliably estimated during training to guide augmentation intensity without introducing bias.
Underlies the Adaptive Soft Balanced Augmentation rule.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Motion Reinforces Appearance: RGB-Skeleton Gated Residual Fusion for Micro-Gesture Online Recognition
cs.CV · 2026-06 · unverdicted · novelty 4.0

DyFADet+ extends a prior detector with gated RGB-skeleton fusion and reports 40.88 F1 on the SMG dataset for micro-gesture online recognition.

Rethinking the Role of Feature Engineering and Learning Strategies in Few-Shot Hidden Emotion Recognition
cs.CV · 2026-06 · unverdicted · novelty 3.0

A competition-winning multi-modal model for hidden emotion recognition integrates static and dynamic pose features via cross-attention and MIL pooling while noting representation collapse in vision foundation models o...


[1] [1]

In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision

Agrawal, T., Ali, A., Dantcheva, A., Bremond, F.: Scaling action detection: Adatad++ with transformer-enhanced temporal-spatial adaptation. In: Proceed- ings of the IEEE/CVF International Conference on Computer Vision. pp. 12222– 12231 (2025) 14 X. Shen et al

2025

[2] [2]

In: The European Conference on Computer Vision (Septem- ber 2018)

Alwassel,H.,CabaHeilbron,F.,Escorcia,V.,Ghanem,B.:Diagnosingerrorintem- poral action detectors. In: The European Conference on Computer Vision (Septem- ber 2018)

2018

[3] [3]

arXiv preprint arXiv:2408.03097 (2024)

Chen, G., Wang, F., Li, K., Wu, Z., Fan, H., Yang, Y., Wang, M., Guo, D.: Pro- totype learning for micro-gesture classification. arXiv preprint arXiv:2408.03097 (2024)

arXiv 2024

[4] [4]

In: 2019 14th IEEE International Conference on Automatic Face & Ges- ture Recognition (FG 2019)

Chen, H., Liu, X., Li, X., Shi, H., Zhao, G.: Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning. In: 2019 14th IEEE International Conference on Automatic Face & Ges- ture Recognition (FG 2019). pp. 1–8. IEEE (2019)

2019

[5] [5]

Chen, H., Schuller, B.W., Adeli, E., Zhao, G.: The 3rd challenge on human behav- ior analysis for emotion understanding (miga) 2025: From recognition to emotion understanding (2025)

2025

[6] [6]

International Journal of Computer Vision131(6), 1346–1366 (2023)

Chen, H., Shi, H., Liu, X., Li, X., Zhao, G.: Smg: A micro-gesture dataset to- wards spontaneous body gestures for emotional stress state analysis. International Journal of Computer Vision131(6), 1346–1366 (2023)

2023

[7] [7]

Advances in Neural Information Processing Systems35, 16664–16678 (2022)

Chen, S., Ge, C., Tong, Z., Wang, J., Song, Y., Wang, J., Luo, P.: Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems35, 16664–16678 (2022)

2022

[8] [8]

In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition

Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 9268–9277 (2019)

2019

[9] [9]

IEEE Transactions on Affective Computing (2026)

Gao, R., Liu, X., Xing, B., Yu, Z., Schuller, B.W., Kälviäinen, H.: Identity-free ar- tificial emotional intelligence via micro-gesture understanding. IEEE Transactions on Affective Computing (2026)

2026

[10] [10]

In: Proceedings of the 33rd ACM International Conference on Multimedia

Gu, J., Li, K., Wang, F., Wei, Y., Wu, Z., Fan, H., Wang, M.: Motion matters: Motion-guided modulation network for skeleton-based micro-action recognition. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 5461–5470 (2025)

2025

[11] [11]

arXiv preprint arXiv:2507.08344 (2025)

Gu, J., Wang, F., Li, K., Wei, Y., Wu, Z., Guo, D.: Mm-gesture: towards precise micro-gesture recognition through multimodal fusion. arXiv preprint arXiv:2507.08344 (2025)

arXiv 2025

[12] [12]

IEEE Transactions on Circuits and Systems for Video Technology34(7), 6238–6252 (2024)

Guo, D., Li, K., Hu, B., Zhang, Y., Wang, M.: Benchmarking micro-action recog- nition: Dataset, methods, and applications. IEEE Transactions on Circuits and Systems for Video Technology34(7), 6238–6252 (2024)

2024

[13] Guo, D., Li, X., Li, K., Chen, H., Hu, J., Zhao, G., Yang, Y., Wang, M.: Mac 2024: Micro-action analysis grand challenge. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 11304–11305 (2024)

[14] Guo, X., Peng, W., Huang, H., Xia, Z.: Micro-gesture online recognition with graph-convolution and multiscale transformers for long sequence. In: MiGA@ IJCAI (2023)

[15] Hao, Y., Wang, S., Cao, P., Gao, X., Xu, T., Wu, J., He, X.: Attention in attention: Modeling context correlation for efficient video classification. IEEE Transactions on Circuits and Systems for Video Technology 32(10), 7120–7132 (2022)

[16] Hao, Y., Zhang, H., Ngo, C.W., He, X.: Group contextualization for video recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 928–938 (2022)

[17] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: International Conference on Machine Learning (2019)

[18] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

[19] Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.N.: Visual prompt tuning. In: European Conference on Computer Vision (2022)

[20] Li, K., Gu, J., Wang, F., Wu, Z., Fan, H., Guo, D.: Ma-bench: Towards fine-grained micro-action understanding. arXiv preprint arXiv:2603.26586 (2026)

[21] Li, K., Guo, D., Chen, G., Fan, C., Xu, J., Wu, Z., Fan, H., Wang, M.: Prototypical calibrating ambiguous samples for micro-action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4815–4823 (2025)

[22] Li, K., Guo, D., Chen, G., Peng, X., Wang, M.: Joint skeletal and semantic embedding loss for micro-gesture classification. arXiv preprint arXiv:2307.10624 (2023)

[23] Li, K., Guo, D., Li, X., Chen, H., Liu, P., Wang, F., Hu, J., Zhao, G., Wang, M.: Mac 2025: The 2nd micro-action analysis grand challenge. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 14216–14221 (2025)

[24] Li, K., Liu, P., Guo, D., Wang, F., Wu, Z., Fan, H., Wang, M.: Mmad: Multi-label micro-action detection in videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13225–13236 (2025)

[25] Li, X.L., Liang, P.: Prefix-tuning: Optimizing continuous prompts for generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (2021)

[26] Lin, T., Liu, X., Li, X., Ding, E., Wen, S.: Bmn: Boundary-matching network for temporal action proposal generation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3889–3898 (2019)

[27] Liu, P., Dong, G., Guo, D., Li, K., Li, F., Yang, X., Wang, M., Ying, X.: A survey on fMRI-based brain decoding for reconstructing multimodal stimuli. arXiv preprint arXiv:2503.15978 (2025)

[28] Liu, P., Li, K., Wang, F., Wei, Y., She, J., Guo, D.: Online micro-gesture recognition using data augmentation and spatial-temporal attention. arXiv preprint arXiv:2507.09512 (2025)

[29] Liu, P., Wang, F., Li, K., Chen, G., Wei, Y., Tang, S., Wu, Z., Guo, D.: Micro-gesture online recognition using learnable query points. arXiv preprint arXiv:2407.04490 (2024)

[30] Liu, S., Zhang, C.L., Zhao, C., Ghanem, B.: End-to-end temporal action detection with 1b parameters across 1000 frames. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18591–18601 (2024)

[31] Liu, S., Zhao, X., Su, H., Hu, Z.: Tsi: Temporal scale invariant network for action proposal generation. In: Proceedings of the Asian Conference on Computer Vision (2020)

[32] Liu, X., Shi, H., Chen, H., Yu, Z., Li, X., Zhao, G.: imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10631–10642 (2021)

[33] Meng, C., Ma, F., Zhang, C., Miao, J., Yang, Y., Zhuang, Y.: Online micro-gesture recognition in long videos via spatiotemporal feature encoding and query-based temporal detection. MiGA@ IJCAI (2025)

[34] Pan, J., Lin, Z., Zhu, X., Shao, J., Li, H.: St-adapter: Parameter-efficient image-to-video transfer learning. Advances in Neural Information Processing Systems 35, 26462–26477 (2022)

[35] Qian, W., Guo, D., Li, K., Zhang, X., Tian, X., Yang, X., Wang, M.: Dual-path tokenlearner for remote photoplethysmography-based physiological measurement with facial videos. IEEE Transactions on Computational Social Systems (2024)

[36] Qian, W., Guo, D., Zhou, J., Zou, B., Yu, Z., Wang, M.: Freqphys: Repurposing implicit physiological frequency prior for robust remote photoplethysmography. arXiv preprint arXiv:2604.00534 (2026)

[37] Qian, W., Li, K., Guo, D., Hu, B., Wang, M.: Cluster-phys: Facial clues clustering towards efficient remote physiological measurement. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 330–339 (2024)

[38] Qian, W., Su, G., Guo, D., Zhou, J., Li, X., Hu, B., Tang, S., Wang, M.: Physdiff: Physiology-based dynamicity disentangled diffusion model for remote physiological measurement. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (2025)

[39] Shang, T., Hao, Y., Pei, M., Li, K., Ben, H., Wang, S.: Cross-modal feature enhancement and contrastive alignment for micro-gesture recognition. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 203–217. Springer (2025)

[40] Shi, D., Zhong, Y., Cao, Q., Ma, L., Li, J., Tao, D.: Tridet: Temporal action detection with relative boundary modeling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18857–18866 (2023)

[41] Tan, J., Zhao, X., Shi, X., Kang, B., Wang, L.: Pointtad: Multi-label temporal action detection with learnable query points. Advances in Neural Information Processing Systems 35, 15268–15280 (2022)

[42] Tang, T.N., Kim, K., Sohn, K.: Temporalmaxer: Maximize temporal context with only max pooling for temporal action localization. arXiv preprint arXiv:2303.09055 (2023)

[43] Wang, C., Chen, H., Wei, H., Yang, Y., Chen, Y., Zhao, G.: imigue-3k: A large-scale benchmark for micro-gesture analysis with self-supervised learning. arXiv preprint arXiv:2605.17179 (2026)

[44] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: Videomae v2: Scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14549–14560 (2023)

[45] Wang, R., Li, K., Tong, A., Xu, J., Guo, D., Wang, M.: Gait emotion recognition via uncertainty-oriented class discriminative learning. IEEE Transactions on Affective Computing pp. 1–14 (2026)

[46] Wang, Y., Kerui, L., Huang, H., Xia, Z.: Micro-gesture online recognition with dual-stream multi-scale transformer in long videos. MiGA@ IJCAI (2024)

[47] Xia, Z., Huang, H., Chen, H., Feng, X., Zhao, G.: Hybrid-supervised hypergraph-enhanced transformer for micro-gesture based emotion recognition. IEEE Transactions on Affective Computing (2025)

[48] Yang, L., Zheng, Z., Han, Y., Cheng, H., Song, S., Huang, G., Li, F.: Dyfadet: Dynamic feature aggregation for temporal action detection. In: European conference on computer vision. pp. 305–322. Springer (2024)

[49] Yang, T., Zhu, Y., Xie, Y., Zhang, A., Chen, C., Li, M.: Aim: Adapting image models for efficient video action recognition. arXiv preprint arXiv:2302.03024 (2023)

[50] Zhang, C.L., Wu, J., Li, Y.: Actionformer: Localizing moments of actions with transformers. In: European Conference on Computer Vision. pp. 492–510. Springer (2022)

[51] Zhao, C., Liu, S., Mangalam, K., Ghanem, B.: Re2tal: Rewiring pretrained video backbones for reversible temporal action localization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10637–10647 (2023)

[52] Zhao, C., Thabet, A.K., Ghanem, B.: Video self-stitching graph network for temporal action localization. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 13658–13667 (2021)