Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

2026 · cs.CV · arXiv 2606.07355

1 Pith paper cites this work. Polarity classification is still indexing.



abstract

Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal representations remains highly challenging. Existing parameter-efficient adapters typically employ a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, we propose a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to address the long-tail distribution problem in the benchmark dataset, we introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.
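The abstract does not give the formula behind Adaptive Soft Balanced Augmentation, so the following is only a minimal sketch of the general idea: a continuous per-class augmentation intensity that grows with class rarity and learning difficulty, with no hard threshold. Everything here is an assumption made for illustration and not the authors' method: the function name `augmentation_intensity`, the log-scaled rarity term, the use of `1 - score` as difficulty, the equal weighting of the two terms, and the intensity bounds.

```python
import math


def augmentation_intensity(class_counts, class_scores,
                           min_intensity=0.1, max_intensity=1.0):
    """Return a soft augmentation intensity per class (hypothetical sketch).

    class_counts: dict mapping class name -> number of training samples (> 0)
    class_scores: dict mapping class name -> current score in [0, 1],
                  e.g. a per-class validation F1 (assumed difficulty proxy)
    """
    log_max = math.log(max(class_counts.values()))
    log_min = math.log(min(class_counts.values()))
    span = log_max - log_min

    intensities = {}
    for name, count in class_counts.items():
        # Rarity on a log scale: 0 for the most frequent class, 1 for the rarest.
        rarity = (log_max - math.log(count)) / span if span > 0 else 0.0
        # Difficulty: classes the model currently scores poorly on count as harder.
        difficulty = 1.0 - class_scores[name]
        # Assumed equal weighting of the two signals; no threshold is involved.
        weight = 0.5 * (rarity + difficulty)
        intensities[name] = min_intensity + (max_intensity - min_intensity) * weight
    return intensities


# Hypothetical class names and statistics, for illustration only.
counts = {"frequent_gesture": 1000, "rare_gesture": 10}
scores = {"frequent_gesture": 0.9, "rare_gesture": 0.2}
result = augmentation_intensity(counts, scores)
```

In this sketch the rare, poorly learned class receives a much higher intensity than the frequent, well-learned one, and recomputing the scores each epoch would make the allocation dynamic.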

representative citing papers

Rethinking the Role of Feature Engineering and Learning Strategies in Few-Shot Hidden Emotion Recognition

cs.CV · 2026-06-30 · unverdicted · novelty 3.0

A competition-winning multi-modal model for hidden emotion recognition integrates static and dynamic pose features via cross-attention and MIL pooling while noting representation collapse in vision foundation models on micro-dynamic tasks.

citing papers explorer

Showing 1 of 1 citing paper.

Rethinking the Role of Feature Engineering and Learning Strategies in Few-Shot Hidden Emotion Recognition
cs.CV · 2026-06-30 · unverdicted · none · ref 38 · internal anchor
