UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

Guanglu Song; Hongsheng Li; Kunchang Li; Peng Gao; Yali Wang; Yu Liu; Yu Qiao

arxiv: 2201.04676 · v3 · pith:KSOGMIUNnew · submitted 2022-01-12 · 💻 cs.CV

Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao

classification 💻 cs.CV

keywords local · uniformer · dependency · redundancy · spatiotemporal · accuracy · achieves · global

Learning rich, multi-scale spatiotemporal semantics from high-dimensional videos is challenging because of the large local redundancy and complex global dependency between video frames. Recent advances in this area have been driven mainly by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context from a small 3D neighborhood to suppress local redundancy, its limited receptive field prevents it from capturing global dependency. Vision transformers, in contrast, can effectively capture long-range dependency through the self-attention mechanism, but they are limited in reducing local redundancy because they blindly compare similarity among all the tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of 3D convolution and spatiotemporal self-attention in a concise transformer format and achieves a preferable balance between computation and accuracy. Unlike traditional transformers, our relation aggregator can tackle both spatiotemporal redundancy and dependency by learning local and global token affinity in shallow and deep layers, respectively. We conduct extensive experiments on popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600 while requiring 10x fewer GFLOPs than other state-of-the-art methods. On Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performance of 60.9% and 71.2% top-1 accuracy, respectively. Code is available at https://github.com/Sense-X/UniFormer.
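
The contrast the abstract draws between local and global token affinity can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a 1D token sequence instead of a 3D video volume, fixed offset weights standing in for learnable ones, and no query/key/value projections, normalization, or feed-forward layers. The function names `local_aggregate`, `global_aggregate`, and `shallow_then_deep` are hypothetical.

```python
import math


def local_aggregate(tokens, offset_weights):
    """Local affinity (assumed form): each token mixes only its neighbours,
    weighted by relative offset, much like a small convolution."""
    radius = len(offset_weights) // 2
    dim = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        mixed = [0.0] * dim
        for k, w in enumerate(offset_weights):
            j = i + k - radius
            if 0 <= j < len(tokens):  # positions outside the sequence contribute nothing
                for d in range(dim):
                    mixed[d] += w * tokens[j][d]
        out.append(mixed)
    return out


def global_aggregate(tokens):
    """Global affinity (assumed form): softmax over scaled dot-product
    similarity between every pair of tokens, as in self-attention."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        affinity = [e / total for e in exps]
        out.append([
            sum(affinity[j] * tokens[j][d] for j in range(len(tokens)))
            for d in range(dim)
        ])
    return out


def shallow_then_deep(tokens, offset_weights, n_local=2, n_global=2):
    """Apply local aggregation in the shallow layers, global in the deep ones."""
    for _ in range(n_local):
        tokens = local_aggregate(tokens, offset_weights)
    for _ in range(n_global):
        tokens = global_aggregate(tokens)
    return tokens


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(shallow_then_deep(frames, [0.25, 0.5, 0.25]))
```

In this sketch the local step touches at most `len(offset_weights)` neighbours per token, while the global step compares every token with every other token, which is why restricting the global step to deeper layers can save computation.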

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos
cs.CV 2026-04 unverdicted novelty 7.0
V-Nutri fuses final-dish features with cooking-process keyframes from egocentric videos to improve dish-level calorie and macronutrient estimation over single-image baselines.

EAST: Early Action Prediction Sampling Strategy with Token Masking
cs.CV 2026-04 unverdicted novelty 6.0
EAST uses randomized time-step sampling and token masking to train a single encoder-only model that generalizes across all observation ratios in early action prediction and reports new state-of-the-art accuracy on NTU...

Revisiting Feature Prediction for Learning Visual Representations from Video
cs.CV 2024-02 conditional novelty 6.0
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

Sea-Scan: High-Accuracy, ML-based Dark Vessel Detection and Localisation via Weakly Supervised DAS Monitoring
cs.SD 2026-06 unverdicted novelty 5.0
ML-based dark vessel detection system using weakly supervised learning on DAS data achieves 97.8% detection rate at 1.98% false-trigger rate.

PestVL-Net: Enabling Multimodal Pest Learning via Fine-grained Vision-Language Interaction
cs.CV 2026-04 unverdicted novelty 5.0
PestVL-Net combines an RWKV visual backbone with saliency-guided window partitioning and MLLM-derived linguistic priors via multimodal chain-of-thought to enable fine-grained multimodal pest recognition on dedicated datasets.

Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer
cs.CV 2026-04 unverdicted novelty 5.0
The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.

EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges
cs.CV 2026-04 unverdicted novelty 4.0
EV-CLIP introduces mask and context visual prompts to adapt CLIP for improved few-shot video action recognition under visual challenges such as low light and egocentric views, outperforming other efficient methods wit...

ASGNet: Adaptive Spectrum Guidance Network for Automatic Polyp Segmentation
cs.CV 2026-04 unverdicted novelty 4.0
ASGNet combines a spectrum-guided non-local perception module, multi-source semantic extractor, and dense cross-layer decoder to outperform 21 prior methods on five polyp segmentation benchmarks.