Spatio-Temporal Channel Correlation Networks for Action Classification

Ali Diba; Juergen Gall; Luc Van Gool; M.Mahdi Arzani; Mohsen Fayyaz; Rahman Yousefzadeh; Vivek Sharma

arxiv: 1806.07754 · v3 · pith:PZ4K57KXnew · submitted 2018-06-19 · 💻 cs.CV

Ali Diba, Mohsen Fayyaz, Vivek Sharma, M.Mahdi Arzani, Rahman Yousefzadeh, Juergen Gall, Luc Van Gool

keywords: cnns, block, spatio-temporal, datasets, networks, performance, state-of-the-art, training

Abstract

The work in this paper is driven by the question of whether spatio-temporal correlations are enough for 3D convolutional neural networks (CNNs). Most traditional 3D networks use local spatio-temporal features. We introduce a new block that models correlations between the channels of a 3D CNN with respect to temporal and spatial features. This new block can be added as a residual unit to different parts of 3D CNNs. We name our novel block 'Spatio-Temporal Channel Correlation' (STC). By embedding this block into current state-of-the-art architectures such as ResNeXt and ResNet, we improve performance by 2-3% on the Kinetics dataset. Our experiments show that adding STC blocks to these architectures outperforms state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. A second issue with 3D CNNs is that they are trained from scratch and need a huge labeled dataset to reach reasonable performance, so the knowledge learned by 2D CNNs is completely ignored. Another contribution of this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. Thus, by fine-tuning this network, we beat the performance of generic and recent 3D CNN methods that were trained on large video datasets, e.g. Sports-1M, and fine-tuned on the target datasets, e.g. HMDB51/UCF101.
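The abstract does not spell out the internals of the STC block, so the following is only a rough sketch of the general idea it describes: channel-wise gating computed from pooled spatio-temporal and temporal statistics, added back to the input as a residual. It assumes a squeeze-and-excitation-style bottleneck, which may differ from the authors' actual design. The names `stc_like_block`, `channel_gate`, `init_params`, and the `reduction` parameter are hypothetical and not taken from the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_gate(desc, w_down, w_up):
    """Bottleneck MLP mapping channel descriptors (..., C) to gates in (0, 1)."""
    hidden = np.maximum(desc @ w_down, 0.0)  # ReLU after the reduction layer
    return sigmoid(hidden @ w_up)


def init_params(channels, reduction=4, seed=0):
    """Random weights for both branches (illustrative initialization only)."""
    rng = np.random.default_rng(seed)
    hidden = max(channels // reduction, 1)

    def mat(rows, cols):
        return rng.normal(0.0, 0.1, size=(rows, cols))

    return {
        "st_down": mat(channels, hidden), "st_up": mat(hidden, channels),
        "t_down": mat(channels, hidden), "t_up": mat(hidden, channels),
    }


def stc_like_block(x, params):
    """Assumed layout, not the paper's: x has shape (C, T, H, W)."""
    # Spatio-temporal branch: one descriptor per channel, pooled over T, H, W.
    st_desc = x.mean(axis=(1, 2, 3))                                   # (C,)
    st_gate = channel_gate(st_desc, params["st_down"], params["st_up"])
    st_out = x * st_gate[:, None, None, None]

    # Temporal branch: pool over H, W only, so each frame keeps its own gate.
    t_desc = x.mean(axis=(2, 3)).T                                     # (T, C)
    t_gate = channel_gate(t_desc, params["t_down"], params["t_up"])    # (T, C)
    t_out = x * t_gate.T[:, :, None, None]

    # Residual connection: the block refines the input rather than replacing it.
    return x + st_out + t_out


if __name__ == "__main__":
    clip = np.random.default_rng(1).normal(size=(8, 4, 5, 5))  # C=8, T=4, H=W=5
    out = stc_like_block(clip, init_params(channels=8))
    print(out.shape)
```

Because the output has the same shape as the input, a block of this kind could in principle be dropped between existing layers of a 3D ResNet-style network, which matches the abstract's description of STC as a residual unit.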

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Video Action Recognition Via Neural Architecture Searching
cs.CV · 2019-07 · unverdicted · novelty 6.0

Uses differentiable NAS with temporal segments and pseudo-3D operators to discover a video action recognition network that outperforms hand-designed models on UCF101 with ~1% of the parameters when trained from scratch.