
arxiv: 1711.11248 · v3 · pith:DD2UQCW3new · submitted 2017-11-30 · 💻 cs.CV

A Closer Look at Spatiotemporal Convolutions for Action Recognition

classification 💻 cs.CV
keywords cnns, action, recognition, spatiotemporal, accuracy, advantages, convolutional, convolutions

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant advantages in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block, "R(2+1)D", which gives rise to CNNs that achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101 and HMDB51.
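
The factorization mentioned in the abstract can be illustrated with a small weight-count sketch. The abstract does not give the layer details, so everything below is an assumption made for illustration: a full 3D filter of size t x d x d is replaced by a 1 x d x d spatial convolution into M intermediate channels followed by a t x 1 x 1 temporal convolution, and M is chosen so that the two designs have roughly the same number of weights. The function names are hypothetical, and the rule for M is the parameter-matching choice as recalled from the full paper, not something stated on this page.

```python
def conv3d_params(n_in, n_out, t, d):
    """Weights in one full 3D convolution with t x d x d kernels (bias ignored)."""
    return n_in * n_out * t * d * d


def matched_mid_channels(n_in, n_out, t, d):
    """Intermediate channel count M that keeps the factorized block close to
    the full 3D layer in weight count (assumed parameter-matching rule)."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def factorized_params(n_in, n_out, mid, t, d):
    """Weights in a spatial 1 x d x d conv (n_in -> mid) followed by a
    temporal t x 1 x 1 conv (mid -> n_out), bias ignored."""
    spatial = n_in * mid * d * d
    temporal = mid * n_out * t
    return spatial + temporal


if __name__ == "__main__":
    n_in, n_out, t, d = 64, 64, 3, 3
    mid = matched_mid_channels(n_in, n_out, t, d)
    print("full 3D weights:      ", conv3d_params(n_in, n_out, t, d))
    print("intermediate channels:", mid)
    print("(2+1)D weights:       ", factorized_params(n_in, n_out, mid, t, d))
```

For 64 input and output channels with 3 x 3 x 3 kernels, this gives 144 intermediate channels and the same weight count (110592) for both designs. Under these assumptions, any accuracy difference between the two would come from the factorized structure, for example the extra nonlinearity that can sit between the spatial and temporal convolutions, rather than from added capacity.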



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GPROF-IR: An Improved Single-Channel Infrared Precipitation Retrieval for Merged Satellite Precipitation Products

    physics.ao-ph 2026-05 unverdicted novelty 7.0

    GPROF-IR is a CNN-based retrieval that uses temporal context in geostationary IR observations to produce precipitation estimates with lower error than prior IR methods and climatological consistency with PMW retrieval...

  2. Decoding Alignment without Encoding Alignment: A critique of similarity analysis in neuroscience

    q-bio.NC 2026-05 unverdicted novelty 6.0

    Decoding alignment metrics can remain high and unchanged even when encoding manifold topology is causally altered, so they do not imply similar function or computation across neural populations.