Two-Stream Convolutional Networks for Action Recognition in Videos

Andrew Zisserman; Karen Simonyan

arxiv: 1406.2199 · v2 · pith:M3ZEIEXKnew · submitted 2014-06-09 · 💻 cs.CV

classification 💻 cs.CV

keywords action · networks · trained · video · architecture · classification · convnet · convolutional

We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
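As a rough illustration of the two ideas named in the abstract, the sketch below stacks per-frame dense optical flow fields into a single multi-channel input for a temporal network and then combines the class scores of the two streams. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `stack_flow` and `fuse_scores`, the interleaved channel order, and the weighted averaging of softmax scores are all illustrative choices, and the abstract itself does not say how the two streams are fused.

```python
import numpy as np


def stack_flow(flows):
    """Stack L flow fields of shape (H, W, 2) into one (H, W, 2L) input.

    Assumption: horizontal and vertical components are interleaved
    along the channel axis (dx1, dy1, dx2, dy2, ...).
    """
    return np.concatenate(flows, axis=-1)


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_scores(spatial_logits, temporal_logits, w_temporal=0.5):
    """Late fusion by weighted averaging of per-stream class probabilities.

    Assumption: w_temporal is a hypothetical mixing weight; 0.5 gives a
    plain average of the two streams.
    """
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return (1.0 - w_temporal) * p_spatial + w_temporal * p_temporal


# Ten flow fields with two components each give a 20-channel input.
flows = [np.zeros((4, 4, 2)) for _ in range(10)]
temporal_input = stack_flow(flows)

# Placeholder logits standing in for the outputs of the two networks.
fused = fuse_scores(np.array([2.0, 0.0, 0.0]), np.array([0.0, 3.0, 0.0]))
predicted_class = int(fused.argmax())
```

Averaging probabilities rather than raw logits keeps the two streams on a comparable scale, which matters when the networks are trained separately.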

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis
cs.CV · 2026-04 · unverdicted · novelty 4.0

DualStreamHybrid assigns ViT-Tiny to the RGB stream and MobileNetV2 to a 20-channel flow stream, projects both feature sets into a common space, and finds that cross-attention fusion performs best on UCF11 (98.12%) while weighted fusion is the most consistent on UCF50 (96.86%).