NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification

Jianping Fan; Jing Xiao; Rongcheng Lin

arxiv: 1811.05014 · v1 · pith:NAAFLBIGnew · submitted 2018-11-12 · 💻 cs.CV



keywords nextvlad · efficient · video · aggregate · classification · feature · features · frame-level


This paper introduces a fast and efficient network architecture, NeXtVLAD, to aggregate frame-level features into a compact feature vector for large-scale video classification. The basic idea is to decompose a high-dimensional feature into a group of relatively low-dimensional vectors with attention before applying NetVLAD aggregation over time. This NeXtVLAD approach turns out to be both effective and parameter-efficient in aggregating temporal information. In the 2nd YouTube-8M video understanding challenge, a single NeXtVLAD model with fewer than 80M parameters achieves a GAP score of 0.87846 on the private leaderboard. A mixture of 3 NeXtVLAD models achieves 0.88722, which ranked 3rd out of 394 teams. The code is publicly available at https://github.com/linrongc/youtube-8m.
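The idea described in the abstract (split each expanded frame feature into groups, weight each group with attention, then aggregate VLAD-style over time) can be sketched in NumPy as below. This is an illustrative sketch only, not the authors' implementation: the function name `grouped_vlad`, the parameter names in `params`, the sigmoid group attention, the softmax cluster assignment, the per-cluster L2 normalization, and all sizes are assumptions made for the example, and the weights are random rather than learned.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grouped_vlad(frames, params, groups, clusters):
    """Aggregate (M, N) frame features into one vector of size K * (D / G).

    Hypothetical sketch: names and normalization are assumptions.
    """
    num_frames = frames.shape[0]
    x = frames @ params["w_expand"]                  # (M, D) expanded features
    group_dim = x.shape[1] // groups
    x_g = x.reshape(num_frames, groups, group_dim)   # (M, G, D/G) low-dim groups

    attn = sigmoid(x @ params["w_attn"])             # (M, G) attention per group
    logits = (x @ params["w_assign"]).reshape(num_frames, groups, clusters)
    assign = softmax(logits, axis=-1)                # (M, G, K) soft cluster assignment
    weights = assign * attn[:, :, None]              # (M, G, K) combined weight

    # Sum of weighted residuals (x_g - center_k) over frames and groups.
    weighted_sum = np.einsum("mgk,mgd->kd", weights, x_g)   # (K, D/G)
    mass = weights.sum(axis=(0, 1))                          # (K,)
    vlad = weighted_sum - mass[:, None] * params["centers"]

    # Per-cluster L2 normalization (an assumption of this sketch).
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
    return vlad.reshape(-1)

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
M, N, D, G, K = 30, 64, 128, 8, 16
params = {
    "w_expand": rng.normal(scale=0.1, size=(N, D)),
    "w_attn": rng.normal(scale=0.1, size=(D, G)),
    "w_assign": rng.normal(scale=0.1, size=(D, G * K)),
    "centers": rng.normal(scale=0.1, size=(K, D // G)),
}
frames = rng.normal(size=(M, N))
video_vec = grouped_vlad(frames, params, G, K)
print(video_vec.shape)  # (256,)
```

In this sketch the output size is K * (D / G) rather than K * D, so any layer that consumes the aggregated vector needs roughly G times fewer input weights than it would with ungrouped aggregation at the same expanded dimension. That is one plausible reading of the parameter-efficiency claim.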
