AVD: Adversarial Video Distillation

Abdenour Hadid; Mohammad Sabokrou; Mohammad Tavakolian

arxiv: 1907.05640 · v1 · pith:L55WUSZOnew · submitted 2019-07-12 · 💻 cs.CV

Pith reviewed 2026-05-24 22:38 UTC · model grok-4.3

keywords: adversarial video distillation · video representation · 3D to 2D mapping · activity recognition · UCF101 · HMDB51 · Kinetics · image model transfer

The pith

Videos can be compressed into single realistic images via a 3D encoder with adversarial training so that pre-trained image models classify the original video content directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Adversarial Video Distillation to turn input videos into 2D image representations that retain semantic meaning. A 3D convolutional encoder-decoder minimizes reconstruction error while an adversarial procedure forces the encoder outputs to resemble natural images. These generated images serve as direct inputs to networks already trained on large image collections for activity recognition. The approach reduces video tasks to image tasks without custom video architectures or extra temporal modules. Results on UCF101, HMDB51, and Kinetics show higher accuracy than previous video-specific methods.

Core claim

A 3D convolutional encoder maps each input video to a 2D latent image while an adversarial loss on the encoder output ensures the image remains semantically realistic; the resulting image can be passed unchanged into deep models pre-trained on static images and yields state-of-the-art classification accuracy on UCF101, HMDB51, and Kinetics.
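To make the 3D-to-2D mapping concrete, here is a minimal shape-level sketch in plain Python. It is not the paper's encoder: the learned 3D convolutional network is replaced by a fixed weighted average over time, and the names `collapse_time` and `weights` are hypothetical. It only illustrates that a clip of T frames of size H x W x 3 becomes a single H x W x 3 array of the kind an image model accepts.

```python
def collapse_time(video, weights):
    """Collapse a T x H x W x C nested-list video into one H x W x C image.

    Stand-in for the learned encoder: a fixed weighted average over frames.
    """
    if len(video) != len(weights):
        raise ValueError("need one weight per frame")
    total = sum(weights)
    height, width, channels = len(video[0]), len(video[0][0]), len(video[0][0][0])
    image = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for frame, w in zip(video, weights):
        for y in range(height):
            for x in range(width):
                for c in range(channels):
                    image[y][x][c] += w * frame[y][x][c] / total
    return image


# Toy clip: 4 frames of 2 x 2 RGB pixels, frame t filled with the value t.
clip = [[[[float(t)] * 3 for _ in range(2)] for _ in range(2)] for t in range(4)]
image = collapse_time(clip, weights=[1.0, 1.0, 1.0, 1.0])
print(len(image), len(image[0]), len(image[0][0]))  # 2 2 3
```

A fixed average discards frame order, which is exactly the information a learned encoder would need to keep for action recognition; the sketch shows the shapes involved, not the mechanism.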

What carries the argument

3D convolutional encoder-decoder trained with reconstruction loss plus adversarial supervision on the 2D encoder output to produce semantically realistic images from videos.
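The training signal described above can be sketched as a weighted sum of two terms. The abstract does not give the exact losses, so everything here is an assumption: mean squared error for reconstruction, the standard non-saturating GAN generator loss for the adversarial term, and a hypothetical weight `lam` balancing the two.

```python
import math


def reconstruction_loss(video, reconstruction):
    """Mean squared error between flattened input and decoder output."""
    if len(video) != len(reconstruction):
        raise ValueError("inputs must have the same length")
    return sum((v - r) ** 2 for v, r in zip(video, reconstruction)) / len(video)


def adversarial_loss(disc_prob_real):
    """Non-saturating generator loss: -log D(E(V)).

    disc_prob_real is the discriminator's probability that the encoder
    output is a natural image. Assumed form, not taken from the paper.
    """
    eps = 1e-12  # avoid log(0)
    return -math.log(max(disc_prob_real, eps))


def encoder_objective(video, reconstruction, disc_prob_real, lam=0.1):
    """Assumed total objective: reconstruction + lam * adversarial."""
    return reconstruction_loss(video, reconstruction) + lam * adversarial_loss(disc_prob_real)


# A perfect reconstruction that fully fools the discriminator costs nothing.
print(encoder_objective([0.0, 1.0], [0.0, 1.0], disc_prob_real=1.0))  # 0.0
```

Note that neither term in this assumed form rewards preserving action semantics directly; that gap is what the referee's comments on the adversarial procedure point at.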

If this is right

The 2D images act as plug-in inputs for any image-pretrained network without fine-tuning or temporal extensions.
Video classification accuracy on UCF101, HMDB51, and Kinetics exceeds prior state-of-the-art video methods.
Video analysis reduces to standard image analysis pipelines.
The same encoder can be applied across datasets of different scales including Kinetics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the 2D images implicitly encode motion, the same mapping could be tested on other sequential signals such as audio spectrograms.
The method opens a route to transfer large image foundation models to video tasks by first distilling each clip to an image.
Similar distillation could be explored for video detection or segmentation once the image outputs are shown to preserve spatial layout.

Load-bearing premise

The adversarial training produces 2D images whose semantic content remains intact enough for image-only classifiers to recognize the original video actions without any added temporal modeling.

What would settle it

Generate the 2D images from held-out videos, feed them to the same pre-trained image classifiers, and observe whether accuracy falls to the level of random images or below the accuracy of standard video models.
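The check above reduces to comparing three accuracies. A minimal sketch in plain Python follows; the function names and the example numbers are hypothetical and are not results from the paper.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("need equal, non-empty lists")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def verdict(acc_distilled, acc_random_images, acc_video_baseline):
    """Classify the outcome of the held-out test described above."""
    if acc_distilled <= acc_random_images:
        return "fails: no better than random images"
    if acc_distilled < acc_video_baseline:
        return "fails: below standard video models"
    return "survives this test"


# Hypothetical numbers for illustration only, not results from the paper.
labels = ["run", "jump", "run", "swim"]
acc = accuracy(["run", "jump", "swim", "swim"], labels)
print(acc)  # 0.75
print(verdict(acc, acc_random_images=0.25, acc_video_baseline=0.9))
```

In practice the comparison would also need confidence intervals across dataset splits before any verdict is drawn.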

Figures

Figures reproduced from arXiv: 1907.05640 by Abdenour Hadid, Mohammad Sabokrou, Mohammad Tavakolian.

**Figure 1.** Examples of videos (denoted by V) and their representation using AVD. AVD represents both spatial and temporal characteristics of raw videos as an RGB image (i.e., a discriminative feature map) that can be used as the input of deep models pre-trained on still images.

**Figure 2.** The outline of our proposed AVD for video representation. The encoder network …

Original abstract

In this paper, we present a simple yet efficient approach for video representation, called Adversarial Video Distillation (AVD). The key idea is to represent videos by compressing them in the form of realistic images, which can be used in a variety of video-based scene analysis applications. Representing a video as a single image enables us to address the problem of video analysis by image analysis techniques. To this end, we exploit a 3D convolutional encoder-decoder network to encode the input video as an image by minimizing the reconstruction error. Furthermore, weak supervision by an adversarial training procedure is imposed on the output of the encoder to generate semantically realistic images. The encoder learns to extract semantically meaningful representations from a given input video by mapping the 3D input into a 2D latent representation. The obtained representation can be simply used as the input of deep models pre-trained on images for video classification. We evaluated the effectiveness of our proposed method for video-based activity recognition on three standard and challenging benchmark datasets, i.e. UCF101, HMDB51, and Kinetics. The experimental results demonstrate that AVD achieves interesting performance, outperforming the state-of-the-art methods for video classification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AVD distills videos to single images via 3D encoder-decoder plus adversarial loss for direct image-model transfer, but the abstract supplies no numbers or ablations to support the transfer claim.

The one thing to know is that this paper proposes Adversarial Video Distillation to compress videos into realistic 2D images using a 3D encoder-decoder with an adversarial loss, allowing direct application of image-pretrained models to video classification tasks. The abstract claims it outperforms SOTA on three benchmarks but gives no numbers. The new element is the use of adversarial training to ensure the distilled image is semantically realistic rather than just a reconstruction.

The paper does well in presenting a clean pipeline that avoids the need for task-specific video architectures by leveraging existing image models. The main soft spot is the absence of any quantitative evidence or detailed analysis in the abstract. Without reported accuracies, ablations on the adversarial term, or examples of the generated images, it's impossible to assess whether the semantic preservation actually works as claimed. The assumption that the 2D representation captures sufficient temporal dynamics for zero-shot transfer to image classifiers is the critical part, and the stress-test note correctly identifies that this is not obviously guaranteed by the reconstruction plus adversarial setup. Video actions depend on motion, and mapping to a static image requires some mechanism to encode that, which is not explained or shown.

The method is presented as an empirical procedure, which is appropriate, and the circularity burden is low since there are no self-referential claims.

This paper is aimed at applied computer vision practitioners looking for efficient ways to handle video data with image tools. A reader might get value from the architectural description if they are exploring distillation techniques. However, given the low soundness due to missing results, it does not merit a serious referee at this point. The work would need substantial additional experimental validation to be worth reviewing. I would not cite it or bring it to a reading group based on the current version.

Referee Report

3 major / 1 minor

Summary. The paper proposes Adversarial Video Distillation (AVD), a method that compresses input videos into single realistic 2D images via a 3D convolutional encoder-decoder minimizing reconstruction error, with weak adversarial supervision on the encoder output to enforce semantic realism. The resulting 2D latent representations are asserted to capture sufficient semantic content (including action dynamics) to be used directly as input to image-pretrained deep models for video classification, outperforming SOTA on UCF101, HMDB51, and Kinetics.

Significance. If the empirical results and the semantic-preservation claim hold under rigorous validation, the work would offer a simple bridge between video and image analysis, allowing reuse of large-scale image models without video-specific architectures or fine-tuning. The idea of distilling 3D video to transferable 2D images is conceptually appealing and could impact efficiency in video tasks, but the absence of quantitative support, ablations, or mechanistic explanation in the manuscript limits its assessed significance.

major comments (3)

[Abstract] The central claim that the obtained 2D representation 'can be simply used as the input of deep models pre-trained on images for video classification' and 'outperforms the state-of-the-art methods' is unsupported by any numerical results, tables, error bars, or dataset-specific scores; this is load-bearing for the paper's primary contribution.
[Abstract] No derivation, equations, or ablation is supplied showing how the reconstruction-plus-adversarial objective on the 3D-to-2D encoder output embeds temporal motion information into a static image such that an ImageNet-pretrained classifier (no fine-tuning, no extra temporal modules) can outperform dedicated video methods; this is the least secure step in the argument.
[Abstract] The description of the adversarial training procedure provides no detail on the discriminator architecture, the precise form of the adversarial loss, or its interaction with the reconstruction term, preventing assessment of whether the generated images preserve action semantics rather than merely appearing realistic.

minor comments (1)

[Abstract] The phrase 'achieves interesting performance' is vague and should be replaced with precise quantitative statements once results are added.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the overall presentation of our contribution. We address each major comment below and will make targeted revisions to improve clarity and support for the claims.

Referee: [Abstract] The central claim that the obtained 2D representation 'can be simply used as the input of deep models pre-trained on images for video classification' and 'outperforms the state-of-the-art methods' is unsupported by any numerical results, tables, error bars, or dataset-specific scores; this is load-bearing for the paper's primary contribution.

Authors: We agree that the abstract would benefit from explicit quantitative support. The manuscript body presents results on UCF101, HMDB51, and Kinetics with comparisons to prior methods. We will revise the abstract to include key performance metrics and direct references to the experimental tables.

revision: yes

Referee: [Abstract] No derivation, equations, or ablation is supplied showing how the reconstruction-plus-adversarial objective on the 3D-to-2D encoder output embeds temporal motion information into a static image such that an ImageNet-pretrained classifier (no fine-tuning, no extra temporal modules) can outperform dedicated video methods; this is the least secure step in the argument.

Authors: The method section explains that the 3D convolutional encoder extracts spatio-temporal features which are compressed into the 2D output, with reconstruction preserving content and adversarial training enforcing semantic realism. While a formal derivation of temporal embedding is not provided, the design rationale and empirical outperformance on action recognition tasks support the claim. We will add a short clarifying paragraph in the method description.

revision: partial

Referee: [Abstract] The description of the adversarial training procedure provides no detail on the discriminator architecture, the precise form of the adversarial loss, or its interaction with the reconstruction term, preventing assessment of whether the generated images preserve action semantics rather than merely appearing realistic.

Authors: The full manuscript details the discriminator (a 2D CNN), the adversarial loss formulation, and its combination with the reconstruction objective in the training procedure section. We will update the abstract to briefly reference these elements and point to the methods for complete specifications.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical procedure evaluated on external benchmarks

The paper presents an empirical method using a 3D encoder-decoder with reconstruction loss plus adversarial training to map input videos to single 2D images, then measures success via downstream classification accuracy on UCF101, HMDB51, and Kinetics using pre-trained image models. No equations, derivations, or parameter-fitting steps are described that reduce any claim to its own inputs by construction. The central performance assertions rest on external experimental outcomes rather than self-referential definitions or self-citation chains, so the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard assumptions of convolutional networks and adversarial training; no free parameters, axioms, or invented entities are explicitly introduced or quantified in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "3D convolutional encoder-decoder network to encode the input video as an image by minimizing the reconstruction error. Furthermore, weak supervision by an adversarial training procedure is imposed on the output of the encoder"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "The obtained representation can be simply used as the input of deep models pre-trained on images for video classification"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

