Video Action Recognition Via Neural Architecture Searching

Guoying Zhao; Wei Peng; Xiaopeng Hong

arxiv: 1907.04632 · v1 · pith:53NZSF6Knew · submitted 2019-07-10 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-24 23:57 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords: neural architecture search, video action recognition, UCF101, pseudo-3D operators, temporal segment sampling, spatio-temporal networks, differentiable architecture search

The pith

Automated search over a graph of pseudo-3D operators produces a video action recognition network that, trained from scratch, beats hand-designed models on UCF101 while using only about one percent of their parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to replace manual expert design of neural networks for video action recognition with an automated search process. It represents candidate networks as a directed acyclic graph whose edges are chosen from a space of pseudo-3D operators and optimizes the choice with gradients. A temporal segment sampling scheme is added so that the search remains computationally tractable on video data. A sympathetic reader would care because the method claims to deliver higher accuracy than established architectures on UCF101 even though the resulting network contains roughly one percent as many parameters and is trained entirely from scratch.

Core claim

A spatio-temporal network is obtained by gradient-based search inside a differentiable space modeled by a directed acyclic graph. The search employs a temporal segment approach on video inputs to keep global information while lowering cost and restricts the operator space to pseudo-3D convolutions. On the UCF101 dataset the discovered architecture, trained from scratch, exceeds the accuracy of popular manual networks while requiring only around one percent of their parameter count.
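The gradient-based search described above can be illustrated with a minimal sketch. It assumes a DARTS-style softmax relaxation, which this page does not spell out; the operator names and scalar "features" are hypothetical stand-ins for pseudo-3D convolutions acting on feature maps.

```python
import math

def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate operators on one edge of the graph. Real candidates
# would be pseudo-3D convolutions, not scalar functions.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

def mixed_edge(x, alphas):
    """Continuous relaxation: softmax-weighted sum of all candidate outputs."""
    probs = softmax(alphas)
    return sum(p * op(x) for p, op in zip(probs, CANDIDATE_OPS.values()))

def discretize(alphas):
    """After search, keep only the operator with the largest weight."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]
```

Because `mixed_edge` is differentiable in `alphas`, the architecture weights can in principle be updated by gradient descent alongside the network weights; `discretize` then reads off one operator per edge.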

What carries the argument

Gradient-based optimization of a directed acyclic graph whose edges are chosen from pseudo-3D spatio-temporal operators, combined with temporal segment sampling of video clips.

If this is right

The searched architecture achieves higher accuracy on UCF101 than popular manual designs under identical from-scratch training.
The model requires only about one percent of the parameters of its manual-design counterparts.
The temporal segment sampling reduces evaluation cost during search without discarding essential video information.
Pseudo-3D operators form an efficient yet expressive search space for spatio-temporal networks.
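To make the efficiency argument for pseudo-3D operators concrete, here is a small parameter-count sketch. It assumes the common factorization of a k x k x k convolution into a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal one; the channel widths are illustrative and not taken from the paper.

```python
def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weight count of a 3D convolution (bias ignored)."""
    return c_in * c_out * k_t * k_h * k_w

def pseudo3d_params(c_in, c_mid, c_out, k):
    """Spatial 1 x k x k convolution followed by a temporal k x 1 x 1 convolution."""
    spatial = conv3d_params(c_in, c_mid, 1, k, k)
    temporal = conv3d_params(c_mid, c_out, k, 1, 1)
    return spatial + temporal

# Illustrative layer: 64 input and output channels, kernel size 3.
full = conv3d_params(64, 64, 3, 3, 3)      # 110592 weights
factored = pseudo3d_params(64, 64, 64, 3)  # 49152 weights
```

In this example the factored layer needs under half the weights of the full 3D layer. That per-layer saving does not by itself account for the roughly one percent figure, which would also depend on the depth and width of the searched network.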

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same search procedure could be applied directly to other video tasks such as temporal action localization or video captioning.
Because the discovered model is small, it may run in real time on embedded hardware where larger manual networks cannot.
Expanding the search space to include additional operator types might yield further accuracy gains on larger video datasets.
The low parameter count suggests the search process discovers unusually efficient connectivity patterns rather than simply scaling up existing designs.

Load-bearing premise

That sampling a few temporal segments from each video preserves enough global information for action recognition and that the pseudo-3D operator space contains structures close to the best possible networks for the task.
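The sampling assumption can be sketched as follows. This is a minimal illustration in the spirit of temporal segment networks, not the authors' implementation; the function name and the choice of one random frame per segment are assumptions.

```python
import random

def sample_segment_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of num_segments equal-length spans."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    indices = []
    for s in range(num_segments):
        start = s * num_frames // num_segments
        end = (s + 1) * num_frames // num_segments  # exclusive bound
        indices.append(rng.randrange(start, end))
    return indices
```

Every call covers the whole clip from start to end while touching only `num_segments` frames, which is the cost-versus-coverage trade-off the premise relies on.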

What would settle it

If any popular hand-designed architecture, when trained from scratch on UCF101 under the same protocol, reaches equal or higher accuracy than the searched model while using a similar or smaller parameter count, the performance claim would be refuted.

Original abstract

Deep neural networks have achieved great success for video analysis and understanding. However, designing a high-performance neural architecture requires substantial efforts and expertise. In this paper, we make the first attempt to let algorithm automatically design neural networks for video action recognition tasks. Specifically, a spatio-temporal network is developed in a differentiable space modeled by a directed acyclic graph, thus a gradient-based strategy can be performed to search an optimal architecture. Nonetheless, it is computationally expensive, since the computational burden to evaluate each architecture candidate is still heavy. To alleviate this issue, we, for the video input, introduce a temporal segment approach to reduce the computational cost without losing global video information. For the architecture, we explore in an efficient search space by introducing pseudo 3D operators. Experiments show that, our architecture outperforms popular neural architectures, under the training from scratch protocol, on the challenging UCF101 dataset, surprisingly, with only around one percentage of parameters of its manual-design counterparts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

First application of NAS to video action recognition with efficiency adaptations, but the outperformance claim has no numbers or ablations to back it up.

The paper's main contribution is applying neural architecture search to video action recognition for the first time, using a differentiable DAG-based search with temporal segment sampling and pseudo-3D operators to keep costs down. What it does well is identifying the computational barriers in standard NAS for video and proposing practical workarounds that allow the search to run. The claim of beating hand-designed networks on UCF101 while using far fewer parameters is the kind of result that could matter if it holds up. The soft spots are in the lack of detail around the actual search process and results. The abstract mentions outperformance but gives no specific accuracy numbers, no comparison tables, no ablations on the temporal segments or the operator choices. Without those, it's difficult to separate the effect of the NAS from possible differences in training or implementation of the baselines. The assumption that the pseudo-3D space is rich enough and that segmenting doesn't lose key information isn't tested in the provided text. This looks like an early step in a new direction rather than a polished result. The citation pattern seems standard for the area, with no obvious circularity. This paper is for people working on automated design for video models. A reader looking for new ideas in NAS for CV might get some value from the method description, but the performance claims need more backing before they can be taken as established. I would send it to peer review because it's the first attempt in this subfield and the ideas are worth checking, even if the current evidence is preliminary.

Referee Report

3 major / 0 minor

Summary. The manuscript presents the first application of neural architecture search (NAS) to video action recognition. It models a spatio-temporal network as a differentiable directed acyclic graph (DAG) searchable via gradient-based optimization. To mitigate evaluation cost, the authors introduce a temporal segment approach for video inputs and restrict the search space to pseudo-3D operators. The central empirical claim is that the resulting architecture, when trained from scratch, outperforms popular manual-designed networks on UCF101 while using only ~1% of their parameter count.

Significance. If the performance and efficiency claims are substantiated with full experimental details, the work would establish a proof-of-concept for automated architecture design in the video domain and highlight the value of efficiency-oriented search spaces. Credit is due for targeting a new application area for differentiable NAS and for attempting to trade off search cost against global video modeling.

major comments (3)

[Abstract] The headline outperformance claim on UCF101 is presented without any numerical accuracy figures, baseline re-implementations, standard deviations, or training protocols, rendering the central performance assertion unverifiable from the given text.
[Abstract] The assertion that the temporal segment approach 'reduce[s] the computational cost without losing global video information' is load-bearing for both the efficiency and correctness of the method, yet no ablation against full-video inputs or alternative sampling strategies is supplied.
[Abstract] The claim that gains arise from the NAS procedure rather than baseline differences rests on the untested assumption that the pseudo-3D operator DAG is expressive enough to contain near-optimal video architectures; no comparison to full-3D operators or random search within the same space is reported.
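The random-search control mentioned in the last comment could look roughly like the sketch below. It samples one operator uniformly per edge of a small cell; the operator names and the cell size are hypothetical, since the paper's actual candidate set is not given on this page.

```python
import random

# Hypothetical operator names; the paper's actual candidate set is not given here.
OPS = ["p3d_conv_3", "p3d_conv_5", "max_pool", "skip", "none"]

def sample_random_cell(num_nodes, rng):
    """Assign a uniformly random operator to every edge (i -> j, i < j)."""
    return {(i, j): rng.choice(OPS)
            for j in range(1, num_nodes)
            for i in range(j)}

cell = sample_random_cell(4, random.Random(0))
```

Training several such randomly sampled cells under the same protocol would give a baseline against which the gradient-based search could be judged.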

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where we will revise the manuscript to improve clarity and substantiation of the claims.

Referee: [Abstract] The headline outperformance claim on UCF101 is presented without any numerical accuracy figures, baseline re-implementations, standard deviations, or training protocols, rendering the central performance assertion unverifiable from the given text.

Authors: We agree that the abstract should include key quantitative results to enhance verifiability. In the revised version, we will incorporate the reported top-1 accuracy on UCF101 for the discovered architecture versus the compared hand-designed networks, the approximate parameter reduction to ~1%, and explicit references to the training-from-scratch protocol and standard deviations detailed in the experimental section.

Revision: yes

Referee: [Abstract] The assertion that the temporal segment approach 'reduce[s] the computational cost without losing global video information' is load-bearing for both the efficiency and correctness of the method, yet no ablation against full-video inputs or alternative sampling strategies is supplied.

Authors: The temporal segment approach is introduced specifically to enable feasible evaluation of architecture candidates on video inputs while sampling multiple segments to retain global context, following established practices in video recognition. We acknowledge that an explicit ablation would provide stronger support and will add a targeted comparison of segment-based versus full-video or alternative sampling strategies in the revised experimental analysis.

Revision: yes

Referee: [Abstract] The claim that gains arise from the NAS procedure rather than baseline differences rests on the untested assumption that the pseudo-3D operator DAG is expressive enough to contain near-optimal video architectures; no comparison to full-3D operators or random search within the same space is reported.

Authors: The search space is deliberately restricted to pseudo-3D operators to maintain tractability during the gradient-based search on video data; expanding to full-3D operators would substantially increase search cost. The discovered architecture's outperformance over multiple hand-designed baselines (which employ comparable or richer operators) provides evidence that the space yields competitive results. We did not perform random search within the space or full-3D comparisons, as these fall outside the efficiency-focused scope. We will add a discussion of the search-space design rationale and this limitation in the revision.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical NAS result independent of inputs

The paper reports an empirical outcome from running a gradient-based architecture search in a defined pseudo-3D DAG space with temporal segment sampling, then evaluating the discovered network on UCF101 under training-from-scratch. No equations, fitted parameters, or self-citations are shown that would make the reported accuracy or parameter count equivalent to the search-space definition or any input by construction. The modeling choices (temporal segments, pseudo-3D operators) are presented as explicit design decisions to reduce cost, not as derived results. The outperformance claim therefore remains an external measurement rather than a renaming or tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input provides no concrete free parameters, axioms, or invented entities; the search space and temporal segment method are mentioned at high level but not formalized.

pith-pipeline@v0.9.0 · 5689 in / 1118 out tokens · 22648 ms · 2026-05-24T23:57:00.004852+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 11 internal anchors

[1]

Video Action Recognition Via Neural Architecture Searching

INTRODUCTION Video action recognition [1], which is a hot topic of video analysis and understanding, has drawn considerable attention from both academia and industry, since it has great value to many potential applications, like behaviour analysis [2], security, and video affective computing [3]. On one hand, new and large-scale datasets, such as Kinetics...

[2]

Here, we search for a single (2+1)D convolutional module unit and repeat it for multiple times to build a neural network

PROPOSED METHODS In this section we detail the modularized architecture C to be searched and the search strategy. Here, we search for a single (2+1)D convolutional module unit and repeat it for multiple times to build a neural network. During the search procedure, we build a shallow networks, N(C), to explore in the search space. For video action recogniti...

[3]

EXPERIMENTS In this section, we study the proposed automatic method of designing action recognition network to demonstrate its advantages over other famous action recognition architectures, e.g., 3D-ResNet [19], C3D network [20], and STC-ResNet [21]. We evaluate our algorithm on the challenging action recognition dataset UCF101, which is a trimmed datas...

[4]

Specifically, we model the neural network by a directed acyclic graph and efficiently search a spatial-temporal neural architecture in a continuous search space

CONCLUSION In this paper, we perform neural architecture search for the action recognition task for the first time. Specifically, we model the neural network by a directed acyclic graph and efficiently search a spatial-temporal neural architecture in a continuous search space. We demonstrate that our method outperforms other popular models under the traini...

[5]

ACKNOWLEDGEMENTS This work was supported by the Academy of Finland ICT 2023 project (Grant No. 313600), Tekes Fidipro program (Grant No. 1849/31/2015) and Business Finland project (Grant No. 3116/31/2017), Infotech Oulu, and the National Natural Science Foundation of China (Grants No. 61772419). As well, the authors wish to acknowledge CSC-IT Center for S...

[6]

Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis

Q. Le, W. Zou, S. Yeung, and A. Ng, “Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 3361–3368

[7]

Video modelling and behaviour analysis: A guide for teaching social skills to children with autism

M. Keenan and C. Nikopoulos, Video modelling and behaviour analysis: A guide for teaching social skills to children with autism, Jessica Kingsley Publishers, 2006

[8]

A Boost in Revealing Subtle Facial Expressions: A Consolidated Eulerian Framework

W. Peng, X. Hong, Y. Xu, and G. Zhao, “A boost in revealing subtle facial expressions: A consolidated eulerian framework,” arXiv preprint arXiv:1901.07765, 2019

[9]

The Kinetics Human Action Video Dataset

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017

[10]

The something something video database for learning and evaluating visual common sense

R. Goyal, S. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al., “The something something video database for learning and evaluating visual common sense,” in The IEEE International Conference on Computer Vision (ICCV), 2017, vol. 1, p. 3

[11]

You only look once: Unified, real-time object detection

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788

[12]

Neural Architecture Search with Reinforcement Learning

B. Zoph and Q. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016

[13]

Learning transferable architectures for scalable image recognition

B. Zoph, V. Vasudevan, J. Shlens, and Q. Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8697–8710

[14]

Regularized Evolution for Image Classifier Architecture Search

E. Real, A. Aggarwal, Y. Huang, and Q. Le, “Regularized evolution for image classifier architecture search,” arXiv preprint arXiv:1802.01548, 2018

[15]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014

[16]

Deep residual learning for image recognition

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

[17]

Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation

C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille, and L. Fei-Fei, “Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation,” arXiv preprint arXiv:1901.02985, 2019

[18]

Efficient Neural Architecture Search via Parameter Sharing

H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” arXiv preprint arXiv:1802.03268, 2018

[19]

DARTS: Differentiable Architecture Search

H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018

[20]

Temporal segment networks: Towards good practices for deep action recognition

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in ECCV, 2016

[21]

A closer look at spatiotemporal convolutions for action recognition

D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459

[22]

UCF101: A dataset of 101 human actions classes from videos in the wild

K. Soomro, A. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” Computer Science, 2012

[23]

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017

[24]

Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet

K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 18–22

[25]

Learning spatiotemporal features with 3d convolutional networks

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497

[26]

Spatio-Temporal Channel Correlation Networks for Action Classification

A. Diba, M. Fayyaz, V. Sharma, M. Arzani, R. Yousefzadeh, J. Gall, and L. Van Gool, “Spatio-temporal channel correlation networks for action classification,” arXiv preprint arXiv:1806.07754, 2018

[27]

Quo vadis, action recognition? a new model and the kinetics dataset

J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
