Video Action Recognition Via Neural Architecture Searching
Pith reviewed 2026-05-24 23:57 UTC · model grok-4.3
The pith
Automated search over a graph of pseudo-3D operators produces a video action network that, trained from scratch, beats hand-designed models on UCF101 while using only about one percent of their parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A spatio-temporal network is obtained by gradient-based search inside a differentiable space modeled by a directed acyclic graph. The search applies a temporal segment approach to the video input, which lowers cost while keeping global information, and it restricts the operator space to pseudo-3D convolutions. On the UCF101 dataset the discovered architecture, trained from scratch, exceeds the accuracy of popular manual networks while requiring only around one percent of their parameter count.
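The continuous relaxation that makes such a search differentiable can be illustrated with a toy sketch. It assumes the DARTS-style formulation listed among the paper's references, in which each edge of the graph mixes all candidate operators with softmax weights. The scalar operators and the names `CANDIDATE_OPS`, `mixed_op`, and `discretize` are hypothetical stand-ins, not the paper's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy scalar stand-ins for the candidate operators on one edge (assumption:
# a real search space would hold pseudo-3D convolutions, pooling, and so on).
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

def mixed_op(x, alphas):
    """Continuous relaxation: softmax-weighted sum of every candidate's output."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))

def discretize(alphas):
    """After the search, keep only the operator with the largest weight."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]
```

Because `mixed_op` is a smooth function of `alphas`, the architecture weights can be updated by gradient descent alongside the network weights, and `discretize` recovers a single operator per edge once the search ends.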
What carries the argument
Gradient-based optimization of a directed acyclic graph built from pseudo-3D spatio-temporal operators, combined with temporal segment sampling of video clips.
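To see why pseudo-3D operators are cheaper than full 3D convolutions, a rough weight count helps. The sketch below assumes the common factorization of a `k_t x k_s x k_s` kernel into a spatial `1 x k_s x k_s` convolution followed by a temporal `k_t x 1 x 1` convolution, with biases ignored. The function names and the intermediate width `c_mid` are illustrative assumptions, not values taken from the paper.

```python
def conv3d_params(c_in, c_out, k_t, k_s):
    """Weights in a full 3D convolution with a k_t x k_s x k_s kernel."""
    return c_in * c_out * k_t * k_s * k_s

def pseudo3d_params(c_in, c_mid, c_out, k_t, k_s):
    """Weights in a spatial (1 x k_s x k_s) conv followed by a temporal (k_t x 1 x 1) conv."""
    spatial = c_in * c_mid * k_s * k_s
    temporal = c_mid * c_out * k_t
    return spatial + temporal

full = conv3d_params(64, 64, k_t=3, k_s=3)            # 110592 weights
factored = pseudo3d_params(64, 64, 64, k_t=3, k_s=3)  # 49152 weights
```

In this example the factorized layer needs a little under half the weights of the full 3D layer. A per-layer saving of that size would not by itself account for a hundredfold reduction, so the reported gap presumably also reflects the depth and width of the searched network.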
If this is right
- The searched architecture achieves higher accuracy on UCF101 than popular manual designs under identical from-scratch training.
- The model requires only about one percent of the parameters of its manual-design counterparts.
- The temporal segment sampling reduces evaluation cost during search without discarding essential video information.
- Pseudo-3D operators form an efficient yet expressive search space for spatio-temporal networks.
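The temporal segment idea can be sketched as follows, assuming the scheme from the temporal segment networks work in the reference list: split the video into equal-length segments and draw one frame index from each. The function name `sample_segment_indices` and its parameters are hypothetical, and the paper may sample short snippets rather than single frames.

```python
import random

def sample_segment_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of num_segments contiguous, near-equal segments."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    indices = []
    for s in range(num_segments):
        start = (s * num_frames) // num_segments
        end = ((s + 1) * num_frames) // num_segments  # exclusive upper bound
        indices.append(rng.randrange(start, end))
    return indices

# Example: 4 indices spread across a 100-frame clip, one per 25-frame segment.
frame_ids = sample_segment_indices(100, 4, random.Random(0))
```

The sampled indices span the whole clip, which is the sense in which global information is kept, while the network processes only `num_segments` frames instead of all of them.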
Where Pith is reading between the lines
- The same search procedure could be applied directly to other video tasks such as temporal action localization or video captioning.
- Because the discovered model is small, it may run in real time on embedded hardware where larger manual networks cannot.
- Expanding the search space to include additional operator types might yield further accuracy gains on larger video datasets.
- The low parameter count suggests the search process discovers unusually efficient connectivity patterns rather than simply scaling up existing designs.
Load-bearing premise
That sampling a few temporal segments from each video preserves enough global information for action recognition and that the pseudo-3D operator space contains structures close to the best possible networks for the task.
What would settle it
If any popular hand-designed architecture, when trained from scratch on UCF101 under the same protocol, reaches equal or higher accuracy than the searched model while using a similar or smaller parameter count, the performance claim would be refuted.
Original abstract
Deep neural networks have achieved great success for video analysis and understanding. However, designing a high-performance neural architecture requires substantial efforts and expertise. In this paper, we make the first attempt to let algorithm automatically design neural networks for video action recognition tasks. Specifically, a spatio-temporal network is developed in a differentiable space modeled by a directed acyclic graph, thus a gradient-based strategy can be performed to search an optimal architecture. Nonetheless, it is computationally expensive, since the computational burden to evaluate each architecture candidate is still heavy. To alleviate this issue, we, for the video input, introduce a temporal segment approach to reduce the computational cost without losing global video information. For the architecture, we explore in an efficient search space by introducing pseudo 3D operators. Experiments show that, our architecture outperforms popular neural architectures, under the training from scratch protocol, on the challenging UCF101 dataset, surprisingly, with only around one percentage of parameters of its manual-design counterparts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first application of neural architecture search (NAS) to video action recognition. It models a spatio-temporal network as a differentiable directed acyclic graph (DAG) that can be searched by gradient-based optimization. To mitigate evaluation cost, the authors introduce a temporal segment approach for video inputs and restrict the search space to pseudo-3D operators. The central empirical claim is that the resulting architecture, when trained from scratch, outperforms popular manually designed networks on UCF101 while using only ~1% of their parameter count.
Significance. If the performance and efficiency claims are substantiated with full experimental details, the work would establish a proof-of-concept for automated architecture design in the video domain and highlight the value of efficiency-oriented search spaces. Credit is due for targeting a new application area for differentiable NAS and for attempting to trade off search cost against global video modeling.
major comments (3)
- [Abstract] The headline outperformance claim on UCF101 is presented without any numerical accuracy figures, baseline re-implementations, standard deviations, or training protocols, rendering the central performance assertion unverifiable from the given text.
- [Abstract] The assertion that the temporal segment approach 'reduce[s] the computational cost without losing global video information' is load-bearing for both the efficiency and correctness of the method, yet no ablation against full-video inputs or alternative sampling strategies is supplied.
- [Abstract] The claim that gains arise from the NAS procedure rather than baseline differences rests on the untested assumption that the pseudo-3D operator DAG is expressive enough to contain near-optimal video architectures; no comparison to full-3D operators or random search within the same space is reported.
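The random-search control requested in the last comment could look roughly like the sketch below. Everything here is an assumption for illustration: the edge list, the operator names, and especially `toy_score`, which stands in for training a sampled architecture and measuring its validation accuracy.

```python
import random

def sample_architecture(edges, op_names, rng):
    """Assign one uniformly random operator to every edge of the DAG."""
    return {edge: rng.choice(op_names) for edge in edges}

def random_search(edges, op_names, score_fn, num_trials, seed=0):
    """Return the best-scoring architecture among num_trials random samples."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(edges, op_names, rng)
        score = score_fn(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

# Toy usage: the score function is a placeholder, not a real validation accuracy.
EDGES = [(0, 2), (1, 2), (0, 3), (2, 3)]
OPS = ["conv_2plus1d_3x3", "max_pool", "skip_connect", "none"]

def toy_score(arch):
    return sum(op == "conv_2plus1d_3x3" for op in arch.values())

best_arch, best_score = random_search(EDGES, OPS, toy_score, num_trials=20)
```

If architectures found this way matched the gradient-searched one under the same training budget, the gains would be attributable to the search space rather than to the search procedure.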
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, indicating where we will revise the manuscript to improve clarity and substantiation of the claims.
- Referee: [Abstract] The headline outperformance claim on UCF101 is presented without any numerical accuracy figures, baseline re-implementations, standard deviations, or training protocols, rendering the central performance assertion unverifiable from the given text.
  Authors: We agree that the abstract should include key quantitative results to enhance verifiability. In the revised version, we will incorporate the reported top-1 accuracy on UCF101 for the discovered architecture versus the compared hand-designed networks, the approximate parameter reduction to ~1%, and explicit references to the training-from-scratch protocol and standard deviations detailed in the experimental section.
  revision: yes
- Referee: [Abstract] The assertion that the temporal segment approach 'reduce[s] the computational cost without losing global video information' is load-bearing for both the efficiency and correctness of the method, yet no ablation against full-video inputs or alternative sampling strategies is supplied.
  Authors: The temporal segment approach is introduced specifically to enable feasible evaluation of architecture candidates on video inputs while sampling multiple segments to retain global context, following established practices in video recognition. We acknowledge that an explicit ablation would provide stronger support and will add a targeted comparison of segment-based versus full-video or alternative sampling strategies in the revised experimental analysis.
  revision: yes
- Referee: [Abstract] The claim that gains arise from the NAS procedure rather than baseline differences rests on the untested assumption that the pseudo-3D operator DAG is expressive enough to contain near-optimal video architectures; no comparison to full-3D operators or random search within the same space is reported.
  Authors: The search space is deliberately restricted to pseudo-3D operators to maintain tractability during the gradient-based search on video data; expanding to full-3D operators would substantially increase search cost. The discovered architecture's outperformance over multiple hand-designed baselines (which employ comparable or richer operators) provides evidence that the space yields competitive results. We did not perform random search within the space or full-3D comparisons, as these fall outside the efficiency-focused scope. We will add a discussion of the search-space design rationale and this limitation in the revision.
  revision: partial
Circularity Check
No circularity: empirical NAS result independent of inputs
The paper reports an empirical outcome from running a gradient-based architecture search in a defined pseudo-3D DAG space with temporal segment sampling, then evaluating the discovered network on UCF101 under training-from-scratch. No equations, fitted parameters, or self-citations are shown that would make the reported accuracy or parameter count equivalent to the search-space definition or any input by construction. The modeling choices (temporal segments, pseudo-3D operators) are presented as explicit design decisions to reduce cost, not as derived results. The outperformance claim therefore remains an external measurement rather than a renaming or tautology.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Video Action Recognition Via Neural Architecture Searching, arXiv, 2019. INTRODUCTION: Video action recognition [1], which is a hot topic of video analysis and understanding, has drawn considerable attention from both academia and industry, since it has great value to many potential applications, like behaviour analysis [2], security, and video affective computing [3]. On one hand, new and large-scale datasets, such as Kinetics...
- [2] PROPOSED METHODS: In this section we detail the modularized architecture C to be searched and the search strategy. Here, we search for a single (2+1)D convolutional module unit and repeat it multiple times to build a neural network. During the search procedure, we build a shallow network, N(C), to explore the search space. For video action recogniti...
- [3] EXPERIMENTS: In this section, we study the proposed automatic method of designing action recognition network to demonstrate its advantages over other famous action recognition architectures, e.g., 3D-ResNet [19], C3D network [20], and STC-ResNet [21]. We evaluate our algorithm on the challenging action recognition dataset UCF101, which is a trimmed datas...
- [4] CONCLUSION: In this paper, we perform neural architecture search for the action recognition task for the first time. Specifically, we model the neural network by a directed acyclic graph and efficiently search a spatial-temporal neural architecture in a continuous search space. We demonstrate that our method outperforms other popular models under the traini...
- [5] ACKNOWLEDGEMENTS: This work was supported by the Academy of Finland ICT 2023 project (Grant No. 313600), Tekes Fidipro program (Grant No. 1849/31/2015) and Business Finland project (Grant No. 3116/31/2017), Infotech Oulu, and the National Natural Science Foundation of China (Grant No. 61772419). As well, the authors wish to acknowledge CSC-IT Center for S...
- [6] Q. Le, W. Zou, S. Yeung, and A. Ng, “Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 3361–3368.
- [7] M. Keenan and C. Nikopoulos, Video modelling and behaviour analysis: A guide for teaching social skills to children with autism, Jessica Kingsley Publishers, 2006.
- [8] W. Peng, X. Hong, Y. Xu, and G. Zhao, “A boost in revealing subtle facial expressions: A consolidated Eulerian framework,” arXiv preprint arXiv:1901.07765, 2019.
- [9] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., “The Kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
- [10] R. Goyal, S. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al., “The something something video database for learning and evaluating visual common sense,” in The IEEE International Conference on Computer Vision (ICCV), 2017, vol. 1, p. 3.
- [11] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
- [12] B. Zoph and Q. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
- [13] B. Zoph, V. Vasudevan, J. Shlens, and Q. Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
- [14] E. Real, A. Aggarwal, Y. Huang, and Q. Le, “Regularized evolution for image classifier architecture search,” arXiv preprint arXiv:1802.01548, 2018.
- [15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [17] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. Yuille, and L. Fei-Fei, “Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation,” arXiv preprint arXiv:1901.02985, 2019.
- [18] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” arXiv preprint arXiv:1802.03268, 2018.
- [19] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018.
- [20] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in ECCV, 2016.
- [21] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.
- [22] K. Soomro, A. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” Computer Science, 2012.
- [23] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.
- [24] K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 18–22.
- [25] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
- [26] A. Diba, M. Fayyaz, V. Sharma, M. Arzani, R. Yousefzadeh, J. Gall, and L. Van Gool, “Spatio-temporal channel correlation networks for action classification,” arXiv preprint arXiv:1806.07754, 2018.
- [27] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.