Baidu-UTS Submission to the EPIC-Kitchens Action Recognition Challenge 2019
Pith reviewed 2026-05-25 18:35 UTC · model grok-4.3
The pith
Object detection features guide 3D CNN training via a gated aggregator to raise noun prediction accuracy in egocentric kitchen videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Object detection features guide the training of 3D CNNs through a Gated Feature Aggregator module that learns from both the clip feature and the object feature. The module strengthens the interaction between the two kinds of activations and avoids exploding gradients, which significantly improves noun prediction accuracy on both the seen and unseen test sets.
What carries the argument
The Gated Feature Aggregator module, which fuses clip features with object detection features while controlling their interaction to strengthen noun signals.
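The text here does not give the aggregator's equations, so the following is only a minimal sketch of one common gated-fusion pattern, not the authors' module. It assumes the clip feature and object feature have already been projected to the same dimension; the names `gated_fuse`, `w_gate`, and `b_gate` are hypothetical. A sigmoid gate computed from both inputs mixes them element-wise, so every fused value stays between the two inputs, which is one plausible way a gate can keep activations bounded.

```python
import math
import random


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gated_fuse(clip_feat, obj_feat, w_gate, b_gate):
    """Hypothetical gated fusion of two equally sized feature vectors.

    gate  = sigmoid(W_gate @ concat(clip, obj) + b_gate)
    fused = gate * clip + (1 - gate) * obj
    This is an assumed formulation, not the one from the paper.
    """
    joint = clip_feat + obj_feat  # list concatenation, length 2 * dim
    pre = matvec(w_gate, joint)
    gate = [sigmoid(p + b) for p, b in zip(pre, b_gate)]
    return [g * c + (1.0 - g) * o for g, c, o in zip(gate, clip_feat, obj_feat)]


# Toy inputs: random stand-ins for a clip feature and an object feature.
random.seed(0)
dim = 4
clip = [random.uniform(-1, 1) for _ in range(dim)]
obj = [random.uniform(-1, 1) for _ in range(dim)]
w = [[random.gauss(0, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
b = [0.0] * dim

fused = gated_fuse(clip, obj, w, b)
```

Because the gate lies strictly between 0 and 1, each fused element is a convex combination of the corresponding clip and object elements. Whether the paper's module uses this exact mixing rule is not stated in the text above.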
If this is right
- Noun prediction accuracy rises when object detection features are supplied to the 3D CNN through the aggregator.
- The performance gain holds both on kitchens the model has seen and on entirely new kitchens.
- The aggregator prevents gradient explosion while allowing the two feature streams to interact.
- Overall verb-noun action recognition improves as a direct result of better noun scores.
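The last implication in the list above can be illustrated with a small sketch. It assumes an action is a (verb, noun) pair scored by the product of the verb and noun probabilities, a common simplification that the text here does not confirm for this submission. The function name `top_action` and all probabilities are invented for illustration.

```python
def top_action(verb_probs, noun_probs):
    """Return the (verb, noun) pair with the highest product score.

    Assumes verb and noun predictions are combined independently,
    which is an illustrative simplification.
    """
    pairs = ((v, n) for v in verb_probs for n in noun_probs)
    return max(pairs, key=lambda p: verb_probs[p[0]] * noun_probs[p[1]])


# Made-up probabilities for one segment whose true action is ("take", "knife").
verb_probs = {"take": 0.6, "put": 0.4}
weak_noun = {"knife": 0.45, "spoon": 0.55}   # noun head slightly wrong
sharp_noun = {"knife": 0.80, "spoon": 0.20}  # noun head corrected

action_with_weak_noun = top_action(verb_probs, weak_noun)
action_with_sharp_noun = top_action(verb_probs, sharp_noun)
```

With the verb scores held fixed, correcting the noun distribution alone flips the predicted action from the wrong pair to the right one, which is the mechanism the bullet describes.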
Where Pith is reading between the lines
- The same gated fusion pattern could be tested on other video tasks that already have access to pre-trained detectors.
- The approach points toward lighter supervision strategies that combine cheap object labels with video-level action labels.
- Similar modules might help action models cope with the visual degradations typical of head-mounted cameras outside kitchens.
Load-bearing premise
Object detectors can extract reliable features from the same egocentric videos that contain small objects, motion blur, and occlusions.
What would settle it
Train identical 3D CNNs with and without the object features and Gated Feature Aggregator, then compare noun top-1 accuracy on the unseen test set; if accuracy does not rise, the guidance claim is false.
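The comparison described above reduces to computing noun top-1 accuracy for two models on the same labels. The sketch below shows that calculation on toy data; the score lists are invented and are not results from the paper, and `top1_accuracy` is a hypothetical helper name.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Toy noun labels and per-class scores for four segments (illustrative only).
labels = [0, 2, 1, 2]
baseline_scores = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.1, 0.3],
]
guided_scores = [
    [0.6, 0.2, 0.2],
    [0.2, 0.2, 0.6],
    [0.1, 0.7, 0.2],
    [0.5, 0.1, 0.4],
]

baseline_acc = top1_accuracy(baseline_scores, labels)
guided_acc = top1_accuracy(guided_scores, labels)
delta = guided_acc - baseline_acc  # positive delta would support the claim
```

In a real ablation, the two score sets would come from otherwise identical 3D CNNs trained with and without the object features and aggregator, evaluated on the unseen test set.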
Original abstract
In this report, we present the Baidu-UTS submission to the EPIC-Kitchens Action Recognition Challenge in CVPR 2019. This is the winning solution to this challenge. In this task, the goal is to predict verbs, nouns, and actions from the vocabulary for each video segment. The EPIC-Kitchens dataset contains various small objects, intense motion blur, and occlusions. It is challenging to locate and recognize the object that an actor interacts with. To address these problems, we utilize object detection features to guide the training of 3D Convolutional Neural Networks (CNN), which can significantly improve the accuracy of noun prediction. Specifically, we introduce a Gated Feature Aggregator module to learn from the clip feature and the object feature. This module can strengthen the interaction between the two kinds of activations and avoid gradient exploding. Experimental results demonstrate our approach outperforms other methods on both seen and unseen test set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Baidu-UTS winning submission to the EPIC-Kitchens Action Recognition Challenge 2019. It claims that object detection features, combined with a Gated Feature Aggregator module that fuses clip and object features, significantly improve noun prediction accuracy for 3D CNNs on egocentric videos containing small objects, motion blur, and occlusions, leading to top performance on both seen and unseen test sets.
Significance. If the performance claims hold, the work supplies a concrete, leaderboard-validated demonstration that auxiliary object detection features can be effectively integrated into video action models for egocentric settings. The competition result on unseen kitchens provides an external falsification test of the approach's robustness.
major comments (1)
- [Abstract] The central claim that object detection features 'significantly improve the accuracy of noun prediction' and that the method 'outperforms other methods on both seen and unseen test set' is stated without any numerical results, ablation tables, or per-class error analysis, leaving the magnitude and reliability of the reported noun improvement unsubstantiated in the manuscript.
minor comments (1)
- [Method] The description of the Gated Feature Aggregator remains at a conceptual level; adding the precise equations or block diagram would clarify how the gating avoids gradient explosion while strengthening feature interaction.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on our manuscript. We address it point-by-point below and will incorporate revisions as indicated.
-
Referee: [Abstract] The central claim that object detection features 'significantly improve the accuracy of noun prediction' and that the method 'outperforms other methods on both seen and unseen test set' is stated without any numerical results, ablation tables, or per-class error analysis, leaving the magnitude and reliability of the reported noun improvement unsubstantiated in the manuscript.
Authors: The comment correctly notes that the abstract itself contains no numerical values. The body of the manuscript reports the leaderboard results on seen and unseen test sets along with ablation experiments on the Gated Feature Aggregator. To make the central claims more concrete within the space constraints of the abstract, we will revise the abstract to include the key noun-accuracy deltas and overall action-recognition improvements achieved on the official test sets.
Revision: yes
Circularity Check
No significant circularity in derivation chain
This is an empirical competition report describing a 3D CNN architecture augmented with object detection features and a Gated Feature Aggregator. The central performance claims are validated directly on held-out test sets from the EPIC-Kitchens challenge leaderboard rather than derived from equations or first-principles arguments. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the method is evaluated end-to-end on external data, rendering the result self-contained.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "we introduce a Gated Feature Aggregator module to learn from the clip feature and the object feature"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "This is the winning solution to this challenge"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.