Baidu-UTS Submission to the EPIC-Kitchens Action Recognition Challenge 2019

Linchao Zhu; Xiaohan Wang; Yi Yang; Yu Wu

arxiv: 1906.09383 · v1 · pith:7MKIX7R2new · submitted 2019-06-22 · 💻 cs.CV

Pith reviewed 2026-05-25 18:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords action recognition · egocentric video · object detection · 3D CNN · gated feature aggregator · noun prediction · EPIC-Kitchens

The pith

Object detection features guide 3D CNN training via a gated aggregator to raise noun prediction accuracy in egocentric kitchen videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The submission demonstrates that object detection outputs can steer the learning of 3D convolutional networks so that they more accurately identify the nouns in first-person action clips. These videos contain small items, heavy blur, and frequent occlusions, which make direct noun recognition difficult from motion features alone. The authors insert a Gated Feature Aggregator that combines clip-level activations with object features, controls their interaction, and prevents gradient explosion during training. The resulting model records higher noun accuracy on both kitchens seen during training and entirely new kitchens. This approach produced the winning entry in the 2019 EPIC-Kitchens action recognition challenge.

Core claim

Object detection features are used to guide the training of 3D CNNs, and a Gated Feature Aggregator module is introduced that learns from both the clip feature and the object feature; this module strengthens the interaction between the two kinds of activations and avoids gradient exploding, which significantly improves the accuracy of noun prediction on both seen and unseen test sets.

What carries the argument

The Gated Feature Aggregator module, which fuses clip features with object detection features while controlling their interaction to strengthen noun signals.
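
As a rough illustration of what gated fusion of two feature vectors can look like, the following Python sketch lets the object feature produce a per-channel sigmoid gate that rescales the clip feature. The text here does not give the module's equations, so the gating form, the function names (`gated_fuse`, `linear`), and the toy dimensions are all assumptions made for illustration, loosely modeled on the context gating idea of [13]. It is not the authors' implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def linear(weights, bias, vec):
    """Return W @ vec + bias, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def gated_fuse(clip_feat, obj_feat, weights, bias):
    """Hypothetical gate: the object feature yields one gate value in (0, 1)
    per clip channel, and the clip feature is rescaled by that gate."""
    gate = [sigmoid(z) for z in linear(weights, bias, obj_feat)]
    return [g * c for g, c in zip(gate, clip_feat)]


# Toy example: 3-d clip feature, 2-d object feature (made-up numbers).
clip = [2.0, -1.0, 0.5]
obj = [1.0, 0.0]
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # zero weights, so every gate is 0.5
b = [0.0, 0.0, 0.0]
fused = gated_fuse(clip, obj, W, b)
print(fused)  # [1.0, -0.5, 0.25]
```

Because each gate value lies strictly between 0 and 1, every fused channel in this sketch is bounded in magnitude by the corresponding clip channel. Whether the paper's module relies on the same property to keep gradients stable is not stated in the text above.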

If this is right

Noun prediction accuracy rises when object detection features are supplied to the 3D CNN through the aggregator.
The performance gain holds on both kitchens the model has seen and on entirely new kitchens.
The aggregator prevents gradient explosion while allowing the two feature streams to interact.
Overall verb-noun action recognition improves as a direct result of better noun scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gated fusion pattern could be tested on other video tasks that already have access to pre-trained detectors.
The approach points toward lighter supervision strategies that combine cheap object labels with video-level action labels.
Similar modules might help action models cope with the visual degradations typical of head-mounted cameras outside kitchens.

Load-bearing premise

Object detectors can extract reliable features from the same egocentric videos that contain small objects, motion blur, and occlusions.

What would settle it

Train identical 3D CNNs with and without the object features and Gated Feature Aggregator, then compare noun top-1 accuracy on the unseen test set; if accuracy does not rise, the guidance claim is false.
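
The comparison described here comes down to computing noun top-1 accuracy for two sets of predictions over the same clips. A minimal Python sketch, assuming each model outputs a list of per-class scores per clip; the helper name `top1_accuracy` and the toy scores are hypothetical and are not results from the paper:

```python
def top1_accuracy(scores, labels):
    """Fraction of clips whose highest-scoring class equals the label."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and aligned")
    correct = 0
    for clip_scores, label in zip(scores, labels):
        # Index of the largest score; ties resolve to the first maximum.
        predicted = max(range(len(clip_scores)), key=clip_scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Made-up noun scores for 4 clips and 3 noun classes (not real results).
labels = [0, 2, 1, 2]
baseline_scores = [[0.6, 0.3, 0.1], [0.5, 0.2, 0.3], [0.2, 0.7, 0.1], [0.4, 0.4, 0.2]]
guided_scores = [[0.7, 0.2, 0.1], [0.2, 0.2, 0.6], [0.1, 0.8, 0.1], [0.5, 0.3, 0.2]]

baseline_acc = top1_accuracy(baseline_scores, labels)
guided_acc = top1_accuracy(guided_scores, labels)
print(baseline_acc, guided_acc)  # 0.5 0.75
```

In a real test, both score lists would come from models trained with identical settings apart from the object features and the aggregator, and would be evaluated on the unseen-kitchen split.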

Figures

Figures reproduced from arXiv: 1906.09383 by Linchao Zhu, Xiaohan Wang, Yi Yang, Yu Wu.

**Figure 1.** The overall framework of our approach.
Excerpt from the adjacent paper text: "…the-art method for third-person video recognition does not achieve promising results, especially on noun classification. The attention mechanism is efficient to locate the region of interest on the feature map. Sudhakaran et al. [16] propose a Long Short-Term Attention model to focus on features from relevant spatial parts. They extend LSTM with a recurrent attentio…"

**Figure 2.** The two different types of GFA.
Excerpt from the adjacent paper text (Section 3.2, Gated Feature Aggregator): "Wu et al. [19] propose to concatenate the object feature and the clip feature directly as the final representation. However, in our experiments, this method is sensitive to the backbones of 3D CNN and detector. When the two branches have different backbones, e.g., I3D and ResNext, the training loss is difficult to converge thus the final perfo…"

Original abstract

In this report, we present the Baidu-UTS submission to the EPIC-Kitchens Action Recognition Challenge in CVPR 2019. This is the winning solution to this challenge. In this task, the goal is to predict verbs, nouns, and actions from the vocabulary for each video segment. The EPIC-Kitchens dataset contains various small objects, intense motion blur, and occlusions. It is challenging to locate and recognize the object that an actor interacts with. To address these problems, we utilize object detection features to guide the training of 3D Convolutional Neural Networks (CNN), which can significantly improve the accuracy of noun prediction. Specifically, we introduce a Gated Feature Aggregator module to learn from the clip feature and the object feature. This module can strengthen the interaction between the two kinds of activations and avoid gradient exploding. Experimental results demonstrate our approach outperforms other methods on both seen and unseen test set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a competition report on the 2019 EPIC-Kitchens winner that combines object detection features with 3D CNNs via a gated aggregator, but offers little beyond the leaderboard result.

The paper reports the Baidu-UTS entry that won the EPIC-Kitchens action recognition challenge. It uses object detection to supply features that guide 3D CNN training, with a Gated Feature Aggregator meant to fuse clip and object activations while controlling gradient flow. The main practical payoff is better noun prediction on both seen and unseen kitchens, which aligns with the dataset's issues of small objects, blur, and occlusion. The win itself supplies the primary evidence that the combination worked on the held-out test sets.

Referee Report

1 major / 1 minor

Summary. The manuscript presents the Baidu-UTS winning submission to the EPIC-Kitchens Action Recognition Challenge 2019. It claims that object detection features, combined with a Gated Feature Aggregator module that fuses clip and object features, significantly improve noun prediction accuracy for 3D CNNs on egocentric videos containing small objects, motion blur, and occlusions, leading to top performance on both seen and unseen test sets.

Significance. If the performance claims hold, the work supplies a concrete, leaderboard-validated demonstration that auxiliary object detection features can be effectively integrated into video action models for egocentric settings. The competition result on unseen kitchens provides an external falsification test of the approach's robustness.

major comments (1)

[Abstract] The central claim that object detection features 'significantly improve the accuracy of noun prediction' and that the method 'outperforms other methods on both seen and unseen test set' is stated without any numerical results, ablation tables, or per-class error analysis, leaving the magnitude and reliability of the reported noun improvement unsubstantiated in the manuscript.

minor comments (1)

[Method] The description of the Gated Feature Aggregator remains at a conceptual level; adding the precise equations or block diagram would clarify how the gating avoids gradient explosion while strengthening feature interaction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on our manuscript. We address it point-by-point below and will incorporate revisions as indicated.

Referee: [Abstract] The central claim that object detection features 'significantly improve the accuracy of noun prediction' and that the method 'outperforms other methods on both seen and unseen test set' is stated without any numerical results, ablation tables, or per-class error analysis, leaving the magnitude and reliability of the reported noun improvement unsubstantiated in the manuscript.

Authors: The comment correctly notes that the abstract itself contains no numerical values. The body of the manuscript reports the leaderboard results on seen and unseen test sets along with ablation experiments on the Gated Feature Aggregator. To make the central claims more concrete within the space constraints of the abstract, we will revise the abstract to include the key noun-accuracy deltas and overall action-recognition improvements achieved on the official test sets.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

This is an empirical competition report describing a 3D CNN architecture augmented with object detection features and a Gated Feature Aggregator. The central performance claims are validated directly on held-out test sets from the EPIC-Kitchens challenge leaderboard rather than derived from equations or first-principles arguments. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the method is evaluated end-to-end on external data, rendering the result self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The Gated Feature Aggregator is a new architectural component but is not an invented physical entity.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we introduce a Gated Feature Aggregator module to learn from the clip feature and the object feature"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "This is the winning solution to this challenge"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
[3] Fabien Baradel, Natalia Neverova, Christian Wolf, Julien Mille, and Greg Mori. Object level visual reasoning in videos. In ECCV, 2018.
[4] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[5] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The EPIC-Kitchens dataset. In ECCV, 2018.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019.
[8] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In CVPR, 2017.
[9] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR, 2018.
[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In ICCV, 2017.
[11] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2016.
[12] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[13] Antoine Miech, Ivan Laptev, and Josef Sivic. Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905, 2017.
[14] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. T-PAMI, 2015.
[15] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014.
[16] Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. LSTA: Long short-term attention for egocentric action recognition. In CVPR, 2019.
[17] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[18] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[19] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. Long-term feature banks for detailed video understanding. In CVPR, 2019.
[20] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[21] Zhongwen Xu, Yi Yang, and Alex G. Hauptmann. A discriminative CNN video representation for event detection. In CVPR, 2015.
[22] Linchao Zhu, Laura Sevilla-Lara, Du Tran, Matt Feiszli, Yi Yang, and Heng Wang. Faster recurrent networks for video classification. arXiv preprint arXiv:1906.04226, 2019.
[23] Linchao Zhu, Zhongwen Xu, and Yi Yang. Bidirectional multirate reconstruction for temporal modeling in videos. In CVPR, 2017.
[24] Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann. Uncovering the temporal context for video question answering. IJCV, 2017.
