arxiv: 1906.09383 · v1 · pith:7MKIX7R2new · submitted 2019-06-22 · 💻 cs.CV

Baidu-UTS Submission to the EPIC-Kitchens Action Recognition Challenge 2019

Pith reviewed 2026-05-25 18:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords action recognition · egocentric video · object detection · 3D CNN · gated feature aggregator · noun prediction · EPIC-Kitchens

The pith

Object detection features guide 3D CNN training via a gated aggregator to raise noun prediction accuracy in egocentric kitchen videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The submission demonstrates that object detection outputs can steer the learning of 3D convolutional networks so that they more accurately identify the nouns in first-person action clips. These videos contain small items, heavy blur, and frequent occlusions, which make direct noun recognition difficult from motion features alone. The authors insert a Gated Feature Aggregator that combines clip-level activations with object features, controls their interaction, and prevents gradient explosion during training. The resulting model records higher noun accuracy both on kitchens seen during training and on entirely new kitchens. This approach produced the winning entry in the 2019 EPIC-Kitchens action recognition challenge.

Core claim

Object detection features guide the training of 3D CNNs through a Gated Feature Aggregator module that learns from both the clip feature and the object feature. The module strengthens the interaction between the two kinds of activations and avoids exploding gradients, which the authors report significantly improves noun prediction accuracy on both the seen and unseen test sets.

What carries the argument

The Gated Feature Aggregator module, which fuses clip features with object detection features while controlling their interaction to strengthen noun signals.
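
A minimal sketch of what such a gated fusion could look like, in plain Python. The exact equations of the Gated Feature Aggregator are not given on this page, so the form used here (a sigmoid gate computed from the concatenated features, applied to a tanh-squashed object projection and added to the clip feature) is an assumption for illustration. The names `gated_fusion`, `init_params`, `W_g`, and `W_o` are likewise hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def linear(weights, bias, vec):
    """Dense layer: `weights` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def gated_fusion(clip_feat, obj_feat, params):
    """Fuse a clip feature with an object feature through a learned gate.

    Hypothetical form, NOT the paper's equations:
        gate  = sigmoid(W_g [clip; obj] + b_g)
        fused = clip + gate * tanh(W_o obj + b_o)
    """
    joint = clip_feat + obj_feat  # list concatenation of the two features
    gate = [sigmoid(z) for z in linear(params["W_g"], params["b_g"], joint)]
    obj_proj = [math.tanh(z)
                for z in linear(params["W_o"], params["b_o"], obj_feat)]
    return [c + g * o for c, g, o in zip(clip_feat, gate, obj_proj)]


def init_params(clip_dim, obj_dim, seed=0):
    """Small random weights standing in for trained parameters."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "W_g": mat(clip_dim, clip_dim + obj_dim),
        "b_g": [0.0] * clip_dim,
        "W_o": mat(clip_dim, obj_dim),
        "b_o": [0.0] * clip_dim,
    }


params = init_params(clip_dim=4, obj_dim=3)
clip = [0.5, -1.0, 2.0, 0.0]
obj = [100.0, -100.0, 50.0]  # deliberately large object activations
fused = gated_fusion(clip, obj, params)
```

Because the gate lies between 0 and 1 and the projection between -1 and 1, the object branch can shift each clip activation by at most 1 in this sketch, however large the raw detector activations are. That boundedness is one plausible reading of how gating could keep the fused signal stable; it is not a claim about the paper's actual mechanism.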

If this is right

  • Noun prediction accuracy rises when object detection features are supplied to the 3D CNN through the aggregator.
  • The performance gain holds both on kitchens the model has seen and on entirely new kitchens.
  • The aggregator prevents gradient explosion while allowing the two feature streams to interact.
  • Overall verb-noun action recognition improves as a direct result of better noun scores.
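
The last point can be made concrete with a small sketch, assuming an action counts as correct only when both its verb and its noun are correct. The labels and predictions are invented toy values, not results from the paper, and `action_accuracy` is a hypothetical helper name.

```python
def action_accuracy(verb_pred, noun_pred, verb_true, noun_true):
    """Fraction of clips where BOTH the verb and the noun are predicted correctly."""
    hits = sum(
        vp == vt and npred == nt
        for vp, npred, vt, nt in zip(verb_pred, noun_pred, verb_true, noun_true)
    )
    return hits / len(verb_true)


# Toy ground truth and predictions (illustrative only).
verb_true = ["open", "take", "wash", "cut"]
noun_true = ["fridge", "knife", "pan", "onion"]
verb_pred = ["open", "take", "wash", "put"]          # verbs unchanged in both runs

noun_before = ["fridge", "spoon", "plate", "onion"]  # weaker noun predictions
noun_after = ["fridge", "knife", "pan", "onion"]     # improved noun predictions

acc_before = action_accuracy(verb_pred, noun_before, verb_true, noun_true)
acc_after = action_accuracy(verb_pred, noun_after, verb_true, noun_true)
print(acc_before, acc_after)  # 0.25 0.75
```

With the verb predictions held fixed, every noun that flips from wrong to right on a clip whose verb was already correct adds one correct action, which is why noun gains can carry over to action accuracy.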

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gated fusion pattern could be tested on other video tasks that already have access to pre-trained detectors.
  • The approach points toward lighter supervision strategies that combine cheap object labels with video-level action labels.
  • Similar modules might help action models cope with the visual degradations typical of head-mounted cameras outside kitchens.

Load-bearing premise

Object detectors can extract reliable features from the same egocentric videos that contain small objects, motion blur, and occlusions.

What would settle it

Train identical 3D CNNs with and without the object features and Gated Feature Aggregator, then compare noun top-1 accuracy on the unseen test set; if accuracy does not rise, the guidance claim is false.
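
The comparison step of that test could be scripted along these lines. This is a sketch under stated assumptions: each model is taken to emit one list of per-class noun scores per clip, the helper names `top1_accuracy` and `ablation_delta` are hypothetical, and the scores are toy values rather than numbers from the paper.

```python
def top1_accuracy(scores, labels):
    """scores: one list of per-class scores per clip; labels: integer class ids."""
    correct = 0
    for row, label in zip(scores, labels):
        pred = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += pred == label
    return correct / len(labels)


def ablation_delta(baseline_scores, guided_scores, labels):
    """Return (baseline accuracy, guided accuracy, guided minus baseline)."""
    base = top1_accuracy(baseline_scores, labels)
    guided = top1_accuracy(guided_scores, labels)
    return base, guided, guided - base


# Toy noun scores for three clips and three noun classes (illustrative only).
labels = [0, 2, 1]
baseline = [[0.6, 0.3, 0.1], [0.5, 0.2, 0.3], [0.2, 0.5, 0.3]]  # 2 of 3 correct
guided = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.2, 0.6, 0.2]]    # 3 of 3 correct

base_acc, guided_acc, delta = ablation_delta(baseline, guided, labels)
```

A positive `delta` on the unseen split would be consistent with the guidance claim; a zero or negative one would count against it. A real run would also need matched training budgets and several seeds before the difference could be trusted.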

Figures

Figures reproduced from arXiv: 1906.09383 by Linchao Zhu, Xiaohan Wang, Yi Yang, Yu Wu.

Figure 1
Figure 1: The overall framework of our approach.
Adjacent paper text captured with the caption: "…the-art method for third-person video recognition does not achieve promising results, especially on noun classification. The attention mechanism is efficient to locate the region of interest on the feature map. Sudhakaran et al. [16] propose a Long Short-Term Attention model to focus on features from relevant spatial parts. They extend LSTM with a recurrent attentio…"
Figure 2
Figure 2: The two different types of GFA.
Adjacent paper text captured with the caption: "3.2. Gated Feature Aggregator. Wu et al. [19] propose to concatenate the object feature and the clip feature directly as the final representation. However, in our experiments, this method is sensitive to the backbones of 3D CNN and detector. When the two branches have different backbones, e.g., I3D and ResNext, the training loss is difficult to converge thus the final perfo…"

Original abstract

In this report, we present the Baidu-UTS submission to the EPIC-Kitchens Action Recognition Challenge in CVPR 2019. This is the winning solution to this challenge. In this task, the goal is to predict verbs, nouns, and actions from the vocabulary for each video segment. The EPIC-Kitchens dataset contains various small objects, intense motion blur, and occlusions. It is challenging to locate and recognize the object that an actor interacts with. To address these problems, we utilize object detection features to guide the training of 3D Convolutional Neural Networks (CNN), which can significantly improve the accuracy of noun prediction. Specifically, we introduce a Gated Feature Aggregator module to learn from the clip feature and the object feature. This module can strengthen the interaction between the two kinds of activations and avoid gradient exploding. Experimental results demonstrate our approach outperforms other methods on both seen and unseen test set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents the Baidu-UTS winning submission to the EPIC-Kitchens Action Recognition Challenge 2019. It claims that object detection features, combined with a Gated Feature Aggregator module that fuses clip and object features, significantly improve noun prediction accuracy for 3D CNNs on egocentric videos containing small objects, motion blur, and occlusions, leading to top performance on both seen and unseen test sets.

Significance. If the performance claims hold, the work supplies a concrete, leaderboard-validated demonstration that auxiliary object detection features can be effectively integrated into video action models for egocentric settings. The competition result on unseen kitchens provides an external falsification test of the approach's robustness.

major comments (1)
  1. [Abstract] The central claim that object detection features 'significantly improve the accuracy of noun prediction' and that the method 'outperforms other methods on both seen and unseen test set' is stated without any numerical results, ablation tables, or per-class error analysis, leaving the magnitude and reliability of the reported noun improvement unsubstantiated in the manuscript.
minor comments (1)
  1. [Method] The description of the Gated Feature Aggregator remains at a conceptual level; adding the precise equations or block diagram would clarify how the gating avoids gradient explosion while strengthening feature interaction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on our manuscript. We address it point-by-point below and will incorporate revisions as indicated.

  1. Referee: [Abstract] The central claim that object detection features 'significantly improve the accuracy of noun prediction' and that the method 'outperforms other methods on both seen and unseen test set' is stated without any numerical results, ablation tables, or per-class error analysis, leaving the magnitude and reliability of the reported noun improvement unsubstantiated in the manuscript.

    Authors: The comment correctly notes that the abstract itself contains no numerical values. The body of the manuscript reports the leaderboard results on seen and unseen test sets along with ablation experiments on the Gated Feature Aggregator. To make the central claims more concrete within the space constraints of the abstract, we will revise the abstract to include the key noun-accuracy deltas and overall action-recognition improvements achieved on the official test sets.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

This is an empirical competition report describing a 3D CNN architecture augmented with object detection features and a Gated Feature Aggregator. The central performance claims are validated directly on held-out test sets from the EPIC-Kitchens challenge leaderboard rather than derived from equations or first-principles arguments. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the method is evaluated end-to-end on external data, rendering the result self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The Gated Feature Aggregator is a new architectural component but is not an invented physical entity.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

  1. [1]

    YouTube-8M: A Large-Scale Video Classification Benchmark

    Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016

  2. [2]

    Bottom-up and top-down attention for image captioning and visual question answering

    Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018

  3. [3]

    Object level visual reasoning in videos

    Fabien Baradel, Natalia Neverova, Christian Wolf, Julien Mille, and Greg Mori. Object level visual reasoning in videos. In ECCV, 2018

  4. [4]

    Quo vadis, action recognition? a new model and the kinetics dataset

    João Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017

  5. [5]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. In ECCV, 2018

  6. [6]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

  7. [7]

    Large-scale weakly-supervised pre-training for video action recognition

    Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019

  8. [8]

    Actionvlad: Learning spatio-temporal aggregation for action classification

    Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. Actionvlad: Learning spatio-temporal aggregation for action classification. In CVPR, 2017

  9. [9]

    Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?

    Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In CVPR, 2018

  10. [10]

    Mask R-CNN

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In ICCV, 2017

  11. [11]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2016

  12. [12]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017

  13. [13]

    Learnable pooling with Context Gating for video classification

    Antoine Miech, Ivan Laptev, and Josef Sivic. Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905, 2017

  14. [14]

    Faster R-CNN: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. T-PAMI, 2015

  15. [15]

    Two-stream convolutional networks for action recognition in videos

    Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014

  16. [16]

    LSTA: Long short-term attention for egocentric action recognition

    Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. LSTA: Long short-term attention for egocentric action recognition. In CVPR, 2019

  17. [17]

    Learning spatiotemporal features with 3d convolutional networks

    Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015

  18. [18]

    Temporal segment networks: Towards good practices for deep action recognition

    Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016

  19. [19]

    Long-term feature banks for detailed video understanding

    Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In CVPR, 2019

  20. [20]

    Aggregated residual transformations for deep neural networks

    Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017

  21. [21]

    A discriminative CNN video representation for event detection

    Zhongwen Xu, Yi Yang, and Alex G Hauptmann. A discriminative CNN video representation for event detection. In CVPR, 2015

  22. [22]

    Faster recurrent networks for video classification

    Linchao Zhu, Laura Sevilla-Lara, Du Tran, Matt Feiszli, Yi Yang, and Heng Wang. Faster recurrent networks for video classification. arXiv preprint arXiv:1906.04226, 2019

  23. [23]

    Bidirectional multirate reconstruction for temporal modeling in videos

    Linchao Zhu, Zhongwen Xu, and Yi Yang. Bidirectional multirate reconstruction for temporal modeling in videos. In CVPR, 2017

  24. [24]

    Uncovering the temporal context for video question answering

    Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G Hauptmann. Uncovering the temporal context for video question answering. IJCV, 2017