FBK-HUPBA Submission to the EPIC-Kitchens 2019 Action Recognition Challenge

Oswald Lanz; Sergio Escalera; Swathikiran Sudhakaran

arxiv: 1906.08960 · v1 · pith:ZB7UCOX4new · submitted 2019-06-21 · 💻 cs.CV

Pith reviewed 2026-05-25 19:17 UTC · model grok-4.3

classification 💻 cs.CV

keywords: action recognition · egocentric video · EPIC-Kitchens · CNN-LSTA · HF-TSN · ensemble · video classification · deep learning

The pith

An ensemble of CNN-LSTA and HF-TSN variants achieves 35.54% top-1 accuracy on EPIC-Kitchens 2019 S1 action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports the technical details behind a submission to the EPIC-Kitchens 2019 action recognition challenge. The authors created multiple variants of the CNN-LSTA and HF-TSN architectures and combined their predictions in an ensemble. The resulting system recorded 35.54% top-1 accuracy on the S1 test setting and 20.25% on the S2 setting according to the public leaderboard. A reader focused on practical video understanding would see how these two model families together address the demands of recognizing fine-grained actions from wearable camera footage in kitchen environments.

Core claim

The FBK-HUPBA submission compiled predictions from an ensemble of CNN-LSTA and HF-TSN model variants and attained top-1 action recognition accuracies of 35.54% on the S1 setting and 20.25% on the S2 setting of the EPIC-Kitchens 2019 challenge.
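The metric behind these figures can be illustrated with a minimal sketch. It assumes, following the EPIC-Kitchens dataset [1], that an action label is a (verb, noun) pair and that a prediction counts as correct only when both parts match. The function name and the toy labels are hypothetical; the official scoring is done by the challenge server on held-out annotations, not by this code.

```python
from typing import Sequence, Tuple

# ASSUMPTION: an action is a (verb_id, noun_id) pair; both must match.
Action = Tuple[int, int]


def top1_action_accuracy(
    predicted: Sequence[Action], ground_truth: Sequence[Action]
) -> float:
    """Return the percentage of clips whose top-ranked action is correct."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have equal length")
    if not ground_truth:
        raise ValueError("need at least one clip to score")
    correct = sum(1 for p, g in zip(predicted, ground_truth) if p == g)
    return 100.0 * correct / len(ground_truth)


# Toy data (hypothetical): four clips, two of them fully correct.
toy_predictions = [(0, 1), (2, 3), (4, 5), (1, 1)]
toy_labels = [(0, 1), (2, 9), (4, 5), (0, 1)]
toy_accuracy = top1_action_accuracy(toy_predictions, toy_labels)
```

In the toy data the second clip has the right verb but the wrong noun, and the fourth has the right noun but the wrong verb, so neither counts toward action accuracy.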

What carries the argument

Ensemble compiled out of multiple CNN-LSTA and HF-TSN variants that aggregates class predictions from the two families of deep models.
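The report does not say how the ensemble was compiled, so the following is only a sketch of one common approach, late fusion by averaging per-model class probabilities. The averaging rule, the equal default weights, and all function names are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_scores(
    per_model_logits: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-model class probabilities for one clip."""
    if not per_model_logits:
        raise ValueError("need at least one model")
    if weights is None:
        # ASSUMPTION: equal weighting; the paper gives no fusion weights.
        weights = [1.0] * len(per_model_logits)
    if len(weights) != len(per_model_logits):
        raise ValueError("one weight per model is required")
    weight_sum = sum(weights)
    num_classes = len(per_model_logits[0])
    fused = [0.0] * num_classes
    for weight, logits in zip(weights, per_model_logits):
        if len(logits) != num_classes:
            raise ValueError("all models must score the same classes")
        for class_id, prob in enumerate(softmax(logits)):
            fused[class_id] += (weight / weight_sum) * prob
    return fused


def ensemble_predict(
    per_model_logits: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> int:
    """Return the class index with the highest fused probability."""
    fused = ensemble_scores(per_model_logits, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores from one model of each family over three classes.
lsta_like = [2.0, 1.0, 0.0]
hf_tsn_like = [0.0, 1.0, 3.0]
fused_class = ensemble_predict([lsta_like, hf_tsn_like])
```

Averaging probabilities rather than raw logits keeps models with differently scaled outputs from dominating the vote. Whether the authors fused at this level is not stated in the report.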

If this is right

The ensemble of the two model families produces higher accuracy than either family alone would achieve.
CNN-LSTA variants contribute temporal modeling suited to the sequential nature of kitchen actions.
HF-TSN variants contribute hierarchical feature aggregation [2], complementing the recurrent CNN-LSTA family.
The reported scores establish a concrete performance level for any future method submitted to the same splits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The results imply that mixing recurrent and two-stream temporal models can compensate for the limited field of view and motion blur typical in egocentric recordings.
Similar ensembles could be tested on other first-person video datasets without retraining the base architectures from scratch.
The gap between S1 and S2 performance points to sensitivity in how the models generalize across different participants or environments.
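The size of that gap follows directly from the two reported scores. The short sketch below works it out; the function name is hypothetical and the only inputs are the leaderboard figures quoted above.

```python
from typing import Tuple

S1_TOP1 = 35.54  # reported top-1 action accuracy on S1, in percent
S2_TOP1 = 20.25  # reported top-1 action accuracy on S2, in percent


def accuracy_gap(reference: float, other: float) -> Tuple[float, float]:
    """Return (absolute gap in points, gap relative to reference in percent)."""
    if reference <= 0:
        raise ValueError("reference accuracy must be positive")
    absolute = reference - other
    relative = 100.0 * absolute / reference
    return absolute, relative


absolute_gap, relative_gap = accuracy_gap(S1_TOP1, S2_TOP1)
```

The absolute gap is 15.29 percentage points, a relative drop of roughly 43% from S1 to S2. The numbers alone do not say which factor drives the drop.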

Load-bearing premise

The CNN-LSTA and HF-TSN variants were trained on the challenge data without leakage or overfitting and the ensemble was scored according to the official protocol.

What would settle it

An independent run of the same model variants and ensemble procedure on the hidden test sets that produces accuracy figures different from the reported 35.54% and 20.25%.

Figures

Figures reproduced from arXiv: 1906.08960 by Oswald Lanz, Sergio Escalera, Swathikiran Sudhakaran.

**Figure 1.** Block diagram illustrating the two model families used for generating the action recognition scores. The first model ... (caption truncated in the source)

Original abstract

In this report we describe the technical details of our submission to the EPIC-Kitchens 2019 action recognition challenge. To participate in the challenge we have developed a number of CNN-LSTA [3] and HF-TSN [2] variants, and submitted predictions from an ensemble compiled out of these two model families. Our submission, visible on the public leaderboard with team name FBK-HUPBA, achieved a top-1 action recognition accuracy of 35.54% on S1 setting, and 20.25% on S2 setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a short technical report on one competition entry that adds no new methods or analysis.

The main thing here is that this is not a research paper. It is a brief report on the FBK-HUPBA submission to the EPIC-Kitchens 2019 challenge. The authors took variants of two existing models (CNN-LSTA and HF-TSN), ensembled them, and report the resulting leaderboard scores of 35.54% top-1 on S1 and 20.25% on S2. Those numbers sit on the public leaderboard, so the central claim is independently checkable and does not depend on hidden experiments or modeling assumptions that could be falsified from the text alone. The stress-test note is correct on that point. Beyond the numbers, the paper introduces nothing new. Both base models are cited from prior publications, and the work contains no new architecture, training procedure, ablation, or insight into why the ensemble performed as it did. The text is also thin on implementation details, which limits what anyone can learn or reproduce, though that is typical for a pure leaderboard report. This note will interest only readers who are tracking the specific challenge and want a quick record of one team's approach. It has no broader value for action recognition research and does not resolve open questions. I would not bring it to a reading group, would not cite it, and would not send it for peer review. It is fine as an arXiv record of the submission but does not need referee time.

Referee Report

0 major / 2 minor

Summary. The manuscript is a short report on the FBK-HUPBA submission to the EPIC-Kitchens 2019 Action Recognition Challenge. It states that variants of the CNN-LSTA and HF-TSN architectures were developed and combined into an ensemble whose predictions were submitted, yielding a public-leaderboard top-1 accuracy of 35.54% on the S1 setting and 20.25% on the S2 setting.

Significance. If the reported leaderboard scores are accurate, the work documents a competitive entry on a challenging egocentric video dataset. The public verifiability of the numbers on the official leaderboard constitutes a modest strength, as it permits independent confirmation without reliance on unreleased code or internal validation splits. However, the manuscript introduces no new methodological contributions beyond the two cited base models and therefore has limited significance for advancing the broader field of action recognition.

minor comments (2)

[Abstract] The abstract asserts that the report 'describe[s] the technical details' of the CNN-LSTA and HF-TSN variants, yet the manuscript supplies no information on architecture modifications, training protocols, hyper-parameters, or ensemble construction.
References [2] and [3] are cited but the manuscript contains no References section or bibliographic details for these works.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for reviewing our manuscript. We address the key point raised in the significance assessment below.

Referee: the manuscript introduces no new methodological contributions beyond the two cited base models and therefore has limited significance for advancing the broader field of action recognition.

Authors: We agree that the manuscript does not introduce new methodological contributions. It is explicitly a short technical report documenting the details of our EPIC-Kitchens 2019 challenge submission, including the specific CNN-LSTA and HF-TSN variants we developed and the ensemble we formed. The primary contribution is the public, verifiable leaderboard performance (35.54% top-1 on S1 and 20.25% on S2) achieved by this ensemble. Such reports serve the community by providing concrete, reproducible details on competitive approaches for this dataset without claiming methodological novelty.

Revision: no

Circularity Check

0 steps flagged

No significant circularity

The manuscript is a competition report that states the public leaderboard scores achieved by an ensemble of CNN-LSTA and HF-TSN variants. No derivation chain, equations, fitted parameters presented as predictions, or self-referential uniqueness claims exist in the text. The central factual claim (35.54% S1, 20.25% S2) is externally verifiable on the EPIC-Kitchens leaderboard and does not reduce to any internal construction or self-citation load-bearing step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a short competition report relying on existing published models; no new mathematical axioms, free parameters, or invented entities are introduced or required for the central claim. Model variants in deep learning typically involve many unfixed hyperparameters, but none are specified here.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 1 internal anchor

[1] D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. In Proc. ECCV, 2018.

[2] S. Sudhakaran, S. Escalera, and O. Lanz. Hierarchical Feature Aggregation Networks for Video Action Recognition. arXiv preprint arXiv:1905.12462, 2019.

[3] S. Sudhakaran, S. Escalera, and O. Lanz. LSTA: Long Short-Term Attention for Egocentric Action Recognition. In Proc. CVPR, 2019.

[4] S. Sudhakaran and O. Lanz. Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition. In Proc. BMVC, 2018.
