Two-stream Spatiotemporal Feature for Video QA Task

Chiwan Song; Sung-Eui Yoon; Woobin Im

arxiv: 1907.05006 · v1 · pith:X6RUQWIVnew · submitted 2019-07-11 · 💻 cs.CV

Two-stream Spatiotemporal Feature for Video QA Task

Chiwan Song , Woobin Im , Sung-Eui Yoon This is my paper

Pith reviewed 2026-05-24 23:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords two-stream networkspatiotemporal featuresvideo question answeringTVQAsqueeze-and-excitationcontext matchingsmoothed ranking loss

0 comments

The pith

A two-stream network from action recognition serves as a spatiotemporal feature extractor that improves text-only video QA on TVQA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to show that a two-stream network structure, successful in action recognition, can be adapted as a spatiotemporal video feature extractor for video question answering. By incorporating squeeze-and-excitation for channel attention and a context matching module to bridge visual and textual features, the model jointly processes video and question data. It uses a scoring mechanism with smoothed ranking loss to pick the correct answer. Tests on the TVQA dataset indicate gains in text-only mode, pointing to both potential and limits when visual features are included.

Core claim

We propose a multi-channel neural network that adopts a two-stream network structure as a spatiotemporal video feature extractor for the video QA task. We adopt a squeeze-and-excitation structure for channel-wise attended spatiotemporal features. A context matching module with a level adjusting layer removes the information gap between visual and textual features using attention. A scoring mechanism and smoothed ranking loss select the correct answer. Evaluation on TVQA shows improved results in the textual only setting, but limitations and possibilities when using visual features.

What carries the argument

Two-stream network structure used as spatiotemporal video feature extractor with squeeze-and-excitation and context matching module.

If this is right

The two-stream structure provides improved results in textual only video QA setting.
Squeeze-and-excitation achieves channel-wise attended spatiotemporal feature.
Context matching module with level adjusting layer enables joint modeling of visual and textual features.
Scoring mechanism and smoothed ranking loss select the correct answer from candidates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the limitations with visual features are resolved, the approach may enable more robust multimodal video QA systems.
The method could extend to other tasks requiring integration of video content with natural language questions.
Further experiments on additional video QA datasets would test the general applicability of the two-stream adaptation.

Load-bearing premise

The two-stream network structure that works well for human action recognition can be directly adopted as a spatiotemporal feature extractor for video QA tasks.

What would settle it

An experiment replacing the two-stream extractor with a standard video encoder on the TVQA dataset and observing no improvement or worse performance in the textual setting would falsify the claim of effective adoption.

Figures

Figures reproduced from arXiv: 1907.05006 by Chiwan Song, Sung-Eui Yoon, Woobin Im.

**Figure 1.** Figure 1: The figure of our multi-channel neural network structure with two-stream spatiotemporal video feature extractor [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Our two-stream I3D with the Squeeze-and-Excitation structure. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Our Squeeze-and-Excitation structure focused on related objects or actions. This kind of gap between two modalities raise a difficulty in context matching stage where visual and textual information is merged. To make our feature extractor concentrate more on the crucial objects in the video frames, we utilize the Squeezeand-Excitation (SE) structure [16] and integrate it within the two-stream I3D, where a… view at source ↗

read the original abstract

Understanding the content of videos is one of the core techniques for developing various helpful applications in the real world, such as recognizing various human actions for surveillance systems or customer behavior analysis in an autonomous shop. However, understanding the content or story of the video still remains a challenging problem due to its sheer amount of data and temporal structure. In this paper, we propose a multi-channel neural network structure that adopts a two-stream network structure, which has been shown high performance in human action recognition field, and use it as a spatiotemporal video feature extractor for solving video question and answering task. We also adopt a squeeze-and-excitation structure to two-stream network structure for achieving a channel-wise attended spatiotemporal feature. For jointly modeling the spatiotemporal features from video and the textual features from the question, we design a context matching module with a level adjusting layer to remove the gap of information between visual and textual features by applying attention mechanism on joint modeling. Finally, we adopt a scoring mechanism and smoothed ranking loss objective function for selecting the correct answer from answer candidates. We evaluate our model with TVQA dataset, and our approach shows the improved result in textual only setting, but the result with visual feature shows the limitation and possibility of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adapts two-stream action recognition to video QA but only improves in the text-only setting on TVQA.

read the letter

The headline take is that this paper tries to apply two-stream spatiotemporal features to video question answering but finds that it only helps in the text-only case. They propose a multi-channel network that uses the two-stream structure as a video feature extractor, incorporates squeeze-and-excitation for attended features, and adds a context matching module with attention to handle the gap between visual and textual information. They use a scoring mechanism and smoothed ranking loss for answer selection. On the TVQA dataset, the approach improves results when using only text but shows limitations with the visual features. What the paper does well is describe a complete system that combines these elements in a logical way. The context matching module seems like a reasonable attempt to align the features from different modalities. Being explicit about the limitation with visual features is a positive, as it gives a realistic picture rather than claiming success across the board. The soft spots are in the strength of the evidence. The improvement is only in the textual setting, which suggests that the two-stream adaptation did not provide the expected benefit for video understanding in QA. Relying on one dataset means we do not know if this is a general issue or specific to TVQA. The work depends on various network hyperparameters, which is typical but requires careful validation. This paper is for researchers working on video QA who are looking at ways to incorporate action recognition techniques. It would be useful for someone trying to understand why such transfers can be tricky. It does not deserve a serious referee because the evidence does not support the utility of the visual component. I would not recommend engaging with this work for peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a multi-channel neural network that adopts a two-stream architecture (previously successful in action recognition) as a spatiotemporal video feature extractor for video question answering. It augments this with squeeze-and-excitation blocks for channel-wise attention, a context matching module incorporating a level adjusting layer and attention to bridge visual-textual gaps, and a scoring mechanism with smoothed ranking loss. Evaluation is performed on the TVQA dataset, with the central claim being improved results in the textual-only setting alongside acknowledged limitations when visual features are included.

Significance. If the empirical gains are substantiated with quantitative evidence, the work could illustrate a viable transfer of two-stream spatiotemporal extractors to the video QA domain, offering a concrete architecture for joint modeling of temporal video structure and questions. The explicit reporting of both gains and limitations in the textual vs. visual regimes provides a balanced starting point for further multimodal research.

major comments (2)

[Abstract] Abstract: the claim that the approach 'shows the improved result in textual only setting' supplies no quantitative numbers, error bars, baseline comparisons, or experimental details, leaving the central empirical assertion without supporting evidence.
[Abstract] Abstract: the description of adopting the two-stream network 'without substantial additional adaptation' (as the weakest assumption) is not accompanied by any ablation or comparison showing what adaptations were in fact required, undermining assessment of whether the transfer is effective.

minor comments (1)

[Abstract] The abstract uses the phrase 'multi-channel neural network structure' in a way that could be clarified relative to the two-stream component to avoid potential reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our claims.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the approach 'shows the improved result in textual only setting' supplies no quantitative numbers, error bars, baseline comparisons, or experimental details, leaving the central empirical assertion without supporting evidence.

Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript reports experimental results on the TVQA dataset with baseline comparisons in the textual-only setting. In the revised version, we will update the abstract to include specific accuracy metrics and comparisons to support the claim of improvement. revision: yes
Referee: [Abstract] Abstract: the description of adopting the two-stream network 'without substantial additional adaptation' (as the weakest assumption) is not accompanied by any ablation or comparison showing what adaptations were in fact required, undermining assessment of whether the transfer is effective.

Authors: The provided abstract does not use the exact phrasing 'without substantial additional adaptation,' but describes direct adoption of the two-stream structure with added components (squeeze-and-excitation and context matching). To address the concern, we will add a brief discussion of the specific modifications required for the video QA task and consider including an ablation if space permits in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical architecture proposal with direct testing

full rationale

The paper describes an empirical neural network proposal adopting a two-stream structure (previously successful in action recognition) as a video feature extractor for QA, plus squeeze-and-excitation, context matching, and a scoring loss. It evaluates on TVQA and explicitly reports both textual-only gains and limitations when using visual features. No equations, parameter-fitting steps, derivations, or self-citation chains are described that would reduce any claim to its own inputs by construction. The central claims are the architecture choice and the observed experimental outcomes, which are presented as tested rather than assumed or fitted. This matches the most common honest finding of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The model rests on standard deep learning assumptions about feature transfer from action recognition to QA and the effectiveness of attention for multimodal alignment; many typical neural network hyperparameters are implicitly free parameters but not enumerated in the abstract.

free parameters (1)

network hyperparameters and training settings
Standard in neural network models; learning rates, layer dimensions, and optimization choices are chosen or fitted but not detailed in the abstract.

axioms (2)

domain assumption Two-stream networks extract effective spatiotemporal features from video
Invoked when adopting the structure from action recognition for video QA.
domain assumption Attention mechanisms can remove the information gap between visual and textual features
Central to the context matching module design.

pith-pipeline@v0.9.0 · 5744 in / 1355 out tokens · 29191 ms · 2026-05-24T23:26:59.037339+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 4 internal anchors

[1]

A read-write memory network for movie story understanding,

S. Na, S. Lee, J. Kim, and G. Kim, “A read-write memory network for movie story understanding,” in Proceedings of the IEEE International Conference on Computer Vision , pp. 677–685, 2017

work page 2017
[2]

Deepstory: Video story qa by deep embedded memory networks,

K.-M. Kim, M.-O. Heo, S.-H. Choi, and B.-T. Zhang, “Deepstory: Video story qa by deep embedded memory networks,” in IJCAI, 2017

work page 2017
[3]

Tvqa: Localized, composi- tional video question answering,

J. Lei, L. Yu, M. Bansal, and T. L. Berg, “Tvqa: Localized, composi- tional video question answering,” in EMNLP, 2018

work page 2018
[4]

Motion-appearance co- memory networks for video question answering,

J. Gao, R. Ge, K. Chen, and R. Nevatia, “Motion-appearance co- memory networks for video question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 6576–6585, 2018

work page 2018
[5]

Movieqa: Understanding stories in movies through question-answering,

M. Tapaswi, Y . Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, “Movieqa: Understanding stories in movies through question-answering,” in Proceedings of the IEEE conference on com- puter vision and pattern recognition , pp. 4631–4640, 2016

work page 2016
[6]

Uncovering the temporal context for video question answering,

L. Zhu, Z. Xu, Y . Yang, and A. G. Hauptmann, “Uncovering the temporal context for video question answering,” International Journal of Computer Vision , vol. 124, no. 3, pp. 409–421, 2017

work page 2017
[7]

Tgif-qa: Towardt spatio- temporal reasoning in visual question answering,

Y . Jang, Y . Song, Y . Yu, Y . Kim, and G. Kim, “Tgif-qa: Towardt spatio- temporal reasoning in visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 2758–2766, 2017

work page 2017
[8]

Video question answering via gradually reﬁned attention over ap- pearance and motion,

D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y . Zhuang, “Video question answering via gradually reﬁned attention over ap- pearance and motion,” in Proceedings of the 25th ACM international conference on Multimedia , pp. 1645–1653, ACM, 2017

work page 2017
[9]

Marioqa: Answering questions by watching gameplay videos,

J. Mun, P. H. Seo, I. Jung, and B. Han, “Marioqa: Answering questions by watching gameplay videos,” in ICCV, 2017

work page 2017
[10]

Two-stream convolutional networks for action recognition in videos,

K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, pp. 568–576, 2014

work page 2014
[11]

Quo vadis, action recognition? a new model and the kinetics dataset,

J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 6299–6308, 2017

work page 2017
[12]

Temporal segment networks: Towards good practices for deep action recognition,

L. Wang, Y . Xiong, Z. Wang, Y . Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European conference on computer vision , pp. 20–36, Springer, 2016

work page 2016
[13]

Long short-term memory,

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997

work page 1997
[14]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

J. Chung, C. Gulcehre, K. Cho, and Y . Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[15]

Learning spatiotemporal features with 3d convolutional networks,

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Pro- ceedings of the IEEE international conference on computer vision , pp. 4489–4497, 2015

work page 2015
[16]

Squeeze-and-excitation networks,

J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018

work page 2018
[17]

Improving pairwise ranking for multi- label image classiﬁcation,

Y . Li, Y . Song, and J. Luo, “Improving pairwise ranking for multi- label image classiﬁcation,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 3617–3625, 2017

work page 2017
[18]

ImageNet: A Large-Scale Hierarchical Image Database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009

work page 2009
[19]

Identifying ﬁrst-person camera wearers in third-person videos,

C. Fan, J. Lee, M. Xu, K. Kumar Singh, Y . Jae Lee, D. J. Crandall, and M. S. Ryoo, “Identifying ﬁrst-person camera wearers in third-person videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5125–5133, 2017

work page 2017
[20]

Regional attention based deep feature for image retrieval,

J. Kim and S.-E. Yoon, “Regional attention based deep feature for image retrieval,” in Proc. British Machine Vision Conference (BMVC 2018), 2018

work page 2018
[21]

Cross-dimensional weighting for aggregated deep convolutional features,

Y . Kalantidis, C. Mellina, and S. Osindero, “Cross-dimensional weighting for aggregated deep convolutional features,” in European conference on computer vision , pp. 685–701, Springer, 2016

work page 2016
[22]

Large-scale image retrieval with attentive deep local features,

H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” inProceedings of the IEEE International Conference on Computer Vision , pp. 3456–3465, 2017

work page 2017
[23]

Bidirectional attention ﬂow for machine comprehension,

M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional attention ﬂow for machine comprehension,” ICLR, 2017

work page 2017
[24]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 770–778, 2016

work page 2016
[25]

Going deeper with convolutions,

C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 1–9, 2015

work page 2015
[26]

Glove: Global vectors for word representation,

J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on em- pirical methods in natural language processing (EMNLP) , pp. 1532– 1543, 2014

work page 2014
[27]

The Kinetics Human Action Video Dataset

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijaya- narasimhan, F. Viola, T. Green, T. Back, P. Natsev,et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Rectiﬁed linear units improve restricted boltzmann machines,

V . Nair and G. E. Hinton, “Rectiﬁed linear units improve restricted boltzmann machines,” in Proceedings of the 27th international con- ference on machine learning (ICML-10) , pp. 807–814, 2010

work page 2010
[29]

Bidirectional recurrent neural net- works,

M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net- works,” IEEE Transactions on Signal Processing , vol. 45, no. 11, pp. 2673–2681, 1997

work page 1997
[30]

Empirical Evaluation of Rectified Activations in Convolutional Network

B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectiﬁed activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[31]

An im- proved algorithm for tv-l 1 optical ﬂow,

A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers, “An im- proved algorithm for tv-l 1 optical ﬂow,” in Statistical and geometrical approaches to visual motion analysis , pp. 23–45, Springer, 2009

work page 2009
[32]

Ensemble methods in machine learning,

T. G. Dietterich, “Ensemble methods in machine learning,” in Inter- national workshop on multiple classiﬁer systems , pp. 1–15, Springer, 2000

work page 2000
[33]

Adam: A Method for Stochastic Optimization

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza- tion,” arXiv preprint arXiv:1412.6980 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[34]

Overﬁtting in neural nets: Backpropagation, conjugate gradient, and early stopping,

R. Caruana, S. Lawrence, and C. L. Giles, “Overﬁtting in neural nets: Backpropagation, conjugate gradient, and early stopping,” in Advances in neural information processing systems , pp. 402–408, 2001

work page 2001
[35]

Large-scale machine learning with stochastic gradient descent,

L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, pp. 177–186, Springer, 2010

work page 2010
[36]

Tensorﬂow: A system for large-scale machine learning,

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. , “Tensorﬂow: A system for large-scale machine learning,” in 12th{USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pp. 265– 283, 2016

work page 2016
[37]

Bottom-up and top-down attention for image captioning and visual question answering,

P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition , pp. 6077–6086, 2018

work page 2018
[38]

Visual genome: Connecting language and vision using crowdsourced dense image annotations,

R. Krishna, Y . Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y . Kalantidis, L.-J. Li, D. A. Shamma, et al. , “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision , vol. 123, no. 1, pp. 32–73, 2017

work page 2017

[1] [1]

A read-write memory network for movie story understanding,

S. Na, S. Lee, J. Kim, and G. Kim, “A read-write memory network for movie story understanding,” in Proceedings of the IEEE International Conference on Computer Vision , pp. 677–685, 2017

work page 2017

[2] [2]

Deepstory: Video story qa by deep embedded memory networks,

K.-M. Kim, M.-O. Heo, S.-H. Choi, and B.-T. Zhang, “Deepstory: Video story qa by deep embedded memory networks,” in IJCAI, 2017

work page 2017

[3] [3]

Tvqa: Localized, composi- tional video question answering,

J. Lei, L. Yu, M. Bansal, and T. L. Berg, “Tvqa: Localized, composi- tional video question answering,” in EMNLP, 2018

work page 2018

[4] [4]

Motion-appearance co- memory networks for video question answering,

J. Gao, R. Ge, K. Chen, and R. Nevatia, “Motion-appearance co- memory networks for video question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 6576–6585, 2018

work page 2018

[5] [5]

Movieqa: Understanding stories in movies through question-answering,

M. Tapaswi, Y . Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, “Movieqa: Understanding stories in movies through question-answering,” in Proceedings of the IEEE conference on com- puter vision and pattern recognition , pp. 4631–4640, 2016

work page 2016

[6] [6]

Uncovering the temporal context for video question answering,

L. Zhu, Z. Xu, Y . Yang, and A. G. Hauptmann, “Uncovering the temporal context for video question answering,” International Journal of Computer Vision , vol. 124, no. 3, pp. 409–421, 2017

work page 2017

[7] [7]

Tgif-qa: Towardt spatio- temporal reasoning in visual question answering,

Y . Jang, Y . Song, Y . Yu, Y . Kim, and G. Kim, “Tgif-qa: Towardt spatio- temporal reasoning in visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 2758–2766, 2017

work page 2017

[8] [8]

Video question answering via gradually reﬁned attention over ap- pearance and motion,

D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y . Zhuang, “Video question answering via gradually reﬁned attention over ap- pearance and motion,” in Proceedings of the 25th ACM international conference on Multimedia , pp. 1645–1653, ACM, 2017

work page 2017

[9] [9]

Marioqa: Answering questions by watching gameplay videos,

J. Mun, P. H. Seo, I. Jung, and B. Han, “Marioqa: Answering questions by watching gameplay videos,” in ICCV, 2017

work page 2017

[10] [10]

Two-stream convolutional networks for action recognition in videos,

K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, pp. 568–576, 2014

work page 2014

[11] [11]

Quo vadis, action recognition? a new model and the kinetics dataset,

J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 6299–6308, 2017

work page 2017

[12] [12]

Temporal segment networks: Towards good practices for deep action recognition,

L. Wang, Y . Xiong, Z. Wang, Y . Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European conference on computer vision , pp. 20–36, Springer, 2016

work page 2016

[13] [13]

Long short-term memory,

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997

work page 1997

[14] [14]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

J. Chung, C. Gulcehre, K. Cho, and Y . Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[15] [15]

Learning spatiotemporal features with 3d convolutional networks,

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Pro- ceedings of the IEEE international conference on computer vision , pp. 4489–4497, 2015

work page 2015

[16] [16]

Squeeze-and-excitation networks,

J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018

work page 2018

[17] [17]

Improving pairwise ranking for multi- label image classiﬁcation,

Y . Li, Y . Song, and J. Luo, “Improving pairwise ranking for multi- label image classiﬁcation,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 3617–3625, 2017

work page 2017

[18] [18]

ImageNet: A Large-Scale Hierarchical Image Database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009

work page 2009

[19] [19]

Identifying ﬁrst-person camera wearers in third-person videos,

C. Fan, J. Lee, M. Xu, K. Kumar Singh, Y . Jae Lee, D. J. Crandall, and M. S. Ryoo, “Identifying ﬁrst-person camera wearers in third-person videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5125–5133, 2017

work page 2017

[20] [20]

Regional attention based deep feature for image retrieval,

J. Kim and S.-E. Yoon, “Regional attention based deep feature for image retrieval,” in Proc. British Machine Vision Conference (BMVC 2018), 2018

work page 2018

[21] [21]

Cross-dimensional weighting for aggregated deep convolutional features,

Y . Kalantidis, C. Mellina, and S. Osindero, “Cross-dimensional weighting for aggregated deep convolutional features,” in European conference on computer vision , pp. 685–701, Springer, 2016

work page 2016

[22] [22]

Large-scale image retrieval with attentive deep local features,

H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” inProceedings of the IEEE International Conference on Computer Vision , pp. 3456–3465, 2017

work page 2017

[23] [23]

Bidirectional attention ﬂow for machine comprehension,

M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional attention ﬂow for machine comprehension,” ICLR, 2017

work page 2017

[24] [24]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 770–778, 2016

work page 2016

[25] [25]

Going deeper with convolutions,

C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 1–9, 2015

work page 2015

[26] [26]

Glove: Global vectors for word representation,

J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on em- pirical methods in natural language processing (EMNLP) , pp. 1532– 1543, 2014

work page 2014

[27] [27]

The Kinetics Human Action Video Dataset

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijaya- narasimhan, F. Viola, T. Green, T. Back, P. Natsev,et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[28] [28]

Rectiﬁed linear units improve restricted boltzmann machines,

V . Nair and G. E. Hinton, “Rectiﬁed linear units improve restricted boltzmann machines,” in Proceedings of the 27th international con- ference on machine learning (ICML-10) , pp. 807–814, 2010

work page 2010

[29] [29]

Bidirectional recurrent neural net- works,

M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net- works,” IEEE Transactions on Signal Processing , vol. 45, no. 11, pp. 2673–2681, 1997

work page 1997

[30] [30]

Empirical Evaluation of Rectified Activations in Convolutional Network

B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectiﬁed activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[31] [31]

An im- proved algorithm for tv-l 1 optical ﬂow,

A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers, “An im- proved algorithm for tv-l 1 optical ﬂow,” in Statistical and geometrical approaches to visual motion analysis , pp. 23–45, Springer, 2009

work page 2009

[32] [32]

Ensemble methods in machine learning,

T. G. Dietterich, “Ensemble methods in machine learning,” in Inter- national workshop on multiple classiﬁer systems , pp. 1–15, Springer, 2000

work page 2000

[33] [33]

Adam: A Method for Stochastic Optimization

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza- tion,” arXiv preprint arXiv:1412.6980 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[34] [34]

Overﬁtting in neural nets: Backpropagation, conjugate gradient, and early stopping,

R. Caruana, S. Lawrence, and C. L. Giles, “Overﬁtting in neural nets: Backpropagation, conjugate gradient, and early stopping,” in Advances in neural information processing systems , pp. 402–408, 2001

work page 2001

[35] [35]

Large-scale machine learning with stochastic gradient descent,

L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, pp. 177–186, Springer, 2010

work page 2010

[36] [36]

Tensorﬂow: A system for large-scale machine learning,

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. , “Tensorﬂow: A system for large-scale machine learning,” in 12th{USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pp. 265– 283, 2016

work page 2016

[37] [37]

Bottom-up and top-down attention for image captioning and visual question answering,

P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition , pp. 6077–6086, 2018

work page 2018

[38] [38]

Visual genome: Connecting language and vision using crowdsourced dense image annotations,

R. Krishna, Y . Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y . Kalantidis, L.-J. Li, D. A. Shamma, et al. , “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision , vol. 123, no. 1, pp. 32–73, 2017

work page 2017