pith. sign in

arxiv: 1907.05006 · v1 · pith:X6RUQWIVnew · submitted 2019-07-11 · 💻 cs.CV

Two-stream Spatiotemporal Feature for Video QA Task

Pith reviewed 2026-05-24 23:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords two-stream networkspatiotemporal featuresvideo question answeringTVQAsqueeze-and-excitationcontext matchingsmoothed ranking loss
0
0 comments X

The pith

A two-stream network from action recognition serves as a spatiotemporal feature extractor that improves text-only video QA on TVQA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to show that a two-stream network structure, successful in action recognition, can be adapted as a spatiotemporal video feature extractor for video question answering. By incorporating squeeze-and-excitation for channel attention and a context matching module to bridge visual and textual features, the model jointly processes video and question data. It uses a scoring mechanism with smoothed ranking loss to pick the correct answer. Tests on the TVQA dataset indicate gains in text-only mode, pointing to both potential and limits when visual features are included.

Core claim

We propose a multi-channel neural network that adopts a two-stream network structure as a spatiotemporal video feature extractor for the video QA task. We adopt a squeeze-and-excitation structure for channel-wise attended spatiotemporal features. A context matching module with a level adjusting layer removes the information gap between visual and textual features using attention. A scoring mechanism and smoothed ranking loss select the correct answer. Evaluation on TVQA shows improved results in the textual only setting, but limitations and possibilities when using visual features.

What carries the argument

Two-stream network structure used as spatiotemporal video feature extractor with squeeze-and-excitation and context matching module.

If this is right

  • The two-stream structure provides improved results in textual only video QA setting.
  • Squeeze-and-excitation achieves channel-wise attended spatiotemporal feature.
  • Context matching module with level adjusting layer enables joint modeling of visual and textual features.
  • Scoring mechanism and smoothed ranking loss select the correct answer from candidates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the limitations with visual features are resolved, the approach may enable more robust multimodal video QA systems.
  • The method could extend to other tasks requiring integration of video content with natural language questions.
  • Further experiments on additional video QA datasets would test the general applicability of the two-stream adaptation.

Load-bearing premise

The two-stream network structure that works well for human action recognition can be directly adopted as a spatiotemporal feature extractor for video QA tasks.

What would settle it

An experiment replacing the two-stream extractor with a standard video encoder on the TVQA dataset and observing no improvement or worse performance in the textual setting would falsify the claim of effective adoption.

Figures

Figures reproduced from arXiv: 1907.05006 by Chiwan Song, Sung-Eui Yoon, Woobin Im.

Figure 1
Figure 1. Figure 1: The figure of our multi-channel neural network structure with two-stream spatiotemporal video feature extractor [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Our two-stream I3D with the Squeeze-and-Excitation structure. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Our Squeeze-and-Excitation structure focused on related objects or actions. This kind of gap between two modalities raise a difficulty in context matching stage where visual and textual information is merged. To make our feature extractor concentrate more on the crucial objects in the video frames, we utilize the Squeeze￾and-Excitation (SE) structure [16] and integrate it within the two-stream I3D, where a… view at source ↗
read the original abstract

Understanding the content of videos is one of the core techniques for developing various helpful applications in the real world, such as recognizing various human actions for surveillance systems or customer behavior analysis in an autonomous shop. However, understanding the content or story of the video still remains a challenging problem due to its sheer amount of data and temporal structure. In this paper, we propose a multi-channel neural network structure that adopts a two-stream network structure, which has been shown high performance in human action recognition field, and use it as a spatiotemporal video feature extractor for solving video question and answering task. We also adopt a squeeze-and-excitation structure to two-stream network structure for achieving a channel-wise attended spatiotemporal feature. For jointly modeling the spatiotemporal features from video and the textual features from the question, we design a context matching module with a level adjusting layer to remove the gap of information between visual and textual features by applying attention mechanism on joint modeling. Finally, we adopt a scoring mechanism and smoothed ranking loss objective function for selecting the correct answer from answer candidates. We evaluate our model with TVQA dataset, and our approach shows the improved result in textual only setting, but the result with visual feature shows the limitation and possibility of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a multi-channel neural network that adopts a two-stream architecture (previously successful in action recognition) as a spatiotemporal video feature extractor for video question answering. It augments this with squeeze-and-excitation blocks for channel-wise attention, a context matching module incorporating a level adjusting layer and attention to bridge visual-textual gaps, and a scoring mechanism with smoothed ranking loss. Evaluation is performed on the TVQA dataset, with the central claim being improved results in the textual-only setting alongside acknowledged limitations when visual features are included.

Significance. If the empirical gains are substantiated with quantitative evidence, the work could illustrate a viable transfer of two-stream spatiotemporal extractors to the video QA domain, offering a concrete architecture for joint modeling of temporal video structure and questions. The explicit reporting of both gains and limitations in the textual vs. visual regimes provides a balanced starting point for further multimodal research.

major comments (2)
  1. [Abstract] Abstract: the claim that the approach 'shows the improved result in textual only setting' supplies no quantitative numbers, error bars, baseline comparisons, or experimental details, leaving the central empirical assertion without supporting evidence.
  2. [Abstract] Abstract: the description of adopting the two-stream network 'without substantial additional adaptation' (as the weakest assumption) is not accompanied by any ablation or comparison showing what adaptations were in fact required, undermining assessment of whether the transfer is effective.
minor comments (1)
  1. [Abstract] The abstract uses the phrase 'multi-channel neural network structure' in a way that could be clarified relative to the two-stream component to avoid potential reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the approach 'shows the improved result in textual only setting' supplies no quantitative numbers, error bars, baseline comparisons, or experimental details, leaving the central empirical assertion without supporting evidence.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript reports experimental results on the TVQA dataset with baseline comparisons in the textual-only setting. In the revised version, we will update the abstract to include specific accuracy metrics and comparisons to support the claim of improvement. revision: yes

  2. Referee: [Abstract] Abstract: the description of adopting the two-stream network 'without substantial additional adaptation' (as the weakest assumption) is not accompanied by any ablation or comparison showing what adaptations were in fact required, undermining assessment of whether the transfer is effective.

    Authors: The provided abstract does not use the exact phrasing 'without substantial additional adaptation,' but describes direct adoption of the two-stream structure with added components (squeeze-and-excitation and context matching). To address the concern, we will add a brief discussion of the specific modifications required for the video QA task and consider including an ablation if space permits in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical architecture proposal with direct testing

full rationale

The paper describes an empirical neural network proposal adopting a two-stream structure (previously successful in action recognition) as a video feature extractor for QA, plus squeeze-and-excitation, context matching, and a scoring loss. It evaluates on TVQA and explicitly reports both textual-only gains and limitations when using visual features. No equations, parameter-fitting steps, derivations, or self-citation chains are described that would reduce any claim to its own inputs by construction. The central claims are the architecture choice and the observed experimental outcomes, which are presented as tested rather than assumed or fitted. This matches the most common honest finding of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The model rests on standard deep learning assumptions about feature transfer from action recognition to QA and the effectiveness of attention for multimodal alignment; many typical neural network hyperparameters are implicitly free parameters but not enumerated in the abstract.

free parameters (1)
  • network hyperparameters and training settings
    Standard in neural network models; learning rates, layer dimensions, and optimization choices are chosen or fitted but not detailed in the abstract.
axioms (2)
  • domain assumption Two-stream networks extract effective spatiotemporal features from video
    Invoked when adopting the structure from action recognition for video QA.
  • domain assumption Attention mechanisms can remove the information gap between visual and textual features
    Central to the context matching module design.

pith-pipeline@v0.9.0 · 5744 in / 1355 out tokens · 29191 ms · 2026-05-24T23:26:59.037339+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 4 internal anchors

  1. [1]

    A read-write memory network for movie story understanding,

    S. Na, S. Lee, J. Kim, and G. Kim, “A read-write memory network for movie story understanding,” in Proceedings of the IEEE International Conference on Computer Vision , pp. 677–685, 2017

  2. [2]

    Deepstory: Video story qa by deep embedded memory networks,

    K.-M. Kim, M.-O. Heo, S.-H. Choi, and B.-T. Zhang, “Deepstory: Video story qa by deep embedded memory networks,” in IJCAI, 2017

  3. [3]

    Tvqa: Localized, composi- tional video question answering,

    J. Lei, L. Yu, M. Bansal, and T. L. Berg, “Tvqa: Localized, composi- tional video question answering,” in EMNLP, 2018

  4. [4]

    Motion-appearance co- memory networks for video question answering,

    J. Gao, R. Ge, K. Chen, and R. Nevatia, “Motion-appearance co- memory networks for video question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 6576–6585, 2018

  5. [5]

    Movieqa: Understanding stories in movies through question-answering,

    M. Tapaswi, Y . Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, “Movieqa: Understanding stories in movies through question-answering,” in Proceedings of the IEEE conference on com- puter vision and pattern recognition , pp. 4631–4640, 2016

  6. [6]

    Uncovering the temporal context for video question answering,

    L. Zhu, Z. Xu, Y . Yang, and A. G. Hauptmann, “Uncovering the temporal context for video question answering,” International Journal of Computer Vision , vol. 124, no. 3, pp. 409–421, 2017

  7. [7]

    Tgif-qa: Towardt spatio- temporal reasoning in visual question answering,

    Y . Jang, Y . Song, Y . Yu, Y . Kim, and G. Kim, “Tgif-qa: Towardt spatio- temporal reasoning in visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 2758–2766, 2017

  8. [8]

    Video question answering via gradually refined attention over ap- pearance and motion,

    D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y . Zhuang, “Video question answering via gradually refined attention over ap- pearance and motion,” in Proceedings of the 25th ACM international conference on Multimedia , pp. 1645–1653, ACM, 2017

  9. [9]

    Marioqa: Answering questions by watching gameplay videos,

    J. Mun, P. H. Seo, I. Jung, and B. Han, “Marioqa: Answering questions by watching gameplay videos,” in ICCV, 2017

  10. [10]

    Two-stream convolutional networks for action recognition in videos,

    K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, pp. 568–576, 2014

  11. [11]

    Quo vadis, action recognition? a new model and the kinetics dataset,

    J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 6299–6308, 2017

  12. [12]

    Temporal segment networks: Towards good practices for deep action recognition,

    L. Wang, Y . Xiong, Z. Wang, Y . Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European conference on computer vision , pp. 20–36, Springer, 2016

  13. [13]

    Long short-term memory,

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997

  14. [14]

    Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

    J. Chung, C. Gulcehre, K. Cho, and Y . Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014

  15. [15]

    Learning spatiotemporal features with 3d convolutional networks,

    D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Pro- ceedings of the IEEE international conference on computer vision , pp. 4489–4497, 2015

  16. [16]

    Squeeze-and-excitation networks,

    J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018

  17. [17]

    Improving pairwise ranking for multi- label image classification,

    Y . Li, Y . Song, and J. Luo, “Improving pairwise ranking for multi- label image classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 3617–3625, 2017

  18. [18]

    ImageNet: A Large-Scale Hierarchical Image Database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009

  19. [19]

    Identifying first-person camera wearers in third-person videos,

    C. Fan, J. Lee, M. Xu, K. Kumar Singh, Y . Jae Lee, D. J. Crandall, and M. S. Ryoo, “Identifying first-person camera wearers in third-person videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5125–5133, 2017

  20. [20]

    Regional attention based deep feature for image retrieval,

    J. Kim and S.-E. Yoon, “Regional attention based deep feature for image retrieval,” in Proc. British Machine Vision Conference (BMVC 2018), 2018

  21. [21]

    Cross-dimensional weighting for aggregated deep convolutional features,

    Y . Kalantidis, C. Mellina, and S. Osindero, “Cross-dimensional weighting for aggregated deep convolutional features,” in European conference on computer vision , pp. 685–701, Springer, 2016

  22. [22]

    Large-scale image retrieval with attentive deep local features,

    H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” inProceedings of the IEEE International Conference on Computer Vision , pp. 3456–3465, 2017

  23. [23]

    Bidirectional attention flow for machine comprehension,

    M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional attention flow for machine comprehension,” ICLR, 2017

  24. [24]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 770–778, 2016

  25. [25]

    Going deeper with convolutions,

    C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 1–9, 2015

  26. [26]

    Glove: Global vectors for word representation,

    J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on em- pirical methods in natural language processing (EMNLP) , pp. 1532– 1543, 2014

  27. [27]

    The Kinetics Human Action Video Dataset

    W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijaya- narasimhan, F. Viola, T. Green, T. Back, P. Natsev,et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950 , 2017

  28. [28]

    Rectified linear units improve restricted boltzmann machines,

    V . Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international con- ference on machine learning (ICML-10) , pp. 807–814, 2010

  29. [29]

    Bidirectional recurrent neural net- works,

    M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net- works,” IEEE Transactions on Signal Processing , vol. 45, no. 11, pp. 2673–2681, 1997

  30. [30]

    Empirical Evaluation of Rectified Activations in Convolutional Network

    B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015

  31. [31]

    An im- proved algorithm for tv-l 1 optical flow,

    A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers, “An im- proved algorithm for tv-l 1 optical flow,” in Statistical and geometrical approaches to visual motion analysis , pp. 23–45, Springer, 2009

  32. [32]

    Ensemble methods in machine learning,

    T. G. Dietterich, “Ensemble methods in machine learning,” in Inter- national workshop on multiple classifier systems , pp. 1–15, Springer, 2000

  33. [33]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza- tion,” arXiv preprint arXiv:1412.6980 , 2014

  34. [34]

    Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,

    R. Caruana, S. Lawrence, and C. L. Giles, “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,” in Advances in neural information processing systems , pp. 402–408, 2001

  35. [35]

    Large-scale machine learning with stochastic gradient descent,

    L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, pp. 177–186, Springer, 2010

  36. [36]

    Tensorflow: A system for large-scale machine learning,

    M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. , “Tensorflow: A system for large-scale machine learning,” in 12th{USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pp. 265– 283, 2016

  37. [37]

    Bottom-up and top-down attention for image captioning and visual question answering,

    P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition , pp. 6077–6086, 2018

  38. [38]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations,

    R. Krishna, Y . Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y . Kalantidis, L.-J. Li, D. A. Shamma, et al. , “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision , vol. 123, no. 1, pp. 32–73, 2017