Two-stream Spatiotemporal Feature for Video QA Task
Pith reviewed 2026-05-24 23:26 UTC · model grok-4.3
The pith
A two-stream network from action recognition serves as a spatiotemporal feature extractor that improves text-only video QA on TVQA.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a multi-channel neural network that adopts a two-stream network structure as a spatiotemporal video feature extractor for the video QA task. We adopt a squeeze-and-excitation structure for channel-wise attended spatiotemporal features. A context matching module with a level adjusting layer removes the information gap between visual and textual features using attention. A scoring mechanism and smoothed ranking loss select the correct answer. Evaluation on TVQA shows improved results in the textual only setting, but limitations and possibilities when using visual features.
What carries the argument
Two-stream network structure used as spatiotemporal video feature extractor with squeeze-and-excitation and context matching module.
If this is right
- The two-stream structure provides improved results in textual only video QA setting.
- Squeeze-and-excitation achieves channel-wise attended spatiotemporal feature.
- Context matching module with level adjusting layer enables joint modeling of visual and textual features.
- Scoring mechanism and smoothed ranking loss select the correct answer from candidates.
Where Pith is reading between the lines
- If the limitations with visual features are resolved, the approach may enable more robust multimodal video QA systems.
- The method could extend to other tasks requiring integration of video content with natural language questions.
- Further experiments on additional video QA datasets would test the general applicability of the two-stream adaptation.
Load-bearing premise
The two-stream network structure that works well for human action recognition can be directly adopted as a spatiotemporal feature extractor for video QA tasks.
What would settle it
An experiment replacing the two-stream extractor with a standard video encoder on the TVQA dataset and observing no improvement or worse performance in the textual setting would falsify the claim of effective adoption.
Figures
read the original abstract
Understanding the content of videos is one of the core techniques for developing various helpful applications in the real world, such as recognizing various human actions for surveillance systems or customer behavior analysis in an autonomous shop. However, understanding the content or story of the video still remains a challenging problem due to its sheer amount of data and temporal structure. In this paper, we propose a multi-channel neural network structure that adopts a two-stream network structure, which has been shown high performance in human action recognition field, and use it as a spatiotemporal video feature extractor for solving video question and answering task. We also adopt a squeeze-and-excitation structure to two-stream network structure for achieving a channel-wise attended spatiotemporal feature. For jointly modeling the spatiotemporal features from video and the textual features from the question, we design a context matching module with a level adjusting layer to remove the gap of information between visual and textual features by applying attention mechanism on joint modeling. Finally, we adopt a scoring mechanism and smoothed ranking loss objective function for selecting the correct answer from answer candidates. We evaluate our model with TVQA dataset, and our approach shows the improved result in textual only setting, but the result with visual feature shows the limitation and possibility of our approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a multi-channel neural network that adopts a two-stream architecture (previously successful in action recognition) as a spatiotemporal video feature extractor for video question answering. It augments this with squeeze-and-excitation blocks for channel-wise attention, a context matching module incorporating a level adjusting layer and attention to bridge visual-textual gaps, and a scoring mechanism with smoothed ranking loss. Evaluation is performed on the TVQA dataset, with the central claim being improved results in the textual-only setting alongside acknowledged limitations when visual features are included.
Significance. If the empirical gains are substantiated with quantitative evidence, the work could illustrate a viable transfer of two-stream spatiotemporal extractors to the video QA domain, offering a concrete architecture for joint modeling of temporal video structure and questions. The explicit reporting of both gains and limitations in the textual vs. visual regimes provides a balanced starting point for further multimodal research.
major comments (2)
- [Abstract] Abstract: the claim that the approach 'shows the improved result in textual only setting' supplies no quantitative numbers, error bars, baseline comparisons, or experimental details, leaving the central empirical assertion without supporting evidence.
- [Abstract] Abstract: the description of adopting the two-stream network 'without substantial additional adaptation' (as the weakest assumption) is not accompanied by any ablation or comparison showing what adaptations were in fact required, undermining assessment of whether the transfer is effective.
minor comments (1)
- [Abstract] The abstract uses the phrase 'multi-channel neural network structure' in a way that could be clarified relative to the two-stream component to avoid potential reader confusion.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that the approach 'shows the improved result in textual only setting' supplies no quantitative numbers, error bars, baseline comparisons, or experimental details, leaving the central empirical assertion without supporting evidence.
Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript reports experimental results on the TVQA dataset with baseline comparisons in the textual-only setting. In the revised version, we will update the abstract to include specific accuracy metrics and comparisons to support the claim of improvement. revision: yes
-
Referee: [Abstract] Abstract: the description of adopting the two-stream network 'without substantial additional adaptation' (as the weakest assumption) is not accompanied by any ablation or comparison showing what adaptations were in fact required, undermining assessment of whether the transfer is effective.
Authors: The provided abstract does not use the exact phrasing 'without substantial additional adaptation,' but describes direct adoption of the two-stream structure with added components (squeeze-and-excitation and context matching). To address the concern, we will add a brief discussion of the specific modifications required for the video QA task and consider including an ablation if space permits in the revision. revision: yes
Circularity Check
No significant circularity; empirical architecture proposal with direct testing
full rationale
The paper describes an empirical neural network proposal adopting a two-stream structure (previously successful in action recognition) as a video feature extractor for QA, plus squeeze-and-excitation, context matching, and a scoring loss. It evaluates on TVQA and explicitly reports both textual-only gains and limitations when using visual features. No equations, parameter-fitting steps, derivations, or self-citation chains are described that would reduce any claim to its own inputs by construction. The central claims are the architecture choice and the observed experimental outcomes, which are presented as tested rather than assumed or fitted. This matches the most common honest finding of a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
free parameters (1)
- network hyperparameters and training settings
axioms (2)
- domain assumption Two-stream networks extract effective spatiotemporal features from video
- domain assumption Attention mechanisms can remove the information gap between visual and textual features
Reference graph
Works this paper leans on
-
[1]
A read-write memory network for movie story understanding,
S. Na, S. Lee, J. Kim, and G. Kim, “A read-write memory network for movie story understanding,” in Proceedings of the IEEE International Conference on Computer Vision , pp. 677–685, 2017
work page 2017
-
[2]
Deepstory: Video story qa by deep embedded memory networks,
K.-M. Kim, M.-O. Heo, S.-H. Choi, and B.-T. Zhang, “Deepstory: Video story qa by deep embedded memory networks,” in IJCAI, 2017
work page 2017
-
[3]
Tvqa: Localized, composi- tional video question answering,
J. Lei, L. Yu, M. Bansal, and T. L. Berg, “Tvqa: Localized, composi- tional video question answering,” in EMNLP, 2018
work page 2018
-
[4]
Motion-appearance co- memory networks for video question answering,
J. Gao, R. Ge, K. Chen, and R. Nevatia, “Motion-appearance co- memory networks for video question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 6576–6585, 2018
work page 2018
-
[5]
Movieqa: Understanding stories in movies through question-answering,
M. Tapaswi, Y . Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, “Movieqa: Understanding stories in movies through question-answering,” in Proceedings of the IEEE conference on com- puter vision and pattern recognition , pp. 4631–4640, 2016
work page 2016
-
[6]
Uncovering the temporal context for video question answering,
L. Zhu, Z. Xu, Y . Yang, and A. G. Hauptmann, “Uncovering the temporal context for video question answering,” International Journal of Computer Vision , vol. 124, no. 3, pp. 409–421, 2017
work page 2017
-
[7]
Tgif-qa: Towardt spatio- temporal reasoning in visual question answering,
Y . Jang, Y . Song, Y . Yu, Y . Kim, and G. Kim, “Tgif-qa: Towardt spatio- temporal reasoning in visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 2758–2766, 2017
work page 2017
-
[8]
Video question answering via gradually refined attention over ap- pearance and motion,
D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y . Zhuang, “Video question answering via gradually refined attention over ap- pearance and motion,” in Proceedings of the 25th ACM international conference on Multimedia , pp. 1645–1653, ACM, 2017
work page 2017
-
[9]
Marioqa: Answering questions by watching gameplay videos,
J. Mun, P. H. Seo, I. Jung, and B. Han, “Marioqa: Answering questions by watching gameplay videos,” in ICCV, 2017
work page 2017
-
[10]
Two-stream convolutional networks for action recognition in videos,
K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, pp. 568–576, 2014
work page 2014
-
[11]
Quo vadis, action recognition? a new model and the kinetics dataset,
J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 6299–6308, 2017
work page 2017
-
[12]
Temporal segment networks: Towards good practices for deep action recognition,
L. Wang, Y . Xiong, Z. Wang, Y . Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European conference on computer vision , pp. 20–36, Springer, 2016
work page 2016
-
[13]
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997
work page 1997
-
[14]
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
J. Chung, C. Gulcehre, K. Cho, and Y . Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[15]
Learning spatiotemporal features with 3d convolutional networks,
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Pro- ceedings of the IEEE international conference on computer vision , pp. 4489–4497, 2015
work page 2015
-
[16]
Squeeze-and-excitation networks,
J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141, 2018
work page 2018
-
[17]
Improving pairwise ranking for multi- label image classification,
Y . Li, Y . Song, and J. Luo, “Improving pairwise ranking for multi- label image classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 3617–3625, 2017
work page 2017
-
[18]
ImageNet: A Large-Scale Hierarchical Image Database,
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009
work page 2009
-
[19]
Identifying first-person camera wearers in third-person videos,
C. Fan, J. Lee, M. Xu, K. Kumar Singh, Y . Jae Lee, D. J. Crandall, and M. S. Ryoo, “Identifying first-person camera wearers in third-person videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5125–5133, 2017
work page 2017
-
[20]
Regional attention based deep feature for image retrieval,
J. Kim and S.-E. Yoon, “Regional attention based deep feature for image retrieval,” in Proc. British Machine Vision Conference (BMVC 2018), 2018
work page 2018
-
[21]
Cross-dimensional weighting for aggregated deep convolutional features,
Y . Kalantidis, C. Mellina, and S. Osindero, “Cross-dimensional weighting for aggregated deep convolutional features,” in European conference on computer vision , pp. 685–701, Springer, 2016
work page 2016
-
[22]
Large-scale image retrieval with attentive deep local features,
H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” inProceedings of the IEEE International Conference on Computer Vision , pp. 3456–3465, 2017
work page 2017
-
[23]
Bidirectional attention flow for machine comprehension,
M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional attention flow for machine comprehension,” ICLR, 2017
work page 2017
-
[24]
Deep residual learning for image recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 770–778, 2016
work page 2016
-
[25]
Going deeper with convolutions,
C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 1–9, 2015
work page 2015
-
[26]
Glove: Global vectors for word representation,
J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on em- pirical methods in natural language processing (EMNLP) , pp. 1532– 1543, 2014
work page 2014
-
[27]
The Kinetics Human Action Video Dataset
W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijaya- narasimhan, F. Viola, T. Green, T. Back, P. Natsev,et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[28]
Rectified linear units improve restricted boltzmann machines,
V . Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international con- ference on machine learning (ICML-10) , pp. 807–814, 2010
work page 2010
-
[29]
Bidirectional recurrent neural net- works,
M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net- works,” IEEE Transactions on Signal Processing , vol. 45, no. 11, pp. 2673–2681, 1997
work page 1997
-
[30]
Empirical Evaluation of Rectified Activations in Convolutional Network
B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[31]
An im- proved algorithm for tv-l 1 optical flow,
A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers, “An im- proved algorithm for tv-l 1 optical flow,” in Statistical and geometrical approaches to visual motion analysis , pp. 23–45, Springer, 2009
work page 2009
-
[32]
Ensemble methods in machine learning,
T. G. Dietterich, “Ensemble methods in machine learning,” in Inter- national workshop on multiple classifier systems , pp. 1–15, Springer, 2000
work page 2000
-
[33]
Adam: A Method for Stochastic Optimization
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza- tion,” arXiv preprint arXiv:1412.6980 , 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[34]
Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,
R. Caruana, S. Lawrence, and C. L. Giles, “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,” in Advances in neural information processing systems , pp. 402–408, 2001
work page 2001
-
[35]
Large-scale machine learning with stochastic gradient descent,
L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, pp. 177–186, Springer, 2010
work page 2010
-
[36]
Tensorflow: A system for large-scale machine learning,
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. , “Tensorflow: A system for large-scale machine learning,” in 12th{USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pp. 265– 283, 2016
work page 2016
-
[37]
Bottom-up and top-down attention for image captioning and visual question answering,
P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition , pp. 6077–6086, 2018
work page 2018
-
[38]
Visual genome: Connecting language and vision using crowdsourced dense image annotations,
R. Krishna, Y . Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y . Kalantidis, L.-J. Li, D. A. Shamma, et al. , “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision , vol. 123, no. 1, pp. 32–73, 2017
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.