Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks

Jingkuan Song; Xiaofei He; Zhijie Lin; Zhou Zhao; Zhu Zhang

arxiv: 1906.12158 · v1 · pith:S526QSSWnew · submitted 2019-06-28 · 💻 cs.CV · cs.LG

Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Xiaofei He

Pith reviewed 2026-05-25 13:43 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords video question answering · long-form video · self-attention · hierarchical convolutional network · encoder-decoder model · multi-scale decoder · open-ended QA

The pith

A hierarchical convolutional self-attention encoder-decoder models long-form videos for open-ended question answering by capturing question-aware dependencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the HCSA network to address limitations of recurrent encoder-decoder models when applied to long videos in open-ended question answering. Recurrent approaches face heavy computational costs and weak long-range dependency modeling as video length grows. The proposed method builds a hierarchical structure in the encoder to process video sequences efficiently while incorporating question information into self-attention layers. The decoder then draws on multiple encoder layers at different scales to generate answers without losing details from deeper representations. If effective, this yields both higher accuracy and lower runtime on extended video content compared with prior recurrent baselines.

Core claim

The central claim is that the Hierarchical Convolutional Self-Attention encoder-decoder network efficiently models long-form video contents by constructing a hierarchical structure and capturing question-aware long-range dependencies from video context, while the multi-scale attentive decoder incorporates representations from multiple encoder layers to avoid information loss at the top layer, thereby solving the computational and modeling shortcomings of recurrent networks for open-ended long-form video question answering.

What carries the argument

The Hierarchical Convolutional Self-Attention (HCSA) encoder-decoder, which stacks convolutional self-attention blocks in a hierarchy to build question-conditioned representations across video scales and feeds them through a multi-scale decoder.
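To make the shape of that encoder concrete, here is a minimal sketch in plain Python. It is an editorial illustration, not the authors' implementation: the abstract does not give the layer equations, so the neighbour-averaging "convolution", the additive way the question vector shifts each attention query, and the pairwise merge that halves the sequence are all assumptions, and every function name (`local_conv`, `question_aware_self_attention`, `merge_pairs`, `encode`) is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def local_conv(frames, kernel=3):
    """Average each frame with its neighbours (stand-in for a learned 1-D convolution)."""
    half = kernel // 2
    out = []
    for i in range(len(frames)):
        window = frames[max(0, i - half): i + half + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def question_aware_self_attention(frames, question):
    """Self-attention over frames; each query is shifted by the question vector (assumed)."""
    scale = math.sqrt(len(question))
    out = []
    for frame in frames:
        query = [x + q for x, q in zip(frame, question)]
        weights = softmax([dot(query, key) / scale for key in frames])
        out.append([sum(w * v[d] for w, v in zip(weights, frames))
                    for d in range(len(frame))])
    return out


def merge_pairs(frames):
    """Halve the sequence by averaging adjacent frames (assumed downsampling rule)."""
    out = []
    for i in range(0, len(frames), 2):
        pair = frames[i:i + 2]
        out.append([sum(col) / len(pair) for col in zip(*pair)])
    return out


def encode(frames, question, num_layers=3):
    """Return the output of every layer, finest first, coarsest last."""
    layers = []
    current = frames
    for _ in range(num_layers):
        current = question_aware_self_attention(local_conv(current), question)
        layers.append(current)
        current = merge_pairs(current)
    return layers


frames = [[float(i), 1.0] for i in range(8)]
layers = encode(frames, question=[0.5, -0.5])
print([len(layer) for layer in layers])  # [8, 4, 2]
```

Keeping every layer's output, rather than only the last one, is what lets a decoder later read the video at several temporal scales.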

If this is right

Longer video sequences become feasible to process because the encoder no longer needs one sequential recurrent step per frame and is less exposed to vanishing gradients.
Question conditioning inside the attention layers produces video representations that are already aligned with the query before decoding begins.
Multi-layer fusion in the decoder reduces the risk that only the final encoder state is used for answer generation.
Overall inference speed improves enough to support real-time or near-real-time question answering on extended clips.
The same hierarchical pattern can be applied to other video understanding tasks that require both local detail and global context.
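The multi-layer fusion point can be illustrated with a single decoding step. The sketch below is a guess at the general idea, assuming plain dot-product attention and concatenation of the per-layer context vectors; the paper's actual scoring function and fusion rule are not stated in the abstract, and `attend` and `multi_scale_context` are hypothetical names.

```python
import math


def attend(state, memory):
    """Dot-product attention of one decoder state over one encoder layer."""
    scale = math.sqrt(len(state))
    scores = [sum(s * m for s, m in zip(state, vec)) / scale for vec in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * vec[d] for w, vec in zip(weights, memory))
               for d in range(len(state))]
    return context, weights


def multi_scale_context(state, encoder_layers):
    """Attend over every encoder layer and concatenate the contexts (assumed fusion)."""
    fused = []
    for memory in encoder_layers:
        context, _ = attend(state, memory)
        fused.extend(context)
    return fused


# Three toy encoder layers of decreasing temporal resolution.
encoder_layers = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.5, 0.5]],
]
fused = multi_scale_context([1.0, 0.0], encoder_layers)
print(len(fused))  # 6
```

Because the fused vector contains a slice from each layer, fine-grained detail from the lower layers reaches the answer generator even when the top layer has compressed it away.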

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The architecture may transfer to long audio or text sequences where similar dependency and efficiency trade-offs appear.
If the convolutional hierarchy proves stable, it could reduce the need for very deep recurrent stacks in multimodal systems.
A direct test on videos several times longer than current benchmarks would reveal whether the hierarchy continues to scale without additional modifications.
Replacing the convolutional blocks with other local operators could test how much the specific choice of convolution matters versus the overall hierarchy.

Load-bearing premise

The hierarchical convolutional self-attention layers can capture long-range video dependencies at least as well as recurrent layers while avoiding their sequential computation cost, and the multi-scale decoder recovers any details lost in the top encoder layer.
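One way to see what "avoiding sequential computation cost" means is to count dependent steps. The arithmetic below assumes, purely for illustration, that each level of the hierarchy shrinks the sequence by a fixed factor (2 by default); the abstract does not state the reduction factor, and the count says nothing about the work done inside each level, which is a separate cost.

```python
def recurrent_steps(num_frames):
    """A recurrent encoder processes frames one after another."""
    return num_frames


def hierarchy_levels(num_frames, reduction=2):
    """Levels needed until a sequence shrunk by `reduction` per level has one element."""
    levels = 0
    length = num_frames
    while length > 1:
        length = -(-length // reduction)  # ceiling division
        levels += 1
    return levels


for n in (64, 1024, 4096):
    print(n, recurrent_steps(n), hierarchy_levels(n))
# 64 64 6
# 1024 1024 10
# 4096 4096 12
```

Under this assumption the number of sequential stages grows logarithmically with video length, whereas the recurrent count grows linearly; whether that translates into lower wall-clock time depends on the per-level cost, which only the paper's experiments can show.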

What would settle it

A head-to-head evaluation on a long-form video QA benchmark where the HCSA model shows either lower answer accuracy or higher wall-clock time than a comparable recurrent baseline when video length exceeds a few minutes.

Figures

Figures reproduced from arXiv: 1906.12158 by Jingkuan Song, Xiaofei He, Zhijie Lin, Zhou Zhao, Zhu Zhang.

**Figure 2.** The Framework of Hierarchical Convolutional Self-Attention Encoder-Decoder Networks.

**Figure 3.** The Multi-Scale Attention Results in the Decoder.

Original abstract

Open-ended video question answering aims to automatically generate the natural-language answer from referenced video contents according to the given question. Currently, most existing approaches focus on short-form video question answering with multi-modal recurrent encoder-decoder networks. Although these works have achieved promising performance, they may still be ineffectively applied to long-form video question answering due to the lack of long-range dependency modeling and the suffering from the heavy computational cost. To tackle these problems, we propose a fast Hierarchical Convolutional Self-Attention encoder-decoder network (HCSA). Concretely, we first develop a hierarchical convolutional self-attention encoder to efficiently model long-form video contents, which builds the hierarchical structure for video sequences and captures question-aware long-range dependencies from video context. We then devise a multi-scale attentive decoder to incorporate multi-layer video representations for answer generation, which avoids the information missing of the top encoder layer. The extensive experiments show the effectiveness and efficiency of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

HCSA proposes a hierarchical conv self-attention encoder-decoder for long-form video QA to fix RNN scaling issues, but the abstract gives no numbers or details to check if it works.

The main takeaway is a new architecture called HCSA that stacks convolutional self-attention in a hierarchy for the encoder and adds a multi-scale attentive decoder. The goal is to capture question-aware long-range dependencies in longer videos without the compute and dependency problems that come with recurrent models. The paper spells out the motivation clearly and describes how the hierarchy builds multi-scale video representations while the decoder pulls from multiple layers to avoid dropping top-level information. That part is straightforward and addresses a genuine pain point in the subfield. The design itself looks like a reasonable incremental step beyond standard encoder-decoder RNNs for video QA.

The soft spot is that all the weight rests on the claim of effectiveness and efficiency from extensive experiments, yet the abstract supplies zero numbers, baselines, dataset details, or ablation results. Without those, there is no way to tell whether the gains are real, consistent, or large enough to matter. The central assumption that the structure actually delivers better long-range modeling at lower cost remains untested from what is visible.

This paper is for people already working on video question answering and multimodal sequence models. It is narrow enough that most outside that niche will not need it. The problem is real and the proposal is concrete enough that a serious editor should send it to referees rather than desk reject, so the experiments can be examined directly.

Referee Report

0 major / 1 minor

Summary. The paper proposes a Hierarchical Convolutional Self-Attention (HCSA) encoder-decoder network for open-ended long-form video question answering. It claims that recurrent models struggle with long-range dependencies and computational cost on long videos; the HCSA encoder builds a hierarchical structure to efficiently model video sequences and capture question-aware long-range dependencies, while the multi-scale attentive decoder incorporates multi-layer representations to avoid information loss from the top encoder layer. The authors assert that extensive experiments demonstrate the method's effectiveness and efficiency.

Significance. If the experimental claims hold, the work could offer a scalable, non-recurrent alternative for long-form video QA tasks, addressing a recognized limitation in existing multi-modal encoder-decoder approaches. The hierarchical convolutional self-attention and multi-scale decoder represent targeted architectural choices that, if shown to deliver measurable gains in dependency modeling and runtime, would be of interest to the video understanding community.

minor comments (1)

The abstract asserts 'extensive experiments' but provides no quantitative results, baselines, or dataset details; adding at least one key performance metric or comparison would strengthen the summary.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their time and for providing a concise summary of our manuscript on the Hierarchical Convolutional Self-Attention (HCSA) encoder-decoder for open-ended long-form video question answering. We note that the report lists no specific major comments despite the 'uncertain' recommendation. We are prepared to address any concrete concerns about the experimental claims, architecture details, or comparisons if they are provided in a revised report.

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces a new HCSA encoder-decoder architecture as an explicit proposal to address RNN limitations in long-form video QA. The encoder is described as building a hierarchical structure to capture question-aware dependencies, and the decoder as using multi-scale representations to avoid information loss; these are design choices motivated by stated problems rather than derived from equations or prior results. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. Effectiveness is asserted via experiments, making the chain self-contained as an architectural contribution without internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that recurrent models fail on long videos due to dependency and cost issues, plus the new invented entity of the HCSA architecture itself; no free parameters are identifiable from the abstract.

axioms (1)

domain assumption: Recurrent models suffer from long-range dependency issues and high computational cost in long videos.
Explicitly stated in the abstract as the core motivation for the new method.

invented entities (1)

HCSA network (no independent evidence)
purpose: Efficiently model long-form video contents with question-aware long-range dependencies via hierarchical convolution and self-attention.
New architecture proposed in the paper with no independent evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5704 in / 1121 out tokens · 27344 ms · 2026-05-25T13:43:47.589629+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 4 internal anchors

[1] [Anderson et al., 2018] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077–6086, 2018.

[2] [Antol et al., 2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages 2425–2433, 2015.

[3] [Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[4] [Fellbaum, 1998] Christiane Fellbaum. WordNet. Wiley Online Library, 1998.

[5] [Gao and Han, 2017] Kun Gao and Yahong Han. Spatio-temporal context networks for video question answering. In Pacific Rim Conference on Multimedia, pages 108–118. Springer, 2017.

[6] [Gao et al., 2018] Jiyang Gao, Runzhou Ge, Kan Chen, and Ram Nevatia. Motion-appearance co-memory networks for video question answering. CVPR, 2018.

[7] [Gehring et al., 2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.

[8] [Jang et al., 2017] Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In CVPR, pages 2680–8, 2017.

[9] [Krishna et al., 2017] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In ICCV, pages 706–715, 2017.

[10] [Li and Jia, 2016] Ruiyu Li and Jiaya Jia. Visual question answering with question representation update (QRU). In NIPS, pages 4655–4663, 2016.

[11] [Lu et al., 2016] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, pages 289–297, 2016.

[12] [Maharaj et al., 2017] Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron C Courville, and Christopher Joseph Pal. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. In CVPR, pages 7359–7368, 2017.

[13] [Malinowski and Fritz, 2014] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, pages 1682–1690, 2014.

[14] [Mazaheri et al., 2016] Amir Mazaheri, Dong Zhang, and Mubarak Shah. Video fill in the blank with merging LSTMs. arXiv preprint arXiv:1610.04062, 2016.

[15] [Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[16] [Tapaswi et al., 2016] Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. MovieQA: Understanding stories in movies through question-answering. In CVPR, pages 4631–4640, 2016.

[17] [Tran et al., 2015] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489–4497, 2015.

[18] [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.

[19] [Wu and Palmer, 1994] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 133–138. ACL, 1994.

[20] [Wu et al., 2017] Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 163:21–40, 2017.

[21] [Xue et al., 2017] Hongyang Xue, Zhou Zhao, and Deng Cai. Unifying the video and question attentions for open-ended video question answering. IEEE Transactions on Image Processing, 26(12):5656–5666, 2017.

[22] [Zeng et al., 2017] Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, and Min Sun. Leveraging video descriptions to learn video question answering. In AAAI, pages 4334–4340, 2017.

[23] [Zhao et al., 2018] Zhou Zhao, Zhu Zhang, Shuwen Xiao, Zhou Yu, Jun Yu, Deng Cai, Fei Wu, and Yueting Zhuang. Open-ended long-form video question answering via adaptive hierarchical reinforced networks. In IJCAI, pages 3683–3689, 2018.

[24] [Zhao et al., 2019] Zhou Zhao, Zhu Zhang, Xinghua Jiang, and Deng Cai. Multi-turn video question answering via hierarchical attention context reinforced networks. IEEE Transactions on Image Processing, 2019.

[25] [Zhu et al., 2017] Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G Hauptmann. Uncovering the temporal context for video question answering. IJCV, 124(3):409–421, 2017.
