Deep Modular Co-Attention Networks for Visual Question Answering

Dacheng Tao; Jun Yu; Qi Tian; Yuhao Cui; Zhou Yu

Review: 2 major objections, 2 minor, 37 references

Cascading modular co-attention layers achieves 70.63 percent accuracy on visual question answering.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-25 16:17 UTC pith:XHLBZ4HZ

load-bearing objection: MCAN shows that cascading modular co-attention layers lifts single-model accuracy to 70.63% on VQA-v2 test-dev where shallow models stalled.

arxiv 1906.10770 v1 pith:XHLBZ4HZ submitted 2019-06-25 cs.CV

Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, Qi Tian

classification cs.CV

keywords visual question answering · co-attention · modular networks · deep attention models · VQA-v2 · multimodal learning · attention mechanisms

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a deep Modular Co-Attention Network built from stacked MCA layers for VQA. Each layer uses a modular combination of two basic attention units to handle self-attention within questions and images plus guided attention across modalities. This design is shown to capture fine-grained word-object associations more effectively than prior shallow co-attention approaches. The result is a substantial performance lift on the VQA-v2 benchmark, where the best single model reaches 70.63 percent overall accuracy on the test-dev set. A reader would care because the work demonstrates that depth in modular attention can advance multimodal reasoning without requiring entirely new fusion architectures.

Core claim

The central claim is that cascading Modular Co-Attention (MCA) layers in depth, where each MCA layer jointly models question self-attention, image self-attention, and question-guided image attention through modular composition of two basic attention units, produces significantly better fine-grained cross-modal associations than shallow co-attention models and reaches 70.63 percent overall accuracy on the VQA-v2 test-dev set.

What carries the argument

The Modular Co-Attention (MCA) layer, which performs self-attention on questions and images plus guided attention using modular composition of two basic attention units.
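
The layer described above can be illustrated with a small sketch. It is a simplification under stated assumptions, not the authors' implementation: attention here is single-head scaled dot-product attention with no learned projections, no feed-forward sublayer, and no layer normalization, and the function names (`sa_unit`, `ga_unit`, `mca_layer`, `mcan_stack`) are hypothetical. It only shows how a self-attention unit and a guided-attention unit compose into one layer, and how layers cascade in depth.

```python
import math


def softmax(row):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    # Scaled dot-product attention: each query row becomes a
    # softmax-weighted mix of the value rows.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add_residual(x, update):
    # Element-wise residual connection (assumed here for illustration).
    return [[a + b for a, b in zip(xr, ur)] for xr, ur in zip(x, update)]


def sa_unit(x):
    # Self-attention unit: features of one modality attend to themselves.
    return add_residual(x, attend(x, x, x))


def ga_unit(x, y):
    # Guided-attention unit: features x are queries, features y guide them.
    return add_residual(x, attend(x, y, y))


def mca_layer(image, question):
    # One layer: question self-attention, image self-attention,
    # then question-guided attention over the image features.
    question = sa_unit(question)
    image = sa_unit(image)
    image = ga_unit(image, question)
    return image, question


def mcan_stack(image, question, depth):
    # Cascade the same kind of layer `depth` times.
    for _ in range(depth):
        image, question = mca_layer(image, question)
    return image, question
```

Calling `mcan_stack(image, question, depth=2)` on two lists of equal-width feature vectors returns features with the same shapes as the inputs, which is what allows the layers to be stacked.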

Load-bearing premise

The assumption that the modular composition of two basic attention units inside each MCA layer is sufficient to capture the fine-grained word-object associations required for VQA.

What would settle it

An experiment in which a non-modular or non-cascaded attention architecture achieves higher than 70.63 percent accuracy on the identical VQA-v2 test-dev split would falsify the central claim.
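
For readers unfamiliar with how a VQA accuracy figure is computed, the sketch below shows the commonly described VQA scoring rule, under which a predicted answer earns full credit once at least three human annotators gave it. This is an assumption about the benchmark's metric rather than something stated on this page, and the official evaluation script applies further answer normalization and averaging that are omitted here. The function names are hypothetical.

```python
def vqa_accuracy(predicted, human_answers):
    # Credit is min(number of matching annotators / 3, 1).
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


def overall_accuracy(predictions, annotations):
    # Mean per-question score, expressed as a percentage.
    scores = [vqa_accuracy(p, h) for p, h in zip(predictions, annotations)]
    return 100.0 * sum(scores) / len(scores)
```

Under this rule, comparing two architectures on the same split amounts to comparing their `overall_accuracy` values over identical question sets.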

If this is right

Deeper stacking of MCA layers improves VQA accuracy over shallow counterparts.
The modular design enables effective depth without richer cross-modal fusion.
The approach sets a new state-of-the-art single-model result on VQA-v2.
Ablation studies isolate the contributions of depth and modularity to the gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The success of simple modular stacking may reduce the need for complex cross-modal fusion designs in other vision-language tasks.
Similar modular depth could be tested on related benchmarks such as visual grounding or referring expression comprehension.
Extending the cascade further might yield additional gains if training stability is maintained.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

MCAN shows that cascading modular co-attention layers lifts single-model accuracy to 70.63% on VQA-v2 test-dev where shallow models stalled.

The core result is that depth helps when co-attention is built from repeated modular layers rather than one or two shallow blocks. Each MCA layer wires two basic attention units together to perform question self-attention, image self-attention, and guided cross-attention; stacking these layers produces the reported gain over prior co-attention baselines on VQA-v2. The ablations appear to isolate the contribution of the extra layers and of the modular split, which is the part that actually moves the number. The code release is a plus for anyone who wants to check the implementation or run their own controls.

The 70.63% figure is the clearest empirical takeaway; the abstract reports it as exceeding the previous state of the art. The main limitation visible from the abstract is the lack of error bars or significance tests, so judging the stability of the improvement across runs requires the full tables. The design choice that two attention units per layer suffice for fine-grained word-object links is supported by the ablations, but it would help to see exactly which alternative fusions were tried and rejected.

This is a straightforward empirical paper aimed at the VQA community; anyone working on attention-based multimodal models will want the numbers and the architecture details. It is worth sending to referees because the result is new, the method is reproducible, and the controls are present, even if the statistical reporting could be tighter.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Modular Co-Attention Network (MCAN) for Visual Question Answering, consisting of cascaded Modular Co-Attention (MCA) layers. Each MCA layer models question and image self-attention plus guided cross-attention via a modular composition of two basic attention units. The central empirical claim is that the best single MCAN model reaches 70.63% overall accuracy on the VQA-v2 test-dev set, outperforming prior state-of-the-art, with supporting ablation studies on the architecture and code release.

Significance. If the reported gains hold, the work establishes that deep cascaded co-attention can deliver clear improvements over shallow baselines in VQA through modular attention composition, advancing cross-modal fusion. The extensive ablations and public code release are strengths that enable direct verification of the design choices and support reproducibility.

major comments (2)

[Abstract] Abstract: the motivation that 'deep co-attention models show little improvement over their shallow counterparts' is stated without quantitative citations or numbers from the referenced prior works; this underpins the need for the proposed depth and modularity.
[Ablation studies] Experimental results (ablation studies): while the modular composition of two attention units per MCA layer is presented as sufficient for fine-grained associations, the ablations should include a direct control comparing against richer cross-modal fusion mechanisms (e.g., more than two units or alternative connectivity) to confirm this is not an under-capacity design.

minor comments (2)

[Abstract] The abstract reports the 70.63% figure without error bars, number of runs, or statistical significance tests against the prior SOTA; adding these would strengthen the reliability assessment of the central performance claim.
[Experiments] Dataset split details and training hyperparameters (e.g., exact VQA-v2 train/val/test-dev partitions and random seeds) are referenced but could be expanded in the experimental section for full reproducibility.
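
The error-bar request in the minor comments could be met with a summary like the one sketched below, which reports the mean and sample standard deviation of accuracy over several training runs. The run values are hypothetical placeholders chosen for easy arithmetic; they are not results from the paper, and `summarize_runs` is an illustrative name.

```python
import statistics


def summarize_runs(accuracies):
    # Mean and sample standard deviation across independent runs.
    mean = statistics.mean(accuracies)
    # A single run has no spread to estimate, so report 0.0 in that case.
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std


# Hypothetical placeholder accuracies from three seeds (not paper results).
runs = [60.0, 61.0, 62.0]
mean, std = summarize_runs(runs)
print(f"{mean:.2f} ± {std:.2f}")  # prints "61.00 ± 1.00"
```

Reporting the number of runs alongside the mean and spread would let a reader judge whether a gap to a prior result is larger than run-to-run variation.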

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the positive review and recommendation of minor revision. We appreciate the constructive feedback on the abstract motivation and ablation studies. We address each major comment below.

Referee: [Abstract] Abstract: the motivation that 'deep co-attention models show little improvement over their shallow counterparts' is stated without quantitative citations or numbers from the referenced prior works; this underpins the need for the proposed depth and modularity.

Authors: We agree that the abstract would be strengthened by quantitative support. The claim draws from the broader literature on co-attention models, but specific numbers and citations were omitted for brevity. In the revision we will update the abstract to include concrete accuracy figures (e.g., from Bottom-Up Top-Down and related shallow vs. deeper baselines) together with the relevant references.

revision: yes

Referee: [Ablation studies] Experimental results (ablation studies): while the modular composition of two attention units per MCA layer is presented as sufficient for fine-grained associations, the ablations should include a direct control comparing against richer cross-modal fusion mechanisms (e.g., more than two units or alternative connectivity) to confirm this is not an under-capacity design.

Authors: Our existing ablations already examine the effect of stacking MCA layers and varying attention heads, showing consistent gains from the modular two-unit design. Nevertheless, the referee's suggestion for an explicit control against richer per-layer mechanisms is reasonable to further rule out under-capacity. We will add this comparison (more than two units and alternative connectivities) to the ablation section in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper's central claim is an empirical accuracy result (70.63% on VQA-v2 test-dev) obtained by training the proposed MCAN architecture of cascaded MCA layers. Each MCA layer is defined as a modular composition of two basic attention units for self-attention and guided-attention. No equations, parameters, or predictions in the manuscript reduce the reported accuracy to a fitted input by construction. No self-citation chain, uniqueness theorem, or ansatz is invoked to force the result; ablations and code release supply independent empirical controls. The derivation chain consists of model design followed by standard supervised training and evaluation, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The model rests on standard attention mechanisms from prior literature and training on the VQA-v2 dataset; no new physical entities or ad-hoc constants are introduced beyond typical deep-learning hyperparameters.

free parameters (2)

number of MCA layers
Depth is a design choice that must be selected to achieve the reported performance.
attention module dimensions and heads
Standard hyperparameters whose values affect the final accuracy.
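
The two free parameters listed above can be pictured as a small configuration object. This is a sketch only: the class name `MCANConfig`, its field names, and the example values are hypothetical, and the divisibility check reflects the usual multi-head attention convention rather than anything stated on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MCANConfig:
    num_layers: int   # depth of the MCA cascade
    hidden_dim: int   # width of the attention features
    num_heads: int    # number of attention heads

    def __post_init__(self):
        if self.num_layers < 1:
            raise ValueError("num_layers must be at least 1")
        if self.num_heads < 1:
            raise ValueError("num_heads must be at least 1")
        # Assumed convention: heads split the hidden dimension evenly.
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")

    @property
    def head_dim(self):
        # Feature width handled by each head.
        return self.hidden_dim // self.num_heads


# Placeholder values for illustration, not the paper's settings.
config = MCANConfig(num_layers=2, hidden_dim=64, num_heads=4)
```

Making the object immutable keeps the depth and width fixed for a run, so each reported accuracy can be tied to one explicit setting.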

axioms (1)

domain assumption: Attention units can jointly model intra-modal and inter-modal dependencies in vision-language data
Invoked when the MCA layer is defined as a composition of two basic attention units.

pith-pipeline@v0.9.0 · 5745 in / 1205 out tokens · 26780 ms · 2026-05-25T16:17:28.589905+00:00 · methodology

Original abstract

Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the guided-attention of images jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state-of-the-art. Our best single model delivers 70.63% overall accuracy on the test-dev set. Code is available at https://github.com/MILVLG/mcan-vqa.

Figures

Figures reproduced from arXiv: 1906.10770 by Dacheng Tao, Jun Yu, Qi Tian, Yuhao Cui, Zhou Yu.

**Figure 1.** Accuracies vs. co-attention depth on VQA-v2 val split. We list most of the state-of-the-art approaches with (deep) co-attention models. Except for DCN [24] which uses the convolutional visual features and thus leads to inferior performance, all the compared methods (i.e., MCAN, BAN [14] and MFH [33]) use the same bottom-up attention visual features to represent images [1].

**Figure 3.** Flowcharts of three MCA variants for VQA. (Y) …

**Figure 4.** Overall flowchart of the deep Modular Co-Attention Networks (MCAN). In the Deep Co-attention Learning stage, …

**Figure 5.** Two deep co-attention models based on a cascade …

**Figure 6.** The overall and per-type accuracies of the MCAN …

**Figure 7.** Visualizations of the learned attention maps ( …

**Figure 8.** Typical examples of the learned image and question attentions by Eq.(5). For each example, the image, question …

**Figure 9.** Two examples of the learned attention maps from typical attention units and layers. For each attention unit (within …

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 10 internal anchors

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6077–6086, 2018.
[2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[5] Ankur Bapna, Mia Xu Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training deeper neural machine translation models with transparent attention. arXiv preprint arXiv:1808.07561, 2018.
[6] Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. Mutan: Multimodal tucker fusion for visual question answering. In International Conference on Computer Vision (ICCV), pages 2612–2620, 2017.
[7] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. Abc-cnn: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015.
[8] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in neural information processing systems (NIPS), pages 577–585, 2015.
[9] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634, 2015.
[10] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
[11] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6904–6913, 2017.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[14] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. Advances in neural information processing systems (NIPS), 2018.
[15] Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal residual learning for visual qa. In Advances in neural information processing systems (NIPS), pages 361–369, 2016.
[16] Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard Product for Low-rank Bilinear Pooling. In International Conference on Learning Representation (ICLR), 2017.
[17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
[19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014.
[20] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Advances in neural information processing systems (NIPS), pages 289–297, 2016.
[21] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in neural information processing systems (NIPS), pages 1682–1690, 2014.
[22] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems (NIPS), pages 2204–2212, 2014.
[23] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471, 2016.
[24] Duy-Kien Nguyen and Takayuki Okatani. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6087–6096, 2018.
[25] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
[26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (NIPS), pages 91–99, 2015.
[27] Kevin J Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4613–4621, 2016.
[28] Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711, 2017.
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
[30] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), volume 14, pages 77–81, 2015.
[31] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 21–29, 2016.
[32] Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. IEEE International Conference on Computer Vision (ICCV), pages 1839–1848, 2017.
[33] Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems, 29(12):5947–5959, 2018.
[34] Zhou Yu, Jun Yu, Chenchao Xiang, Zhou Zhao, Qi Tian, and Dacheng Tao. Rethinking diversified and discriminative proposal generation for visual grounding. International Joint Conference on Artificial Intelligence (IJCAI), pages 1114–1120, 2018.
[35] Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Learning to count objects in natural images for visual question answering. International Conference on Learning Representation (ICLR), 2018.
[36] Zhou Zhao, Zhu Zhang, Shuwen Xiao, Zhou Yu, Jun Yu, Deng Cai, Fei Wu, and Yueting Zhuang. Open-ended long-form video question answering via adaptive hierarchical reinforced networks. In International Joint Conference on Artificial Intelligence (IJCAI), pages 3683–3689, 2018.
[37] Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.

Appendix A. Model Ensembling. To compare MCAN to the best results on VQA-v2 leaderboard, we train 4 MCAN ed-6 models with slightly different hyper-parameters for ensemble. The comparative resu...

[1] [1]

Bottom-up and top-down attention for image captioning and visual question answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages 6077–6086, 2018

work page 2018

[2] [2]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In International Conference on Computer Vision (ICCV), pages 2425–2433, 2015

work page 2015

[3] [3]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450 , 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[4] [4]

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[5] [5]

Training Deeper Neural Machine Translation Models with Transparent Attention

Ankur Bapna, Mia Xu Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training deeper neural machine translation models with transparent attention. arXiv preprint arXiv:1808.07561, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

Mutan: Multimodal tucker fusion for visual question answering

Hedi Ben-Younes, R ´emi Cadene, Matthieu Cord, and Nicolas Thome. Mutan: Multimodal tucker fusion for visual question answering. In International Conference on Computer Vision (ICCV), pages 2612–2620, 2017

work page 2017

[7] [7]

ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering

Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. Abc-cnn: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[8] [8]

Attention-based models for speech recognition

Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in neural information processing systems (NIPS) , pages 577–585, 2015

work page 2015

[9] [9]

Long-term recurrent convolutional networks for visual recognition and description

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634, 2015

work page 2015

[10] [10]

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6904–6913, 2017

work page 2017

[12] [12]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016

work page 2016

[13] [13]

Long short-term memory

Sepp Hochreiter and J ¨urgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

work page 1997

[14] [14]

Bilinear attention networks

Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. Advances in neural information processing systems (NIPS), 2018

work page 2018

[15] [15]

Multimodal residual learning for visual qa

Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal residual learning for visual qa. In Advances in neural information processing systems (NIPS) , pages 361– 369, 2016

work page 2016

[16] [16]

Hadamard Product for Low-rank Bilinear Pooling

Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard Product for Low-rank Bilinear Pooling. In International Conference on Learning Representation (ICLR), 2017

work page 2017

[17] [17]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[18] [18]

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[19] [19]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV) , pages 740–755, 2014

work page 2014

[20] [20]

Hierarchical question-image co-attention for visual question answering

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Advances in neural information processing systems (NIPS), pages 289–297, 2016

[21]

A multi-world approach to question answering about real-world scenes based on uncertain input

Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in neural information processing systems (NIPS), pages 1682–1690, 2014

[22]

Recurrent models of visual attention

Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems (NIPS), pages 2204–2212, 2014

[23]

Dual Attention Networks for Multimodal Reasoning and Matching

Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471, 2016

[24]

Duy-Kien Nguyen and Takayuki Okatani. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6087–6096, 2018

[25]

Glove: Global vectors for word representation

Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014

[26]

Faster r-cnn: Towards real-time object detection with region proposal networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (NIPS), pages 91–99, 2015

[27]

Where to look: Focus regions for visual question answering

Kevin J Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4613–4621, 2016

[28]

Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge

Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711, 2017

[29]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017

[30]

Show, attend and tell: Neural image caption generation with visual attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), volume 14, pages 77–81, 2015

[31]

Stacked attention networks for image question answering

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 21–29, 2016

[32]

Multi-modal factorized bilinear pooling with co-attention learning for visual question answering

Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. IEEE International Conference on Computer Vision (ICCV), pages 1839–1848, 2017

[33]

Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering

Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems, 29(12):5947–5959, 2018

[34]

Rethinking diversified and discriminative proposal generation for visual grounding

Zhou Yu, Jun Yu, Chenchao Xiang, Zhou Zhao, Qi Tian, and Dacheng Tao. Rethinking diversified and discriminative proposal generation for visual grounding. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1114–1120, 2018

[35]

Learning to count objects in natural images for visual question answering

Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Learning to count objects in natural images for visual question answering. International Conference on Learning Representations (ICLR), 2018

[36]

Open-ended long-form video question answering via adaptive hierarchical reinforced networks

Zhou Zhao, Zhu Zhang, Shuwen Xiao, Zhou Yu, Jun Yu, Deng Cai, Fei Wu, and Yueting Zhuang. Open-ended long-form video question answering via adaptive hierarchical reinforced networks. In International Joint Conference on Artificial Intelligence (IJCAI), pages 3683–3689, 2018

[37]

Simple Baseline for Visual Question Answering

Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.

Appendix A. Model Ensembling

To compare MCAN to the best results on the VQA-v2 leaderboard, we train 4 MCAN ed-6 models with slightly different hyper-parameters for ensemble. The comparative resu...
