Transformer Module Networks for Systematic Generalization in Visual Question Answering

Kentaro Takemoto; Moyuru Yamada; Tomotake Sasaki; Vanessa D'Amario; Xavier Boix

arxiv: 2201.11316 · v2 · pith:FTOACBK4new · submitted 2022-01-27 · 💻 cs.CV · cs.LG

Moyuru Yamada, Vanessa D'Amario, Kentaro Takemoto, Xavier Boix, Tomotake Sasaki

keywords module, transformers, generalization, performance, systematic, achieve, compositions, modules

Transformers achieve great performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., their ability to handle novel combinations of known concepts, are unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve systematic generalization performance better than or similar to that of conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers with respect to NMNs, we investigate whether and how modularity can benefit Transformers. Namely, we introduce the Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers for novel compositions of sub-tasks. We show that not only the module composition but also the module specialization for each sub-task is key to this performance gain.
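The idea of a question-specific composition of specialized modules can be illustrated with a toy sketch. This is not the authors' implementation: in a TMN the modules are Transformer modules operating on visual features, whereas here each module is a hand-written Python function over a list of attribute dictionaries. The module names, the `run_program` helper, and the scene are all hypothetical, chosen only to show how one library of sub-task modules can be recombined for a composition that was never paired before.

```python
from typing import Callable, Dict, List

# Toy scene: each object is a dict of attributes (a stand-in for visual features).
Scene = List[dict]
Module = Callable[[Scene], Scene]


def make_filter(attribute: str, value: str) -> Module:
    """Build a module specialized for one sub-task: keep matching objects."""
    def module(state: Scene) -> Scene:
        return [obj for obj in state if obj.get(attribute) == value]
    return module


# Hypothetical module library: one specialized module per sub-task.
MODULES: Dict[str, Module] = {
    "filter_red": make_filter("color", "red"),
    "filter_blue": make_filter("color", "blue"),
    "filter_cube": make_filter("shape", "cube"),
    "filter_sphere": make_filter("shape", "sphere"),
}


def run_program(program: List[str], scene: Scene) -> int:
    """Apply the modules named by the program in order, then count what is left."""
    state = scene
    for name in program:
        state = MODULES[name](state)
    return len(state)


scene = [
    {"color": "red", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
    {"color": "blue", "shape": "cube"},
    {"color": "blue", "shape": "sphere"},
    {"color": "blue", "shape": "sphere"},
]

# Two questions, two compositions of the same module library.
red_cubes = run_program(["filter_red", "filter_cube"], scene)
blue_spheres = run_program(["filter_blue", "filter_sphere"], scene)
```

Because each module handles a single sub-task, a program such as `["filter_blue", "filter_sphere"]` reuses the same modules as any other program, which is the property the abstract associates with systematic generalization to novel compositions.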

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning to Theorize the World from Observation
cs.LG 2026-05 unverdicted novelty 7.0

NEO is a probabilistic neural model that induces compositional programs as a learned Language of Thought from non-textual observations and executes them via a shared transition model to enable explanation-driven gener...
Learning to Theorize the World from Observation
cs.LG 2026-05 unverdicted novelty 6.0

NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.