Learning to Assemble Neural Module Tree Networks for Visual Grounding

Daqing Liu; Feng Wu; Hanwang Zhang; Zheng-Jun Zha

arxiv: 1812.03299 · v3 · pith:GCNURVTWnew · submitted 2018-12-08 · 💻 cs.CV

classification 💻 cs.CV

keywords grounding, visual, composite, module, language, neural, nmtree, tree

Visual grounding, the task of grounding (i.e., localizing) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplets. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion. In particular, we develop a novel modular network called Neural Module Tree network (NMTree) that regularizes visual grounding along the dependency parsing tree of the sentence: each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated bottom-up where needed. NMTree disentangles visual grounding from composite reasoning, allowing the former to focus only on primitive and easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end using the Gumbel-Softmax approximation and its straight-through gradient estimator, which accounts for the discrete nature of module assembly. Overall, the proposed NMTree consistently outperforms state-of-the-art methods on several benchmarks. Qualitative results show the grounding score calculation in explainable detail.
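
The two mechanisms named in the abstract, a discrete module choice sampled with Gumbel-Softmax and a bottom-up accumulation of grounding scores along a tree, can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the two-way gate ("ignore children" vs. "add children"), the field names, and the toy scores are all assumptions made for illustration. Only the forward pass is shown; in an autograd framework the straight-through estimator would use the one-hot `hard` sample in the forward pass and take gradients through the relaxed `soft` sample.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return (soft, hard): a relaxed categorical sample and its one-hot argmax."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        # U is clamped away from 0 and 1 to keep both logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Discrete decision used in the forward pass.
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard


class Node:
    """Hypothetical tree node: one word with per-region scores and a 2-way gate."""

    def __init__(self, word, local_scores, gate_logits, children=()):
        self.word = word
        self.local_scores = local_scores  # one score per candidate image region
        self.gate_logits = gate_logits    # [ignore children, add children]
        self.children = list(children)


def ground(node, tau=1.0, rng=random):
    """Accumulate per-region grounding scores bottom-up over the subtree."""
    child_sum = [0.0] * len(node.local_scores)
    for child in node.children:
        child_scores = ground(child, tau, rng)
        child_sum = [a + b for a, b in zip(child_sum, child_scores)]
    # Sample the discrete module choice for this node.
    _, hard = gumbel_softmax(node.gate_logits, tau, rng)
    add_children = hard[1]
    return [s + add_children * c for s, c in zip(node.local_scores, child_sum)]


# Toy tree for "cat on sofa" over three candidate regions (made-up numbers).
# The extreme gate logits make the sampled choice effectively deterministic.
sofa = Node("sofa", [0.2, 0.0, 0.3], [1000.0, -1000.0])
on = Node("on", [0.0, 0.5, 0.5], [1000.0, -1000.0], [sofa])
cat = Node("cat", [0.1, 0.9, 0.0], [-1000.0, 1000.0], [on])

scores = ground(cat)
best_region = max(range(len(scores)), key=scores.__getitem__)
```

In this toy tree the root adds its child's scores while the inner node ignores its own child, which mirrors the idea that scores are passed upward only where needed. Lowering `tau` pushes `soft` toward the one-hot `hard` sample, at the cost of higher-variance gradients in a real training setup.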

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VistaRef: Boosting Visual Spatial Orientation Awareness for Pointing-to-Object Detection
cs.CV 2026-06 unverdicted novelty 5.0

VistaRef improves pointing-to-object detection accuracy by 14 points via local hand entity modeling, geometric ray modeling, and an orientation-consistent alignment loss.