arxiv: 1812.03299 · v3 · pith:GCNURVTWnew · submitted 2018-12-08 · 💻 cs.CV

Learning to Assemble Neural Module Tree Networks for Visual Grounding

classification 💻 cs.CV
keywords grounding, visual, composite, module, language, neural, nmtree, tree

Visual grounding, the task of grounding (i.e., localizing) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplets. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion. In particular, we develop a novel modular network called Neural Module Tree network (NMTree) that regularizes the visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated bottom-up as needed. NMTree disentangles the visual grounding from the composite reasoning, allowing the former to focus only on primitive and easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end by using the Gumbel-Softmax approximation and its straight-through gradient estimator, accounting for the discrete nature of module assembly. Overall, the proposed NMTree consistently outperforms the state of the art on several benchmarks. Qualitative results show explainable grounding score calculation in great detail.
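
The two mechanisms named in the abstract, bottom-up score accumulation along a dependency tree and Gumbel-Softmax sampling for a discrete module choice, can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the `accumulate` and `gumbel_softmax_sample` functions, the per-region scores, and the choice to combine child scores by simple summation are all assumptions made for illustration. Only the forward pass is shown; in an autodiff framework the straight-through estimator would use the hard one-hot vector in the forward pass and the gradient of the soft vector in the backward pass.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical dependency-tree node with one score per image region."""
    word: str
    local_scores: List[float]
    children: List["Node"] = field(default_factory=list)


def accumulate(node: Node) -> List[float]:
    """Accumulate region scores bottom-up (assumed here to be a plain sum)."""
    total = list(node.local_scores)
    for child in node.children:
        child_scores = accumulate(child)  # recurse into the subtree first
        total = [t + c for t, c in zip(total, child_scores)]
    return total


def gumbel_softmax_sample(
    logits: List[float], tau: float = 1.0, rng: random.Random = None
) -> Tuple[List[float], List[float]]:
    """Return (soft, hard) samples: a relaxed distribution and its one-hot argmax."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    norm = sum(exps)
    soft = [e / norm for e in exps]
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard


# Toy tree for "umbrella held by woman", scored over three candidate regions.
woman = Node("woman", [0.0, 0.25, 0.0])
held = Node("held", [0.25, 0.5, 0.0], [woman])
umbrella = Node("umbrella", [0.5, 0.25, 0.0], [held])

scores = accumulate(umbrella)
best_region = max(range(len(scores)), key=scores.__getitem__)

# Sample a discrete choice among three hypothetical module types.
soft, hard = gumbel_softmax_sample([2.0, 0.5, 0.1], tau=0.5, rng=random.Random(0))
```

In this toy tree the accumulated scores are `[0.75, 1.0, 0.0]`, so the second region is selected. Lowering `tau` pushes the soft sample closer to the one-hot vector, at the cost of higher-variance gradients when the relaxation is trained.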

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VistaRef: Boosting Visual Spatial Orientation Awareness for Pointing-to-Object Detection

    cs.CV 2026-06 unverdicted novelty 5.0

    VistaRef improves pointing-to-object detection accuracy by 14 points via local hand entity modeling, geometric ray modeling, and an orientation-consistent alignment loss.