Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions

Hanwang Zhang; Shih-Fu Chang; Yulei Niu; Zhiwu Lu

arxiv: 1907.03609 · v1 · pith:D53YU44Rnew · submitted 2019-07-08 · 💻 cs.CV

Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions

Yulei Niu , Hanwang Zhang , Zhiwu Lu , Shih-Fu Chang This is my paper

Pith reviewed 2026-05-25 01:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords referring expression groundingvariational Bayesian inferencecontext modelingvisual groundingmultimodal comprehensionunsupervised groundingimage region localization

0 comments

The pith

A variational Bayesian framework captures reciprocal influence between a referent and its context to ground referring expressions without modeling full exponential complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve referring expression grounding by localizing objects in images based on descriptions that include attributes and relationships. It argues that existing pairwise approaches oversimplify context, so a variational method is introduced that lets the estimated referent and context mutually refine each other's posterior distributions. This reciprocity shrinks the context search space while also enforcing that the original expression can be reconstructed from the inferred context. The same model extends directly to cases with no referent annotations. Experiments across benchmarks indicate consistent gains in both supervised and unsupervised regimes.

Core claim

The central claim is that a variational posterior approximation exploiting the reciprocal relation between referent and context reduces the search space for complex multimodal context, and that adding semantic reproduction of the referring expression from the estimated context further improves grounding; the resulting model outperforms prior art on standard datasets in both supervised and unsupervised settings.

What carries the argument

Variational context framework that alternates reciprocal posterior estimation between referent and context while enforcing expression reconstruction from context.

If this is right

Grounding accuracy improves on benchmarks that require distinguishing same-category objects via attributes and relations.
The same architecture applies without change to the unsupervised case lacking referent location labels.
Context search remains tractable even when multiple image regions participate in the expression.
Semantic consistency between expression and inferred context acts as an additional training signal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reciprocity idea could be tested on video sequences where context evolves over time.
If the variational reduction holds, similar mutual-influence models might apply to other multimodal tasks such as visual question answering.
Failure on scenes with highly ambiguous context would indicate the approximation misses critical higher-order dependencies.

Load-bearing premise

The reciprocal influence between referent and context can be adequately captured by a variational approximation that avoids the full joint modeling's exponential cost.

What would settle it

A controlled experiment on images with many same-category objects where removing the reciprocal update step produces no drop in grounding accuracy.

Figures

Figures reproduced from arXiv: 1907.03609 by Hanwang Zhang, Shih-Fu Chang, Yulei Niu, Zhiwu Lu.

**Figure 2.** Figure 2: The architecture of the proposed Variational Context framework. It consists of a region feature extraction module (Section 4.1), and a [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Two qualitative examples of the cue-specific language [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative results on RefCOCOg (det) showing comparisons between correct (green tick) and wrong referent grounds (red cross) by VC [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Performances of VC and CMN with different number of object bounding boxes on RefCOCO Test A &B, RefCOCO+ Test A & B, and [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative results of our full model (VC w/ Gen+PG) on RefCOCOg (det). The first column shows the grounding results. The second column [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Common failure cases of our full model in supervised grounding on RefCOCOg. Each example shows grounding results, context estimation [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Common failure cases in unsupervised grounding with detected [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 10.** Figure 10: Example generation results using our full model (VC w / Gen+PG) on three datasets. The ground-truth/generated expression is linked with [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗

read the original abstract

We focus on grounding (i.e., localizing or linking) referring expressions in images, e.g., ``largest elephant standing behind baby elephant''. This is a general yet challenging vision-language task since it does not only require the localization of objects, but also the multimodal comprehension of context -- visual attributes (e.g., ``largest'', ``baby'') and relationships (e.g., ``behind'') that help to distinguish the referent from other objects, especially those of the same category. Due to the exponential complexity involved in modeling the context associated with multiple image regions, existing work oversimplifies this task to pairwise region modeling by multiple instance learning. In this paper, we propose a variational Bayesian method, called Variational Context, to solve the problem of complex context modeling in referring expression grounding. Specifically, our framework exploits the reciprocal relation between the referent and context, i.e., either of them influences estimation of the posterior distribution of the other, and thereby the search space of context can be greatly reduced. In addition to reciprocity, our framework considers the semantic information of context, i.e., the referring expression can be reproduced based on the estimated context. We also extend the model to unsupervised setting where no annotation for the referent is available. Extensive experiments on various benchmarks show consistent improvement over state-of-the-art methods in both supervised and unsupervised settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The variational reciprocity idea targets a real modeling gap in referring grounding but the abstract leaves the posterior approximation too underspecified to judge if complexity reduction actually holds.

read the letter

The main point is a variational Bayesian model for referring expression grounding that exploits reciprocity between referent and context to shrink the search space, adds a semantic reproduction term so the expression can be regenerated from the estimated context, and extends the whole thing to an unsupervised setting without referent labels. This combination is presented as new relative to the pairwise region modeling that earlier work used to dodge exponential complexity. The paper does a clean job stating the problem and why full context modeling is hard, then positions the reciprocal influence and reproduction as mechanisms that let the posterior of one inform the other. It also claims consistent gains over prior methods on standard benchmarks in both supervised and unsupervised regimes, which is the kind of evidence that matters for this task. The soft spot is the lack of any equations, factorization details, or ELBO derivation in the abstract. Without seeing how the variational family is defined or how reciprocity is actually encoded in the approximate posterior, it is difficult to tell whether the claimed reduction in search space is real or whether the model collapses back to something close to pairwise relations. The stress-test concern about a possibly too-restrictive variational family making the complexity savings illusory is reasonable based on what is shown. If the full paper supplies the math and ablations that validate the joint modeling, that would address it; otherwise the central claim stays conceptual. This work is for people already working on vision-language grounding who care about context modeling beyond pairs. A reader who follows Bayesian approaches in multimodal tasks would find the framing useful even if they end up disagreeing with the execution. It deserves a serious referee because the problem it attacks is substantive and the proposed angle is distinct from the cited priors. Recommendation: send it out for review so the technical details can be checked properly.

Referee Report

3 major / 2 minor

Summary. The paper proposes a variational Bayesian method called Variational Context for grounding referring expressions in images. It exploits the reciprocal relation between the referent and context to approximate the posterior and reduce the exponential complexity of full context modeling, while also incorporating semantic reproduction of the referring expression from the estimated context. The approach is extended to an unsupervised setting without referent annotations, with experiments claiming consistent improvements over state-of-the-art on various benchmarks.

Significance. If the variational approximation is shown to validly encode reciprocity and achieve the claimed complexity reduction without restrictive assumptions that undermine the joint modeling, the work would be significant for addressing a core challenge in vision-language tasks: scalable multimodal context comprehension beyond pairwise region modeling. The unsupervised extension is a notable strength that broadens applicability.

major comments (3)

[Abstract, §3] Abstract and §3 (framework description): The central claim that reciprocity between referent and context 'greatly reduces' the search space via variational posterior approximation lacks any explicit derivation, factorization assumptions on the variational family q(·), latent variable definitions, or ELBO objective; without these, it is impossible to verify whether the approximation encodes the claimed reciprocal influence or merely reparameterizes independent pairwise terms as noted in the skeptic analysis.
[§3.2] §3.2 (variational posterior): The semantic reproduction term (referring expression reproduced from estimated context) is presented as enforcing joint modeling, but no equation shows how this term interacts with the reciprocity mechanism or prevents the approximation from collapsing to the exponential enumeration it aims to avoid.
[§5] §5 (experiments): Claims of 'consistent improvement' are stated without reported error bars, ablation studies on the variational components, dataset statistics, or comparisons isolating the reciprocity reduction; this leaves the load-bearing claim of effective complexity reduction without quantitative support.

minor comments (2)

[Abstract] Abstract: No equations, implementation details, or hyperparameter settings are supplied, which hinders immediate assessment of the variational realization.
[§3] Notation: The distinction between the true posterior p(context|referent, expression) and the variational approximation is not clearly delineated in the high-level description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments. We will revise the manuscript to provide the missing derivations and strengthen the experimental analysis as detailed in the point-by-point responses below.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (framework description): The central claim that reciprocity between referent and context 'greatly reduces' the search space via variational posterior approximation lacks any explicit derivation, factorization assumptions on the variational family q(·), latent variable definitions, or ELBO objective; without these, it is impossible to verify whether the approximation encodes the claimed reciprocal influence or merely reparameterizes independent pairwise terms as noted in the skeptic analysis.

Authors: We agree that the current presentation lacks sufficient mathematical detail. In the revised version, we will include an explicit derivation of the variational approximation, specify the factorization assumptions for the variational family q(·) to model reciprocity (such as separate conditionals for referent and context), define the latent variables clearly, and present the full ELBO objective. This will clarify how the approximation captures the reciprocal influence. revision: yes
Referee: [§3.2] §3.2 (variational posterior): The semantic reproduction term (referring expression reproduced from estimated context) is presented as enforcing joint modeling, but no equation shows how this term interacts with the reciprocity mechanism or prevents the approximation from collapsing to the exponential enumeration it aims to avoid.

Authors: The semantic reproduction term acts as a regularizer in the variational objective to ensure the estimated context captures the necessary semantics. We will add equations in the revised §3.2 demonstrating its integration with the reciprocity terms in the ELBO and how the overall framework avoids exponential complexity through the variational approximation. revision: yes
Referee: [§5] §5 (experiments): Claims of 'consistent improvement' are stated without reported error bars, ablation studies on the variational components, dataset statistics, or comparisons isolating the reciprocity reduction; this leaves the load-bearing claim of effective complexity reduction without quantitative support.

Authors: We recognize that additional experimental details are needed to support the claims. The revised manuscript will report error bars, include ablation studies on the reciprocity and semantic reproduction components, provide dataset statistics, and add experiments isolating the complexity reduction achieved by the variational approach. revision: yes

Circularity Check

0 steps flagged

No circularity: variational reciprocity claim is independent of self-defined inputs or self-citations

full rationale

The abstract presents a variational Bayesian framework whose central mechanism (reciprocal influence between referent and context reducing search space, plus semantic reproduction) is stated as a modeling choice without any quoted equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation step reduces by construction to its own inputs, and the unsupervised extension is described as a direct extension rather than a reparameterization of prior results. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The method rests on standard variational inference assumptions and the domain assumption that context can be modeled via posterior reciprocity; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)

domain assumption Variational inference provides a tractable approximation to the true posterior over context given the referent
Invoked as the core mechanism to reduce search space without exponential complexity.

invented entities (1)

Variational Context model no independent evidence
purpose: To capture reciprocal influence and semantic reproduction between referent and context
New framework introduced to solve the stated problem of complex context modeling.

pith-pipeline@v0.9.0 · 5778 in / 1238 out tokens · 33708 ms · 2026-05-25T01:13:05.485526+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 3 internal anchors

[1]

Antol, A

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In ICCV, 2015

work page 2015
[2]

J. Ba, V . Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015

work page 2015
[3]

Bahdanau, K

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. 2015

work page 2015
[4]

X. Chen, L. Ma, J. Chen, Z. Jie, W. Liu, and J. Luo. Real-time referring expression comprehension by single-stage grounding network. arXiv preprint arXiv:1812.03426, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

K. Cho, B. Van Merri ¨enboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[6]

B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In CVPR, 2017. 12 RefCOCO Test A RefCOCO Test B RefCOCO+ Test A RefCOCO+ Test B RefCOCOg Val* kid left bear in red right bear pizza in front pizza in the back red shirt man in white shirt closest bed smaller bed a man in a blue coat standing in the snow a pair of a...

work page 2017
[7]

A. Das, S. Kottur, J. M. Moura, S. Lee, and D. Batra. Learning cooperative visual dialog agents with deep reinforcement learning. In ICCV, 2017

work page 2017
[8]

C. Deng, Q. Wu, Q. Wu, F. Hu, F. Lyu, and M. Tan. Visual grounding via accumulated attention. In CVPR, 2018

work page 2018
[9]

C. Deng, Q. Wu, G. Xu, Z. Yu, Y. Xu, K. Jia, and M. Tan. You only look & listen once: Towards fast and accurate visual grounding. arXiv preprint arXiv:1902.04213, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[10]

T. G. Dietterich, R. H. Lathrop, and T. Lozano-P ´erez. Solving the multiple instance problem with axis-parallel rectangles. Artiﬁcial intelligence, 1997

work page 1997
[11]

C. W. Fox and S. J. Roberts. A tutorial on variational bayesian inference. Artiﬁcial intelligence review, 2012

work page 2012
[12]

Glorot and Y

X. Glorot and Y. Bengio. Understanding the difﬁculty of training deep feedforward neural networks. In ICAIS, 2010

work page 2010
[13]

Golland, P

D. Golland, P . Liang, and D. Klein. A game-theoretic approach to generating spatial descriptions. In EMNLP, 2010

work page 2010
[14]

R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. ICCV, 2017

work page 2017
[15]

R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In CVPR, 2017

work page 2017
[16]

R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In CVPR, 2016

work page 2016
[17]

Ioffe and C

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015

work page 2015
[18]

Kazemzadeh, V

S. Kazemzadeh, V . Ordonez, M. Matten, and T. L. Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014

work page 2014
[19]

D. P . Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014

work page 2014
[20]

Krahmer and K

E. Krahmer and K. Van Deemter. Computational generation of referring expressions: A survey. Computational Linguistics , 38(1):173–218, 2012

work page 2012
[21]

Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, X. Wang, and M. Zhou. Visual question generation as dual task of visual question answering. In CVPR, 2018

work page 2018
[22]

Y. Li, W. Ouyang, and X. Wang. Vip-cnn: Visual phrase guided convolutional neural network. In CVPR, 2017

work page 2017
[23]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P . Perona, D. Ramanan, P . Doll´ar, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014

work page 2014
[24]

J. Liu, L. Wang, and M.-H. Yang. Referring expression generation and comprehension via attributes. In ICCV, 2017

work page 2017
[25]

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016

work page 2016
[26]

C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016

work page 2016
[27]

J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016

work page 2016
[28]

Luo and G

R. Luo and G. Shakhnarovich. Comprehension-guided referring expressions. In CVPR, 2017

work page 2017
[29]

Makhzani, J

A. Makhzani, J. Shlens, N. Jaitly, and I. J. Goodfellow. Adversarial autoencoders. In ICLR Workshop, 2016

work page 2016
[30]

J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016

work page 2016
[31]

Mitchell, K

M. Mitchell, K. van Deemter, and E. Reiter. Natural reference to objects in a visual domain. In INLG, 2010

work page 2010
[32]

Mitchell, K

M. Mitchell, K. Van Deemter, and E. Reiter. Generating expressions that refer to visible objects. In NAACL, 2013

work page 2013
[33]

V . K. Nagaraja, V . I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016

work page 2016
[34]

Pennington, R

J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In EMNLP, 2014

work page 2014
[35]

B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive linguistic cues. In ICCV, 2017

work page 2017
[36]

B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015

work page 2015
[37]

Redmon and A

J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In CVPR, 2017

work page 2017
[38]

S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015

work page 2015
[39]

Rohrbach, M

A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, 2016

work page 2016
[40]

Schuster and K

M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. TSP, 1997

work page 1997
[41]

Schuster, R

S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Workshop on Vision and Language, 2015

work page 2015
[42]

K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. InNIPS, 2015. 13

work page 2015
[43]

Q. Sun, B. Schiele, and M. Fritz. A domain based approach to social relation recognition. In CVPR, 2017

work page 2017
[44]

J. A. Thomas. Meaning in interaction: An introduction to pragmatics . Routledge, 2014

work page 2014
[45]

Thomason, J

J. Thomason, J. Sinapov, and R. Mooney. Guiding interaction behaviors for multi-modal grounded language learning. In Proceedings of the First Workshop on Language Grounding for Robotics, pages 20–24, 2017

work page 2017
[46]

van Deemter, I

K. van Deemter, I. van der Sluis, and A. Gatt. Building a semantically transparent corpus for the generation of referring expressions. In INLG, 2006

work page 2006
[47]

Vinyals, A

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015

work page 2015
[48]

Weaver and N

L. Weaver and N. Tao. The optimal reward baseline for gradient-based reinforcement learning. In UAI, pages 538–545. Morgan Kaufmann Publishers Inc., 2001

work page 2001
[49]

Y. Wei, J. Feng, X. Liang, C. Ming-Ming, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classiﬁcation to semantic segmentation approach. In CVPR, 2017

work page 2017
[50]

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning , 8(3-4):229–256, 1992

work page 1992
[51]

Winograd

T. Winograd. Understanding natural language. Cognitive psychology, 3(1):1–191, 1972

work page 1972
[52]

F. Xiao, L. Sigal, and Y.-J. Lee. Weakly-supervised visual grounding of phrases with linguistic structures. In CVPR, 2017

work page 2017
[53]

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015

work page 2015
[54]

T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016

work page 2016
[55]

X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016

work page 2016
[56]

L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. Mattnet: Modular attention network for referring expression comprehension. In CVPR, 2018

work page 2018
[57]

L. Yu, P . Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, 2016

work page 2016
[58]

L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model for referring expressions. In ICCV, 2017

work page 2017
[59]

Zhang, Z

H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. In CVPR, 2017

work page 2017
[60]

Zhang, Z

H. Zhang, Z. Kyaw, J. Yu, and S.-F. Chang. Ppr-fcn: Weakly supervised visual relation detection via parallel pairwise r-fcn. In ICCV, 2017

work page 2017
[61]

Zhang, Y

H. Zhang, Y. Niu, and S.-F. Chang. Grounding referring expressions in images by variational context. In CVPR, 2018

work page 2018
[62]

Z. Zhao, Q. Yang, D. Cai, X. He, and Y. Zhuang. Video question answering via hierarchical spatio-temporal attention networks. In International Joint Conference on Artiﬁcial Intelligence (IJCAI) , volume 2, 2017

work page 2017
[63]

Zhuang, Q

B. Zhuang, Q. Wu, C. Shen, I. D. Reid, and A. van den Hengel. Parallel attention: A uniﬁed framework for visual object discovery through dialogs and queries. In CVPR, 2018

work page 2018
[64]

C. L. Zitnick and P . Doll ´ar. Edge boxes: Locating object proposals from edges. In ECCV, 2014. Yulei Niu received the B.E. degree in computer science from the Renmin University of China, Beijing, China, in 2015, where he is currently pursuing the Ph.D. degree in computer science. From 2017 to 2018, he visited the Digital Video and Multimedia Laboratory,...

work page 2014

[1] [1]

Antol, A

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In ICCV, 2015

work page 2015

[2] [2]

J. Ba, V . Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015

work page 2015

[3] [3]

Bahdanau, K

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. 2015

work page 2015

[4] [4]

X. Chen, L. Ma, J. Chen, Z. Jie, W. Liu, and J. Luo. Real-time referring expression comprehension by single-stage grounding network. arXiv preprint arXiv:1812.03426, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[5] [5]

K. Cho, B. Van Merri ¨enboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[6] [6]

B. Dai, Y. Zhang, and D. Lin. Detecting visual relationships with deep relational networks. In CVPR, 2017. 12 RefCOCO Test A RefCOCO Test B RefCOCO+ Test A RefCOCO+ Test B RefCOCOg Val* kid left bear in red right bear pizza in front pizza in the back red shirt man in white shirt closest bed smaller bed a man in a blue coat standing in the snow a pair of a...

work page 2017

[7] [7]

A. Das, S. Kottur, J. M. Moura, S. Lee, and D. Batra. Learning cooperative visual dialog agents with deep reinforcement learning. In ICCV, 2017

work page 2017

[8] [8]

C. Deng, Q. Wu, Q. Wu, F. Hu, F. Lyu, and M. Tan. Visual grounding via accumulated attention. In CVPR, 2018

work page 2018

[9] [9]

C. Deng, Q. Wu, G. Xu, Z. Yu, Y. Xu, K. Jia, and M. Tan. You only look & listen once: Towards fast and accurate visual grounding. arXiv preprint arXiv:1902.04213, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902

[10] [10]

T. G. Dietterich, R. H. Lathrop, and T. Lozano-P ´erez. Solving the multiple instance problem with axis-parallel rectangles. Artiﬁcial intelligence, 1997

work page 1997

[11] [11]

C. W. Fox and S. J. Roberts. A tutorial on variational bayesian inference. Artiﬁcial intelligence review, 2012

work page 2012

[12] [12]

Glorot and Y

X. Glorot and Y. Bengio. Understanding the difﬁculty of training deep feedforward neural networks. In ICAIS, 2010

work page 2010

[13] [13]

Golland, P

D. Golland, P . Liang, and D. Klein. A game-theoretic approach to generating spatial descriptions. In EMNLP, 2010

work page 2010

[14] [14]

R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. ICCV, 2017

work page 2017

[15] [15]

R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In CVPR, 2017

work page 2017

[16] [16]

R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In CVPR, 2016

work page 2016

[17] [17]

Ioffe and C

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015

work page 2015

[18] [18]

Kazemzadeh, V

S. Kazemzadeh, V . Ordonez, M. Matten, and T. L. Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014

work page 2014

[19] [19]

D. P . Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014

work page 2014

[20] [20]

Krahmer and K

E. Krahmer and K. Van Deemter. Computational generation of referring expressions: A survey. Computational Linguistics , 38(1):173–218, 2012

work page 2012

[21] [21]

Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, X. Wang, and M. Zhou. Visual question generation as dual task of visual question answering. In CVPR, 2018

work page 2018

[22] [22]

Y. Li, W. Ouyang, and X. Wang. Vip-cnn: Visual phrase guided convolutional neural network. In CVPR, 2017

work page 2017

[23] [23]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P . Perona, D. Ramanan, P . Doll´ar, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014

work page 2014

[24] [24]

J. Liu, L. Wang, and M.-H. Yang. Referring expression generation and comprehension via attributes. In ICCV, 2017

work page 2017

[25] [25]

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016

work page 2016

[26] [26]

C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016

work page 2016

[27] [27]

J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016

work page 2016

[28] [28]

Luo and G

R. Luo and G. Shakhnarovich. Comprehension-guided referring expressions. In CVPR, 2017

work page 2017

[29] [29]

Makhzani, J

A. Makhzani, J. Shlens, N. Jaitly, and I. J. Goodfellow. Adversarial autoencoders. In ICLR Workshop, 2016

work page 2016

[30] [30]

J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016

work page 2016

[31] [31]

Mitchell, K

M. Mitchell, K. van Deemter, and E. Reiter. Natural reference to objects in a visual domain. In INLG, 2010

work page 2010

[32] [32]

Mitchell, K

M. Mitchell, K. Van Deemter, and E. Reiter. Generating expressions that refer to visible objects. In NAACL, 2013

work page 2013

[33] [33]

V . K. Nagaraja, V . I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016

work page 2016

[34] [34]

Pennington, R

J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In EMNLP, 2014

work page 2014

[35] [35]

B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive linguistic cues. In ICCV, 2017

work page 2017

[36] [36]

B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015

work page 2015

[37] [37]

Redmon and A

J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In CVPR, 2017

work page 2017

[38] [38]

S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015

work page 2015

[39] [39]

Rohrbach, M

A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, 2016

work page 2016

[40] [40]

Schuster and K

M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. TSP, 1997

work page 1997

[41] [41]

Schuster, R

S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Workshop on Vision and Language, 2015

work page 2015

[42] [42]

K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. InNIPS, 2015. 13

work page 2015

[43] [43]

Q. Sun, B. Schiele, and M. Fritz. A domain based approach to social relation recognition. In CVPR, 2017

work page 2017

[44] [44]

J. A. Thomas. Meaning in interaction: An introduction to pragmatics . Routledge, 2014

work page 2014

[45] [45]

Thomason, J

J. Thomason, J. Sinapov, and R. Mooney. Guiding interaction behaviors for multi-modal grounded language learning. In Proceedings of the First Workshop on Language Grounding for Robotics, pages 20–24, 2017

work page 2017

[46] [46]

van Deemter, I

K. van Deemter, I. van der Sluis, and A. Gatt. Building a semantically transparent corpus for the generation of referring expressions. In INLG, 2006

work page 2006

[47] [47]

Vinyals, A

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015

work page 2015

[48] [48]

Weaver and N

L. Weaver and N. Tao. The optimal reward baseline for gradient-based reinforcement learning. In UAI, pages 538–545. Morgan Kaufmann Publishers Inc., 2001

work page 2001

[49] [49]

Y. Wei, J. Feng, X. Liang, C. Ming-Ming, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classiﬁcation to semantic segmentation approach. In CVPR, 2017

work page 2017

[50] [50]

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning , 8(3-4):229–256, 1992

work page 1992

[51] [51]

Winograd

T. Winograd. Understanding natural language. Cognitive psychology, 3(1):1–191, 1972

work page 1972

[52] [52]

F. Xiao, L. Sigal, and Y.-J. Lee. Weakly-supervised visual grounding of phrases with linguistic structures. In CVPR, 2017

work page 2017

[53] [53]

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015

work page 2015

[54] [54]

T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016

work page 2016

[55] [55]

X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016

work page 2016

[56] [56]

L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. Mattnet: Modular attention network for referring expression comprehension. In CVPR, 2018

work page 2018

[57] [57]

L. Yu, P . Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, 2016

work page 2016

[58] [58]

L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model for referring expressions. In ICCV, 2017

work page 2017

[59] [59]

Zhang, Z

H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. In CVPR, 2017

work page 2017

[60] [60]

Zhang, Z

H. Zhang, Z. Kyaw, J. Yu, and S.-F. Chang. Ppr-fcn: Weakly supervised visual relation detection via parallel pairwise r-fcn. In ICCV, 2017

work page 2017

[61] [61]

Zhang, Y

H. Zhang, Y. Niu, and S.-F. Chang. Grounding referring expressions in images by variational context. In CVPR, 2018

work page 2018

[62] [62]

Z. Zhao, Q. Yang, D. Cai, X. He, and Y. Zhuang. Video question answering via hierarchical spatio-temporal attention networks. In International Joint Conference on Artiﬁcial Intelligence (IJCAI) , volume 2, 2017

work page 2017

[63] [63]

Zhuang, Q

B. Zhuang, Q. Wu, C. Shen, I. D. Reid, and A. van den Hengel. Parallel attention: A uniﬁed framework for visual object discovery through dialogs and queries. In CVPR, 2018

work page 2018

[64] [64]

C. L. Zitnick and P . Doll ´ar. Edge boxes: Locating object proposals from edges. In ECCV, 2014. Yulei Niu received the B.E. degree in computer science from the Renmin University of China, Beijing, China, in 2015, where he is currently pursuing the Ph.D. degree in computer science. From 2017 to 2018, he visited the Digital Video and Multimedia Laboratory,...

work page 2014