Obj-GloVe: Scene-Based Contextual Object Embedding

Canwen Xu; Chenliang Li; Zhenzhong Chen

arxiv: 1907.01478 · v1 · pith:7J7SGR6Enew · submitted 2019-07-02 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-25 10:52 UTC · model grok-4.3

keywords: contextual object embedding · GloVe adaptation · scene co-occurrence · object detection · text-to-image synthesis · visual embeddings · Open Images

The pith

Treating object co-occurrences in images like word co-occurrences produces useful contextual embeddings for visual objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes to build embeddings for common visual objects by applying the GloVe method, originally for words, to their co-occurrence statistics across scenes in a large image collection. This creates vectors that encode how objects typically appear together in the same scene. If this works, the embeddings should help models reason about context in tasks such as spotting objects in photos or generating images from text descriptions. The authors train these vectors on the Open Images dataset and test them on detection and synthesis problems.

Core claim

Obj-GloVe is a scene-based contextual embedding for visual objects obtained by applying the GloVe algorithm to co-occurrence counts of object classes in images. The method produces vector representations that reflect semantic and contextual relationships among objects, as shown through nearest-neighbor analysis and projections along semantic axes. These embeddings are then applied to improve object detection and text-to-image synthesis.
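
For reference, the objective that GloVe minimizes, as defined by Pennington et al. (2014), is written out below. Reading $X_{ij}$ as the number of images in which object classes $i$ and $j$ both appear, and $V$ as the number of object classes, is our assumption about how the method carries over; the paper's exact construction is not described here.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

The original GloVe paper used $x_{\max} = 100$ and $\alpha = 3/4$. Whether Obj-GloVe keeps those values is not stated in the material reviewed here.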

What carries the argument

Obj-GloVe embedding, which adapts the GloVe co-occurrence matrix factorization to object pairs appearing in the same image scene.
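
A minimal sketch of what this factorization could look like, assuming a small symmetric co-occurrence matrix and plain full-batch gradient descent instead of the AdaGrad updates used by the reference GloVe implementation. The function name `train_glove`, the toy counts, and all hyperparameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def train_glove(X, dim=8, epochs=200, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """Fit GloVe-style vectors to a co-occurrence matrix X of shape (n, n)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W = rng.normal(scale=0.1, size=(n, dim))  # object vectors
    C = rng.normal(scale=0.1, size=(n, dim))  # context vectors
    bw = np.zeros(n)                          # object biases
    bc = np.zeros(n)                          # context biases

    seen = X > 0
    log_x = np.where(seen, np.log(np.where(seen, X, 1.0)), 0.0)
    # GloVe weighting function; pairs that never co-occur get weight zero.
    f = np.where(seen, np.minimum((X / x_max) ** alpha, 1.0), 0.0)

    losses = []
    for _ in range(epochs):
        err = W @ C.T + bw[:, None] + bc[None, :] - log_x
        losses.append(float(np.sum(f * err ** 2)))
        g = f * err                           # weighted residuals
        grad_w = 2.0 * g @ C
        grad_c = 2.0 * g.T @ W
        W -= lr * grad_w
        C -= lr * grad_c
        bw -= lr * 2.0 * g.sum(axis=1)
        bc -= lr * 2.0 * g.sum(axis=0)
    # Summing the two vector sets follows the original GloVe paper.
    return W + C, losses

# Toy counts for four hypothetical classes (illustrative values only).
toy_counts = np.array([
    [0, 30, 2, 1],
    [30, 0, 3, 1],
    [2, 3, 0, 20],
    [1, 1, 20, 0],
], dtype=float)
vectors, losses = train_glove(toy_counts)
```

The loss history is returned so that a reader can check that the fit actually improves on their own counts before inspecting the vectors.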

If this is right

Embeddings improve accuracy in object detection when incorporated into models.
They enhance quality in text-to-image synthesis applications.
Dimensionality reduction and nearest neighbors reveal meaningful semantic groupings among objects.
Projecting vectors along semantic axes shows interpretable directions in the embedding space.
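
The two analyses in the last two points can be sketched as follows. This is an illustration under stated assumptions: the vectors are hand-made two-dimensional toys rather than trained Obj-GloVe embeddings, the helper names are hypothetical, and cosine similarity is our choice of neighbor metric.

```python
import numpy as np

def nearest_neighbors(vectors, names, query, k=3):
    """Return the k classes most cosine-similar to `query`, excluding itself."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[names.index(query)]
    order = [i for i in np.argsort(-sims) if names[i] != query]
    return [names[i] for i in order[:k]]

def axis_projection(vectors, names, pole_a, pole_b):
    """Project every vector onto the unit axis pointing from pole_a to pole_b."""
    axis = vectors[names.index(pole_b)] - vectors[names.index(pole_a)]
    axis = axis / np.linalg.norm(axis)
    return {name: float(vec @ axis) for name, vec in zip(names, vectors)}

# Hand-made toy vectors (hypothetical, not trained embeddings).
names = ["Animal", "Person", "Dog", "Hat"]
vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.2],
    [0.1, 0.8],
])

dog_neighbors = nearest_neighbors(vectors, names, "Dog", k=1)
projection = axis_projection(vectors, names, "Animal", "Person")
```

On this toy input, "Dog" lands on the Animal side of the axis and "Hat" on the Person side, which is the kind of pattern the Animal-Person projection figure is meant to show.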

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such embeddings could be precomputed once and reused across multiple vision tasks without retraining.
The success would imply that statistical co-occurrence is a sufficient signal for learning object context, similar to language.
Future work might combine these with language embeddings for multimodal tasks.

Load-bearing premise

Object co-occurrence statistics in image datasets contain enough structured information to support embeddings that capture useful context for downstream visual tasks.

What would settle it

Running an object detection model with and without the Obj-GloVe features on the same dataset and finding no difference in mean average precision would indicate the embeddings add no value.

Figures

Figures reproduced from arXiv: 1907.01478 by Canwen Xu, Chenliang Li, Zhenzhong Chen.

**Figure 2.** PCA visualization of Obj-GloVe. The labels are randomly sampled.

**Figure 3.** t-SNE visualization of the most common objects of Obj-GloVe.

**Figure 4.** Projections of Obj-GloVe on Animal-Person axis. The projections closer to Animal and Person are marked with red and green, respectively. The labels are randomly sampled.

Excerpt from the adjacent paper text: "… 7, for each annotated image in the validation set with more than one bounding box, we iteratively “mask” a box label from the ground truth and attempt to predict it with trained embedding. Note that the visual features are completely disca…"

**Figure 5.** Projections of Obj-GloVe on Man-Woman axis. We visualize the pole areas of two sides. The labels are randomly sampled.

**Figure 6.** Photos are slices of a full scene. Thus, Obj-GloVe can implicitly exploit spatial information across multiple …

**Figure 7.** A live example of masking. The labels in the image are iteratively masked to form a test sample. Thus, …

**Figure 8.** Scene generation with Obj-GloVe. The black text is the user input and the red one is “auto-completed” based …

**Figure 9.** Progressive scene generation with Obj-GloVe. The black text is the user input and the red one is “auto …

Original abstract

Recently, with the prevalence of large-scale image dataset, the co-occurrence information among classes becomes rich, calling for a new way to exploit it to facilitate inference. In this paper, we propose Obj-GloVe, a generic scene-based contextual embedding for common visual objects, where we adopt the word embedding method GloVe to exploit the co-occurrence between entities. We train the embedding on pre-processed Open Images V4 dataset and provide extensive visualization and analysis by dimensionality reduction and projecting the vectors along a specific semantic axis, and showcasing the nearest neighbors of the most common objects. Furthermore, we reveal the potential applications of Obj-GloVe on object detection and text-to-image synthesis, then verify its effectiveness on these two applications respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

Straightforward GloVe port to object co-occurrences in Open Images that shows some visualizations but supplies no numbers to support its claims of helping detection or synthesis.

The paper takes the GloVe algorithm and runs it on a co-occurrence matrix built from object classes that appear together in the same image in Open Images V4. The authors factor the matrix in the usual way to produce embeddings for the object classes. That is the core contribution: moving the method from text to scene-level visual statistics. They then reduce the dimensionality of the vectors and show nearest-neighbor lists for common objects plus projections onto a few semantic axes. Those visualizations are the only concrete output described. They look reasonable on the surface and could give a reader quick intuition about which objects tend to share scenes.

The rest of the abstract asserts that the embeddings help object detection and text-to-image synthesis and that the authors verified this, but it contains no metrics, baselines, ablation tables, or even a sentence on experimental setup. Without those details it is impossible to tell whether the vectors add anything beyond the raw co-occurrence frequencies already present in the dataset. The only modeling input is the co-occurrence counts themselves, so any downstream gain has to be demonstrated rather than assumed.

The paper is aimed at people who want ready-made object vectors to drop into detection or generation pipelines. A reader already working on contextual representations in vision might skim the neighbor lists for ideas, but the lack of quantitative support makes the work too thin for a serious referee process right now. If the full manuscript contains proper experiments with controls, that could change; on the evidence given here it does not.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Obj-GloVe, a scene-based contextual embedding for visual objects obtained by adapting the GloVe algorithm to factorize a global co-occurrence matrix of object classes extracted from scenes in the Open Images V4 dataset. It presents visualizations including dimensionality reduction, semantic axis projections, and nearest neighbor analysis, and asserts verification of effectiveness for object detection and text-to-image synthesis applications.

Significance. If the embeddings prove to capture relational semantics beyond dataset biases, this could provide a transferable representation of visual context for downstream vision tasks. The approach leverages large-scale annotation data in a manner analogous to textual embeddings, but the absence of quantitative validation in the abstract and the direct derivation from co-occurrence statistics make the practical utility uncertain without further evidence.

major comments (2)

[Abstract] The abstract claims to 'verify its effectiveness on these two applications respectively,' but no quantitative results, ablation studies, baseline comparisons, or error analysis are described to support the effectiveness claims for object detection and text-to-image synthesis.
The embeddings are constructed directly from co-occurrence counts in the dataset, so the captured 'context' is by definition the input statistics; downstream task gains would need independent validation on held-out data with comparisons to class priors, but this validation is not provided.
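
The comparison requested in the second comment could be set up along the following lines. This is a sketch, not the paper's protocol: it follows the masking idea described in the Figure 7 caption (hide one label, predict it from the rest), scores an embedding-based predictor against a class-prior baseline, and uses hand-made toy vectors and toy label sets. All function names are hypothetical, and excluding labels already present in the context is our assumption.

```python
import numpy as np

def masked_samples(images):
    """Yield (context_labels, masked_label) pairs from images with >1 label."""
    for labels in images:
        if len(labels) < 2:
            continue
        for i, target in enumerate(labels):
            yield labels[:i] + labels[i + 1:], target

def top1_accuracy(images, predict):
    """Fraction of masked labels that `predict` recovers from the context."""
    samples = list(masked_samples(images))
    hits = sum(predict(context) == target for context, target in samples)
    return hits / len(samples)

def embedding_predictor(vectors, names):
    """Predict the class whose vector is closest to the mean context vector."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    index = {name: i for i, name in enumerate(names)}
    def predict(context):
        rows = [index[c] for c in context]
        scores = unit @ unit[rows].mean(axis=0)
        scores[rows] = -np.inf  # assumption: never re-predict a context label
        return names[int(np.argmax(scores))]
    return predict

def prior_predictor(train_images, names):
    """Baseline: predict the most frequent training class not in the context."""
    counts = {name: 0 for name in names}
    for labels in train_images:
        for label in labels:
            counts[label] += 1
    ranked = sorted(names, key=lambda name: -counts[name])
    def predict(context):
        return next(name for name in ranked if name not in context)
    return predict

# Toy data, chosen only to exercise the code; it is not evidence of anything.
names = ["Person", "Dog", "Car", "Wheel"]
vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
train = [["Person", "Dog"], ["Person", "Car"], ["Person", "Wheel"], ["Car", "Wheel"]]
held_out = [["Person", "Dog"], ["Car", "Wheel"], ["Person"]]

acc_embedding = top1_accuracy(held_out, embedding_predictor(vectors, names))
acc_prior = top1_accuracy(held_out, prior_predictor(train, names))
```

Keeping the baseline behind the same `predict(context)` interface makes it easy to add further controls, such as a predictor that ranks classes by raw co-occurrence counts with the context.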

minor comments (1)

The description of the pre-processing of the Open Images V4 dataset lacks detail on how scenes are defined and how the co-occurrence matrix is constructed (e.g., window size, weighting).
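
To make the missing detail concrete, here is one plausible construction, written as a sketch under explicit assumptions: the whole image is treated as the context window, every unordered pair of distinct classes present in an image adds a unit weight, and repeated instances of a class in one image count once. None of these choices is confirmed by the paper; `cooccurrence_matrix` and the toy label sets are hypothetical.

```python
from itertools import combinations

import numpy as np

def cooccurrence_matrix(images, classes):
    """Count, for each class pair, the images in which both classes appear."""
    index = {name: i for i, name in enumerate(classes)}
    counts = np.zeros((len(classes), len(classes)))
    for labels in images:
        # Presence-based: duplicate boxes of one class count once per image.
        present = sorted({label for label in labels if label in index})
        for a, b in combinations(present, 2):
            counts[index[a], index[b]] += 1
            counts[index[b], index[a]] += 1
    return counts

# Toy annotations (hypothetical label sets, one list per image).
classes = ["Person", "Dog", "Ball", "Car"]
images = [
    ["Person", "Dog"],
    ["Person", "Dog", "Dog", "Ball"],
    ["Car"],
]
counts = cooccurrence_matrix(images, classes)
```

Alternatives a revised manuscript could specify include counting box instances rather than presence, or weighting pairs by the distance between boxes, which would play the role that window distance plays in textual GloVe.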

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and the opportunity to clarify the manuscript. We address the major comments point by point below.

Referee: [Abstract] The abstract claims to 'verify its effectiveness on these two applications respectively,' but no quantitative results, ablation studies, baseline comparisons, or error analysis are described to support the effectiveness claims for object detection and text-to-image synthesis.

Authors: We agree that the abstract's phrasing implies quantitative verification that is not present in the manuscript. The paper provides visualizations (dimensionality reduction, semantic axis projections, nearest neighbors) and qualitative illustrations of potential use in object detection and text-to-image synthesis, but contains no quantitative task results, ablations, or baseline comparisons. We will revise the abstract to state that we demonstrate the embeddings and illustrate their potential applications, removing the claim of verified effectiveness. revision: yes

Referee: The embeddings are constructed directly from co-occurrence counts in the dataset, so the captured 'context' is by definition the input statistics; downstream task gains would need independent validation on held-out data with comparisons to class priors, but this validation is not provided.

Authors: The embeddings are obtained by factorizing the co-occurrence matrix with the GloVe objective, which is designed to produce dense vectors that encode relational structure beyond raw frequencies (as in textual GloVe). Nevertheless, we acknowledge that the manuscript does not supply held-out quantitative evaluations or comparisons against simple class priors or other baselines. We will add such validation experiments in the revised manuscript to address this concern. revision: yes

Circularity Check

0 steps flagged

Direct application of GloVe to object co-occurrence matrix exhibits no circularity

The paper explicitly adopts the established GloVe algorithm to factorize a co-occurrence matrix built from Open Images V4 scenes. This produces embeddings whose geometry reflects the input statistics by design of the method, but the paper makes no additional claims that a 'prediction' or 'first-principles result' reduces to its inputs beyond the standard GloVe construction. No self-citations, uniqueness theorems, or fitted parameters renamed as predictions appear in the abstract or described chain. Effectiveness on downstream tasks is asserted separately and would require external validation, but the derivation itself is self-contained and non-circular.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Based solely on abstract, the method rests on treating visual co-occurrence as directly transferable from textual co-occurrence without additional evidence or justification provided.

free parameters (1)

GloVe training hyperparameters (vector dimension, window size)
Standard but unspecified parameters that control the resulting embedding

axioms (1)

Domain assumption: Object co-occurrence statistics in images are sufficiently analogous to word co-occurrence in text to yield useful contextual embeddings
Invoked when adopting GloVe for visual objects

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 1 internal anchor

[1] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543. ACL, 2014.

[2] Tomasz Malisiewicz and Alexei A. Efros. Beyond categories: The visual memex model for reasoning about object relationships. In NIPS, pages 1222–1230. Curran Associates, Inc., 2009.

[3] Antonio Torralba, Kevin P. Murphy, and William T. Freeman. Using the forest to see the trees: exploiting context for visual object detection and localization. Commun. ACM, 53(3):107–114, 2010.

[4] Andrew Rabinovich, Andrea Vedaldi, Carolina Galleguillos, Eric Wiewiora, and Serge J. Belongie. Objects in context. In ICCV, pages 1–8. IEEE Computer Society, 2007.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.

[6] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[7] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982, 2018.

[8] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV (5), volume 8693 of Lecture Notes in Computer Science, pages 740–755. Springer, 2014.

[9] Mark Everingham, S. M. Ali Eslami, Luc J. Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.

[10] Siwei Lai, Kang Liu, Shizhu He, and Jun Zhao. How to generate a good word embedding. IEEE Intelligent Systems, 31(6):5–14, 2016.

[11] Santosh Kumar Divvala, Derek Hoiem, James Hays, Alexei A. Efros, and Martial Hebert. An empirical study of context in object detection. In CVPR, pages 1271–1278. IEEE Computer Society, 2009.

[12] Zheng Song, Qiang Chen, ZhongYang Huang, Yang Hua, and Shuicheng Yan. Contextualizing object detection and classification. In CVPR, pages 1585–1592. IEEE Computer Society, 2011.

[13] Xinlei Chen and Abhinav Gupta. Spatial memory for context reasoning in object detection. In ICCV, pages 4106–4116. IEEE Computer Society, 2017.

[14] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In CVPR, pages 3588–3597. IEEE Computer Society, 2018.

[15] Yong Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Structure inference net: Object detection using scene-level context and instance-level relationships. In CVPR, pages 6985–6994. IEEE Computer Society, 2018.

[16] Minh C. Phan, Aixin Sun, Yi Tay, Jialong Han, and Chenliang Li. Neupl: Attention-based semantic matching and pair-linking for entity disambiguation. In CIKM, pages 1667–1676. ACM, 2017.

[17] Minh C. Phan, Aixin Sun, Yi Tay, Jialong Han, and Chenliang Li. Pair-linking for collective entity disambiguation: Two could be better than all. CoRR, abs/1802.01074, 2018.

[18] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, pages 1316–… IEEE Computer Society, 2018.

[20] Elman Mansimov, Emilio Parisotto, Lei Jimmy Ba, and Ruslan Salakhutdinov. Generating images from captions with attention. In ICLR, 2016.

[21] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, pages 3510–3520. IEEE Computer Society, 2017.

[22] Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 1060–1069. JMLR.org, 2016.

[24] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. In NIPS, pages 932–938. MIT Press, 2000.

[25] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR (Workshop), 2013.

[26] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.

[27] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990.

[28] Kevin Lund and Curt Burgess. Producing high-dimensional semantic space from lexical co-occurrence. Behavior Research Methods Instruments and Computers, 28:203–208, 06 1996.

[29] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823. IEEE Computer Society, 2015.

[30] Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. Visual translation embedding network for visual relation detection. In CVPR, pages 3107–3115. IEEE Computer Society, 2017.

[31] Hai Wan, Yonghao Luo, Bo Peng, and Wei-Shi Zheng. Representation learning for scene graph completion via jointly structural and visual embedding. In IJCAI, pages 949–956. ijcai.org, 2018.

[32] Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010.

[33] Myung Jin Choi, Antonio Torralba, and Alan S. Willsky. A tree-based context model for object recognition. IEEE Trans. Pattern Anal. Mach. Intell., 34(2):240–252, 2012.

[34] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan L. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, pages 891–898. IEEE Computer Society, 2014.

[35] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.

[36] Hao Dong, Simiao Yu, Chao Wu, and Yike Guo. Semantic image synthesis via adversarial learning. In ICCV, pages 5707–5715. IEEE Computer Society, 2017.

[37] Han Zhang, Tao Xu, and Hongsheng Li. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, pages 5908–5916. IEEE Computer Society, 2017.

[38] Shuang Ma, Jianlong Fu, Chang Wen Chen, and Tao Mei. DA-GAN: instance-level image translation by deep attention generative adversarial networks. In CVPR, pages 5657–5666. IEEE Computer Society, 2018.

[39] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[40] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: online learning of social representations. In KDD, pages 701–710. ACM, 2014.
