arxiv: 1907.01478 · v1 · pith:7J7SGR6Enew · submitted 2019-07-02 · 💻 cs.CV · cs.LG

Obj-GloVe: Scene-Based Contextual Object Embedding

Pith reviewed 2026-05-25 10:52 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords contextual object embedding · GloVe adaptation · scene co-occurrence · object detection · text-to-image synthesis · visual embeddings · Open Images

The pith

Treating object co-occurrences in images like word co-occurrences produces useful contextual embeddings for visual objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes to build embeddings for common visual objects by applying the GloVe method, originally for words, to their co-occurrence statistics across scenes in a large image collection. This creates vectors that encode how objects typically appear together in the same scene. If this works, the embeddings should help models reason about context in tasks such as spotting objects in photos or generating images from text descriptions. The authors train these vectors on the Open Images dataset and test them on detection and synthesis problems.

Core claim

Obj-GloVe is a scene-based contextual embedding for visual objects obtained by applying the GloVe algorithm to co-occurrence counts of object classes in images. The method produces vector representations that reflect semantic and contextual relationships among objects, as shown through nearest-neighbor analysis and projections along semantic axes. These embeddings are then applied to improve object detection and text-to-image synthesis.
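
As a rough illustration of what "co-occurrence counts of object classes in images" could mean, the sketch below counts how often two classes share an image. The function name `cooccurrence_counts`, the toy annotations, and the rule that a pair counts once per image are all assumptions made for illustration; the Figure 1 caption suggests the authors' pre-processing also uses bounding-box centers, which this sketch does not model.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(images):
    """Count how often each unordered pair of distinct classes shares an image.

    `images` is an iterable of label lists, one list per image.
    Assumption: a pair counts once per image, however many boxes carry a label.
    """
    counts = Counter()
    for labels in images:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical toy annotations, not taken from Open Images.
toy_images = [
    ["Person", "Dog", "Tree"],
    ["Person", "Dog"],
    ["Car", "Tree", "Person", "Person"],
]
counts = cooccurrence_counts(toy_images)
print(counts[("Dog", "Person")])   # 2
print(counts[("Person", "Tree")])  # 2
print(counts[("Car", "Dog")])      # 0 (never seen together)
```

Keys are sorted tuples, so lookups must use alphabetical order within the pair.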

What carries the argument

Obj-GloVe embedding, which adapts the GloVe co-occurrence matrix factorization to object pairs appearing in the same image scene.
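
A minimal sketch of what such a factorization might look like, assuming pair counts are already available as a dictionary keyed by class pairs. The function name `train_glove`, the toy counts, and every hyperparameter value are illustrative assumptions. Plain SGD stands in for the AdaGrad optimizer of the reference GloVe implementation, and nothing here reproduces the authors' actual training setup.

```python
import math
import random


def train_glove(counts, dim=8, epochs=200, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """Fit GloVe-style vectors to a dict {(a, b): count} with plain SGD.

    Assumption: plain SGD replaces AdaGrad, and all defaults are illustrative.
    Returns the embeddings and the weighted squared error per epoch.
    """
    rng = random.Random(seed)
    vocab = sorted({label for pair in counts for label in pair})

    def init():
        return {v: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for v in vocab}

    w, w_ctx = init(), init()            # main and context vectors
    bias = {v: 0.0 for v in vocab}       # main biases
    bias_ctx = {v: 0.0 for v in vocab}   # context biases

    # Symmetrize: each unordered pair is visited in both directions.
    pairs = [(i, j, x) for (a, c), x in counts.items() if x > 0
             for i, j in ((a, c), (c, a))]
    losses = []
    for _ in range(epochs):
        rng.shuffle(pairs)
        total = 0.0
        for i, j, x in pairs:
            # GloVe weighting: damp rare pairs, cap frequent ones at 1.
            weight = (x / x_max) ** alpha if x < x_max else 1.0
            dot = sum(p * q for p, q in zip(w[i], w_ctx[j]))
            diff = dot + bias[i] + bias_ctx[j] - math.log(x)
            total += weight * diff * diff
            g = lr * weight * diff       # gradient factor (constant 2 folded into lr)
            for k in range(dim):
                wi, wj = w[i][k], w_ctx[j][k]
                w[i][k] -= g * wj
                w_ctx[j][k] -= g * wi
            bias[i] -= g
            bias_ctx[j] -= g
        losses.append(total)

    # As in GloVe, the final embedding is the sum of main and context vectors.
    vectors = {v: [p + q for p, q in zip(w[v], w_ctx[v])] for v in vocab}
    return vectors, losses


# Hypothetical counts for illustration only.
toy_counts = {
    ("Dog", "Person"): 40,
    ("Person", "Tree"): 25,
    ("Car", "Person"): 15,
    ("Car", "Tree"): 10,
    ("Dog", "Tree"): 5,
}
vectors, losses = train_glove(toy_counts)
print(len(vectors["Dog"]), losses[0], losses[-1])
```

The per-epoch loss is returned so that convergence can be checked before the vectors are used downstream.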

If this is right

  • Embeddings improve accuracy in object detection when incorporated into models.
  • They enhance quality in text-to-image synthesis applications.
  • Dimensionality reduction and nearest neighbors reveal meaningful semantic groupings among objects.
  • Projecting vectors along semantic axes shows interpretable directions in the embedding space.
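
The last two points can be sketched in a few lines. The example below assumes embeddings are stored as a dictionary from label to vector; the function names and the hand-picked 2-D vectors are hypothetical and only illustrate the mechanics of cosine nearest neighbors and of projecting onto an axis defined by two pole labels.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_neighbors(vectors, query, k=2):
    """Return the k labels most cosine-similar to `query`, excluding itself."""
    scored = [(cosine(vectors[query], vec), label)
              for label, vec in vectors.items() if label != query]
    return [label for _, label in sorted(scored, reverse=True)[:k]]


def project_on_axis(vectors, pole_a, pole_b):
    """Scalar projection of every vector onto the unit axis pole_b - pole_a."""
    axis = [b - a for a, b in zip(vectors[pole_a], vectors[pole_b])]
    length = math.sqrt(sum(c * c for c in axis))
    return {label: sum(x * c for x, c in zip(vec, axis)) / length
            for label, vec in vectors.items()}


# Hypothetical 2-D vectors chosen by hand for illustration only.
toy = {
    "Animal": [1.0, 0.0],
    "Person": [0.0, 1.0],
    "Dog":    [0.9, 0.2],
    "Hat":    [0.1, 0.8],
}
print(nearest_neighbors(toy, "Dog", k=1))   # ['Animal']
proj = project_on_axis(toy, "Animal", "Person")
print(proj["Dog"] < proj["Hat"])            # True: Dog sits nearer the Animal pole
```

Larger projection values lie closer to `pole_b`, so sorting the returned dictionary by value orders labels along the axis.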

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such embeddings could be precomputed once and reused across multiple vision tasks without retraining.
  • If the embeddings do help, that would suggest statistical co-occurrence alone is a sufficient signal for learning object context, as it is for words in language.
  • Future work might combine these with language embeddings for multimodal tasks.

Load-bearing premise

Object co-occurrence statistics in image datasets contain enough structured information to support embeddings that capture useful context for downstream visual tasks.

What would settle it

Running an object detection model with and without the Obj-GloVe features on the same dataset and finding no difference in mean average precision would indicate the embeddings add no value.

Figures

Figures reproduced from arXiv: 1907.01478 by Canwen Xu, Chenliang Li, Zhenzhong Chen.

Figure 1: The procedure of data pre-processing. (1)(2) We calculate the center of bounding boxes. (3)(4) Then we …
Figure 2: PCA visualization of Obj-GloVe. The labels are randomly sampled.
Figure 3: t-SNE visualization of the most common objects of Obj-GloVe.
Figure 4: Projections of Obj-GloVe on Animal-Person axis. The projections closer to Animal and Person are marked with red and green, respectively. The labels are randomly sampled.
… 7, for each annotated image in the validation set with more than one bounding box, we iteratively “mask” a box label from the ground truth and attempt to predict it with trained embedding. Note that the visual features are completely disca…
Figure 5: Projections of Obj-GloVe on Man-Woman axis. We visualize the pole areas of two sides. The labels are randomly sampled.
Figure 6: Photos are slices of a full scene. Thus, Obj-GloVe can implicitly exploit spatial information across multiple …
Figure 7: A live example of masking. The labels in the image are iteratively masked to form a test sample. Thus, …
Figure 8: Scene generation with Obj-GloVe. The black text is the user input and the red one is “auto-completed” based …
Figure 9: Progressive scene generation with Obj-GloVe. The black text is the user input and the red one is “auto …
Original abstract

Recently, with the prevalence of large-scale image dataset, the co-occurrence information among classes becomes rich, calling for a new way to exploit it to facilitate inference. In this paper, we propose Obj-GloVe, a generic scene-based contextual embedding for common visual objects, where we adopt the word embedding method GloVe to exploit the co-occurrence between entities. We train the embedding on pre-processed Open Images V4 dataset and provide extensive visualization and analysis by dimensionality reduction and projecting the vectors along a specific semantic axis, and showcasing the nearest neighbors of the most common objects. Furthermore, we reveal the potential applications of Obj-GloVe on object detection and text-to-image synthesis, then verify its effectiveness on these two applications respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Obj-GloVe, a scene-based contextual embedding for visual objects obtained by adapting the GloVe algorithm to factorize a global co-occurrence matrix of object classes extracted from scenes in the Open Images V4 dataset. It presents visualizations including dimensionality reduction, semantic axis projections, and nearest neighbor analysis, and asserts verification of effectiveness for object detection and text-to-image synthesis applications.

Significance. If the embeddings prove to capture relational semantics beyond dataset biases, this could provide a transferable representation of visual context for downstream vision tasks. The approach leverages large-scale annotation data in a manner analogous to textual embeddings, but the absence of quantitative validation in the abstract and the direct derivation from co-occurrence statistics make the practical utility uncertain without further evidence.

major comments (2)
  1. [Abstract] The abstract claims to 'verify its effectiveness on these two applications respectively,' but no quantitative results, ablation studies, baseline comparisons, or error analysis are described to support the effectiveness claims for object detection and text-to-image synthesis.
  2. The embeddings are constructed directly from co-occurrence counts in the dataset, so the captured 'context' is by definition the input statistics; downstream task gains would need independent validation on held-out data with comparisons to class priors, but this validation is not provided.
minor comments (1)
  1. The description of the pre-processing of the Open Images V4 dataset lacks detail on how scenes are defined and how the co-occurrence matrix is constructed (e.g., window size, weighting).
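
One possible shape for the held-out comparison against class priors, using the label-masking protocol that the Figure 7 caption describes (mask one label, predict it from the rest). The scoring rule (cosine similarity to the mean context vector), the exclusion of context labels from the candidates, the function names, and the toy data are all assumptions for illustration, not the paper's evaluation.

```python
import math
from collections import Counter


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predict_masked(vectors, context_labels):
    """Guess the masked label as the class closest to the mean context vector."""
    dim = len(next(iter(vectors.values())))
    mean = [sum(vectors[c][k] for c in context_labels) / len(context_labels)
            for k in range(dim)]
    # Assumption: labels already visible in the context are not candidates.
    candidates = [label for label in vectors if label not in context_labels]
    return max(candidates, key=lambda label: cosine(vectors[label], mean))


def prior_label(train_images):
    """Most frequent training label, used as a class-prior baseline."""
    freq = Counter(label for labels in train_images for label in labels)
    return freq.most_common(1)[0][0]


def masked_accuracy(images, predict):
    """Mask each distinct label in turn and score predict(context) against it."""
    hits = total = 0
    for labels in images:
        unique = sorted(set(labels))
        if len(unique) < 2:
            continue  # nothing left to serve as context
        for masked in unique:
            context = [label for label in unique if label != masked]
            hits += predict(context) == masked
            total += 1
    return hits / total


# Hypothetical vectors and annotations for illustration only.
toy_vectors = {
    "Dog": [0.9, 0.2], "Leash": [0.8, 0.3],
    "Car": [0.1, 0.9], "Wheel": [0.2, 0.8],
}
train = [["Car", "Wheel"], ["Car"], ["Dog", "Leash"]]
held_out = [["Dog", "Leash"], ["Car", "Wheel"]]

most_common = prior_label(train)
print(masked_accuracy(held_out, lambda ctx: predict_masked(toy_vectors, ctx)))  # 1.0
print(masked_accuracy(held_out, lambda ctx: most_common))                       # 0.25
```

Under this setup the embeddings would only count as adding value if they beat the prior baseline on images that were not used to build the co-occurrence counts.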

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and the opportunity to clarify the manuscript. We address the major comments point by point below.

  1. Referee: [Abstract] The abstract claims to 'verify its effectiveness on these two applications respectively,' but no quantitative results, ablation studies, baseline comparisons, or error analysis are described to support the effectiveness claims for object detection and text-to-image synthesis.

    Authors: We agree that the abstract's phrasing implies quantitative verification that is not present in the manuscript. The paper provides visualizations (dimensionality reduction, semantic axis projections, nearest neighbors) and qualitative illustrations of potential use in object detection and text-to-image synthesis, but contains no quantitative task results, ablations, or baseline comparisons. We will revise the abstract to state that we demonstrate the embeddings and illustrate their potential applications, removing the claim of verified effectiveness. revision: yes

  2. Referee: The embeddings are constructed directly from co-occurrence counts in the dataset, so the captured 'context' is by definition the input statistics; downstream task gains would need independent validation on held-out data with comparisons to class priors, but this validation is not provided.

    Authors: The embeddings are obtained by factorizing the co-occurrence matrix with the GloVe objective, which is designed to produce dense vectors that encode relational structure beyond raw frequencies (as in textual GloVe). Nevertheless, we acknowledge that the manuscript does not supply held-out quantitative evaluations or comparisons against simple class priors or other baselines. We will add such validation experiments in the revised manuscript to address this concern. revision: yes
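
For reference, the GloVe objective mentioned in the response above, as given by Pennington et al. (2014), can be written as follows. Reading $X_{ij}$ as the co-occurrence count of object classes $i$ and $j$ and $V$ as the number of classes is an interpretation of the setup described here, not a formula quoted from the Obj-GloVe paper. The original GloVe paper reports $x_{\max} = 100$ and $\alpha = 3/4$ for text; the values used for Obj-GloVe are not stated in this review.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

Here $w_i$ and $\tilde{w}_j$ are the main and context vectors, and $b_i$ and $\tilde{b}_j$ their biases. The log-count target and the weighting $f$ are what separate the fitted vectors from a table of raw frequencies, which is the distinction the response relies on.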

Circularity Check

0 steps flagged

Direct application of GloVe to object co-occurrence matrix exhibits no circularity

The paper explicitly adopts the established GloVe algorithm to factorize a co-occurrence matrix built from Open Images V4 scenes. This produces embeddings whose geometry reflects the input statistics by design of the method, but the paper makes no additional claims that a 'prediction' or 'first-principles result' reduces to its inputs beyond the standard GloVe construction. No self-citations, uniqueness theorems, or fitted parameters renamed as predictions appear in the abstract or described chain. Effectiveness on downstream tasks is asserted separately and would require external validation, but the derivation itself is self-contained and non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based solely on the abstract, the method rests on treating visual co-occurrence as directly transferable from textual co-occurrence, with no additional evidence or justification provided.

free parameters (1)
  • GloVe training hyperparameters (vector dimension, window size)
    Standard but unspecified parameters that control the resulting embedding
axioms (1)
  • domain assumption: Object co-occurrence statistics in images are sufficiently analogous to word co-occurrence in text to yield useful contextual embeddings
    Invoked when adopting GloVe for visual objects
