Informative Image Captioning with External Sources of Information

Piyush Sharma; Radu Soricut; Sanqiang Zhao; Tomer Levinboim

arxiv: 1906.08876 · v1 · pith:I7D2QFWFnew · submitted 2019-06-20 · 💻 cs.CL · cs.CV

Pith reviewed 2026-05-25 19:22 UTC · model grok-4.3

keywords: image captioning · informative captions · entity labels · transformer model · multimodal learning · external knowledge · vision language models

The pith

A multi-encoder Transformer integrates external fine-grained entity labels with image features to produce fluent yet informative image captions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current image captioning models tend to use only common object names and miss fine-grained details about entities and their interactions. This paper proposes a mechanism to integrate image information with external fine-grained entity labels assumed to come from upstream models. It introduces a multimodal multi-encoder Transformer that ingests both visual features and these labels. The model learns to control when the specific labels appear in the output. This matters because it enables captions that convey precise information while remaining natural and readable.

Core claim

The paper establishes that a multimodal, multi-encoder Transformer model can take in both image features and multiple sources of entity labels and learn to selectively include those labels in the generated text, producing captions that describe the image fluently while mentioning informative, fine-grained entities.

What carries the argument

A multimodal, multi-encoder Transformer that ingests image features together with entity labels supplied by upstream models.
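To make the multi-encoder idea concrete, here is a minimal sketch of a decoder state attending over two separately encoded streams. It is an illustration under stated assumptions, not the paper's implementation: the fusion by concatenating encoder outputs along the sequence axis, the toy vectors, and the names `attend`, `image_memory`, and `label_memory` are all hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, memory):
    """Scaled dot-product attention of one query vector over memory vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, m) / scale for m in memory])
    dim = len(memory[0])
    context = [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(dim)]
    return context, weights

# Toy encoder outputs (hypothetical values); each source has its own stream.
image_memory = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # two image regions
label_memory = [[0.0, 0.0, 1.0, 0.0]]                         # one entity label

# Assumed fusion: concatenate the encoder outputs along the sequence axis.
memory = image_memory + label_memory

# A decoder state that happens to align with the label stream.
decoder_state = [0.0, 0.0, 2.0, 0.0]
context, weights = attend(decoder_state, memory)

# Share of attention that falls on the label stream at this decoding step.
label_mass = sum(weights[len(image_memory):])
```

In this toy setup, `label_mass` shows how much a decoding step draws on the external labels rather than the image, which is one way to picture the model deciding when a label should surface in the caption.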

If this is right

Captions can include fine-grained entity mentions beyond common object names.
The model can control the use of external labels to maintain fluency.
Multiple external sources of entity information can be combined within one architecture.
Generated descriptions become both fluent and more specific about entities and interactions.
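One hypothetical way to check the first consequence above is to count how many of the supplied entity labels actually appear in a generated caption. The function `label_coverage` below is an assumed, deliberately crude proxy (case-insensitive substring matching), not a metric taken from the paper.

```python
def label_coverage(caption, labels):
    """Fraction of labels that occur in the caption (crude substring match).

    Hypothetical proxy for informativeness; substring matching can
    over-count, e.g. the label "art" would match inside "party".
    """
    if not labels:
        return 0.0
    # Lowercase and collapse whitespace so matching is not layout-sensitive.
    text = " ".join(caption.lower().split())
    hits = sum(1 for label in labels if " ".join(label.lower().split()) in text)
    return hits / len(labels)

score = label_coverage(
    "The Eiffel Tower at night in Paris",
    ["eiffel tower", "london"],
)
```

A token-level or entity-linking match would be more reliable than substring search; the sketch only shows the shape of the measurement.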

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same integration approach could be tested on video captioning or visual question answering where external labels are available.
Gains would scale with improvements in the accuracy of the upstream entity labelers.
The method might be evaluated on images containing rare or domain-specific objects to quantify the informativeness lift.

Load-bearing premise

The method relies on the assumption that upstream models provide fine-grained entity labels that are accurate enough to be used as given.

What would settle it

Running the model with deliberately noisy or incorrect entity labels and observing no gain in informativeness or a drop in fluency would falsify the central claim.
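The check described above needs a controlled way to degrade the label inputs. The helper below is a minimal sketch of that step, assuming a simple replacement noise model; `corrupt_labels`, the example vocabulary, and the `noise_rate` parameter are hypothetical and do not come from the paper.

```python
import random

def corrupt_labels(labels, vocabulary, noise_rate, seed=0):
    """Replace each label with a different vocabulary entry with probability noise_rate.

    Assumed noise model for a robustness probe: independent per-label
    substitution. Real upstream errors may be correlated or systematic.
    """
    rng = random.Random(seed)  # seeded for reproducible corruption
    corrupted = []
    for label in labels:
        if rng.random() < noise_rate:
            alternatives = [v for v in vocabulary if v != label]
            # Fall back to the original label if no alternative exists.
            corrupted.append(rng.choice(alternatives) if alternatives else label)
        else:
            corrupted.append(label)
    return corrupted

vocabulary = ["eiffel tower", "paris", "golden gate bridge", "london"]
clean = ["eiffel tower", "paris"]
unchanged = corrupt_labels(clean, vocabulary, noise_rate=0.0)
fully_noisy = corrupt_labels(clean, vocabulary, noise_rate=1.0)
```

Sweeping `noise_rate` from 0 to 1 and tracking caption quality at each level would show how quickly any benefit from the labels erodes.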

Figures

Figures reproduced from arXiv: 1906.08876 by Piyush Sharma, Radu Soricut, Sanqiang Zhao, Tomer Levinboim.

**Figure 2.** A multi-encoder Transformer Network pro…

**Figure 3.** Learnable representations for the Object la…

**Figure 4.** Learnable representations for the Web Entity…

**Figure 5.** The Image Transformer Encoder (left side) is…

**Figure 6.** Sample outputs for various model configurations for two images and their additional label inputs. In both…

**Figure 7.** Interface for the human evaluation.

**Figure 8.** Qualitative results comparing baseline captions (…

Original abstract

An image caption should fluently present the essential information in a given image, including informative, fine-grained entity mentions and the manner in which these entities interact. However, current captioning models are usually trained to generate captions that only contain common object names, thus falling short on an important "informativeness" dimension. We present a mechanism for integrating image information together with fine-grained labels (assumed to be generated by some upstream models) into a caption that describes the image in a fluent and informative manner. We introduce a multimodal, multi-encoder model based on Transformer that ingests both image features and multiple sources of entity labels. We demonstrate that we can learn to control the appearance of these entity labels in the output, resulting in captions that are both fluent and informative.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds a multi-encoder Transformer to control insertion of external entity labels in image captions, but the gains rest on untested upstream label quality.

The main contribution is a Transformer with separate encoders for the image and for multiple external fine-grained entity label sources. The model learns to decide which labels to weave into the caption while keeping the output fluent. This directly targets the common problem that standard captioners only name generic objects and miss specific entities or relations that an upstream detector might provide. The control aspect is the part that feels new relative to prior captioning work that simply concatenates extra inputs. The architecture choice makes sense for keeping the different signals distinct during encoding.

The paper does a clear job of stating the informativeness gap and showing a mechanism that can condition on label presence. That framing is useful even if the execution details are incremental.

The central weakness is the assumption that the external labels arrive clean and complete. The work treats those labels as reliable inputs rather than testing what happens when the upstream models make mistakes or miss entities. Without robustness checks or ablations on label noise, it is hard to know whether the control mechanism survives real deployment conditions. The abstract also gives no quantitative results or baseline comparisons, so the size of any improvement stays unknown from the provided text.

This paper is for people already working on controllable or knowledge-augmented captioning. A reader who needs a concrete architecture for handling several label streams could extract the design pattern. It is not foundational enough to change how most groups approach the task, but the idea is coherent enough that a serious referee should see the full experiments and ablations before deciding on acceptance.

Referee Report

0 major / 1 minor

Summary. The paper introduces a multimodal multi-encoder Transformer model that ingests image features together with fine-grained entity labels from external upstream sources. It claims to enable controllable inclusion of these labels in the generated captions while maintaining fluency, thereby producing more informative image descriptions than standard models limited to common object names.

Significance. If the experimental results hold, the work addresses a clear limitation in current image captioning systems by incorporating external fine-grained information in a controllable manner. The multi-encoder design and controllability mechanism represent a practical advance for generating captions that better reflect entity interactions and details.

minor comments (1)

The abstract states that labels are 'assumed to be generated by some upstream models' but provides no details on how label noise or missing labels would affect the integration and controllability; this external dependency should be quantified in experiments.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and for recognizing the potential significance of incorporating external fine-grained entity labels into image captioning via a controllable multi-encoder Transformer. No specific major comments were provided in the report, so we have no individual points to address.

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an empirical multimodal Transformer architecture that ingests image features plus externally supplied fine-grained entity labels (explicitly assumed to come from upstream models) and learns controllable inclusion of those labels. No equations, derivations, or parameter-fitting steps are presented that would reduce any claimed prediction or result to the inputs by construction. The central claim is supported by experimental demonstration rather than a self-referential chain, and the provided abstract and context contain no self-citation load-bearing premises or ansatz smuggling. This is the expected outcome for a non-derivational applied ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated assumption that upstream label generators exist and produce usable input.

pith-pipeline@v0.9.0 · 5663 in / 979 out tokens · 13731 ms · 2026-05-25T19:22:06.405543+00:00 · methodology

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 8 internal anchors

[3] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and VQA. In CVPR.

[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[6] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2014. Long-term recurrent convolutional networks for visual recognition and description. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

[8] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, et al. 2015. From captions to visual concepts and back. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.

[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

[11] Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR.

[12] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2015. Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics.

[14] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: common objects in context. CoRR, abs/1405.0312.

[15] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2017. Optimization of image description metrics using policy gradient methods. In International Conference on Computer Vision (ICCV).

[16] Di Lu, Spencer Whitehead, Lifu Huang, Heng Ji, and Shih-Fu Chang. 2018. Entity-aware image caption generation. arXiv preprint arXiv:1804.07889.

[17] Tong Niu and Mohit Bansal. 2018. Polite dialogue generation without parallel data. arXiv preprint arXiv:1805.03162.

[18] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732.

[19] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

[20] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2556–2565.

[21] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.

[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

[23] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.

[24] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164.

[25] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proc. of the 32nd International Conference on Machine Learning (ICML).

[26] Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen. 2016. Review networks for caption generation. In NIPS.

[27] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2017. Boosting image captioning with attributes. In IEEE International Conference on Computer Vision, ICCV, pages 22–29.

[28] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651–4659.
