arxiv: 1906.12188 · v1 · pith:664A4AQHnew · submitted 2019-06-26 · 💻 cs.CV · cs.AI · cs.CL

A Deep Decoder Structure Based on WordEmbedding Regression for An Encoder-Decoder Based Model for Image Captioning

Pith reviewed 2026-05-25 15:41 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords image captioning · encoder-decoder model · word embedding regression · semantic attention · MS-COCO dataset · long caption generation · deep decoder

The pith

Training caption decoders to regress word embeddings rather than maximize next-word likelihood produces longer, higher-scoring descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces the usual log-likelihood objective in encoder-decoder image captioning models with a regression loss that forces the decoder to predict the embedding vector of the next word. This change is presented as a way to capture longer-range dependencies without adding memory cells. A semantic attention layer is added that conditions image attention on the meaning of the word just produced. On the MS-COCO test set the resulting model records CIDEr 125.0 and BLEU-4 50.5, exceeding the best prior figures of 117.1 and 48.0, with the largest gains on longer captions.

Core claim

A decoder trained to regress the word embedding of the next token given prior tokens, together with an attention mechanism that routes image features according to the semantic content of the last generated word, extracts long-term information and yields longer, more detailed captions without external memory augmentation.

What carries the argument

Word-embedding regression loss that replaces next-word log-likelihood training, paired with semantic attention conditioned on the meaning of the previously generated word.
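
To make the contrast concrete, here is a minimal NumPy sketch of the two training signals. It is an illustration under stated assumptions, not the authors' implementation: this page does not give the exact regression loss, so mean squared error is assumed, and the function names, the toy embedding table, and the cosine-similarity lookup used to turn a predicted vector back into a word are all hypothetical.

```python
import numpy as np

def cross_entropy_loss(logits, target_index):
    """Standard next-word objective: negative log-likelihood of the target word."""
    shifted = logits - logits.max()                      # stabilize the softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target_index])

def embedding_regression_loss(predicted_vec, target_vec):
    """Assumed regression objective: mean squared error between embeddings."""
    return float(np.mean((predicted_vec - target_vec) ** 2))

def nearest_word(predicted_vec, embedding_table):
    """Map a predicted embedding back to a word index by cosine similarity."""
    norms = np.linalg.norm(embedding_table, axis=1) * np.linalg.norm(predicted_vec)
    sims = embedding_table @ predicted_vec / np.maximum(norms, 1e-12)
    return int(np.argmax(sims))

# Toy vocabulary of 4 words with 3-dimensional embeddings (hypothetical values).
embedding_table = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
])
predicted = np.array([0.1, 0.9, 0.05])   # decoder output living in embedding space
target_index = 1

loss = embedding_regression_loss(predicted, embedding_table[target_index])
word = nearest_word(predicted, embedding_table)
```

In this sketch the regression decoder emits a vector whose size is the embedding dimension rather than the vocabulary size, so a near miss in embedding space is penalized less than a prediction far from the target word.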

If this is right

  • Longer fine-grained captions become feasible without adding LSTM memory cells or external storage.
  • Generated words receive importance weighting during decoding because the regression objective operates on continuous embeddings.
  • Attention points can be steered by semantic content of prior words rather than solely by visual features.
  • The same decoder structure can be applied to other sequence tasks where long-range consistency matters.
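
The idea of steering attention by the previous word can be sketched as follows. This is a hedged illustration only: the page does not describe the scoring function, so a Bahdanau-style additive score is assumed, and the parameter names (`W_img`, `W_word`, `v`) and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def semantic_attention(region_feats, prev_word_vec, W_img, W_word, v):
    """Additive attention whose scores depend on the previous word's embedding.

    region_feats:  (R, D) image region features
    prev_word_vec: (E,)   embedding of the last generated word
    W_img (D, H), W_word (E, H), v (H,) are hypothetical learned parameters.
    """
    hidden = np.tanh(region_feats @ W_img + prev_word_vec @ W_word)  # (R, H)
    weights = softmax(hidden @ v)                                    # (R,) sums to 1
    context = weights @ region_feats                                 # (D,) weighted image summary
    return context, weights

# Random stand-ins for image features and parameters, for shape checking only.
rng = np.random.default_rng(0)
R, D, E, H = 5, 8, 4, 6
region_feats = rng.normal(size=(R, D))
W_img = rng.normal(size=(D, H))
W_word = rng.normal(size=(E, H))
v = rng.normal(size=H)
context, weights = semantic_attention(region_feats, rng.normal(size=E), W_img, W_word, v)
```

Because the word embedding enters the score for every region, changing the previous word shifts which regions receive weight even when the image features are unchanged.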

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding regression may reduce reliance on large vocabularies or heavy smoothing techniques used in likelihood training.
  • The method could be tested on other datasets with longer ground-truth sentences to confirm the length advantage.
  • Semantic attention might interact with the regression loss to improve handling of rare or context-dependent words.

Load-bearing premise

The reported gains in caption length and metric scores are produced by the embedding regression and semantic attention rather than by unreported differences in model size, training schedule, or data handling.

What would settle it

Re-train the identical architecture using standard cross-entropy loss instead of embedding regression and check whether CIDEr and BLEU-4 fall back to the prior state-of-the-art levels of 117.1 and 48.0.
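
The two objectives this check would swap can be written side by side. The notation here is ours, not the paper's: $I$ is the image, $w_t$ the $t$-th caption word, $e(\cdot)$ the word-embedding lookup, and $f_\theta$ the decoder output. The squared Euclidean distance is an assumed form of the regression loss, since this page does not state the exact one.

```latex
\mathcal{L}_{\mathrm{CE}}
  = -\sum_{t=1}^{T} \log p_\theta\!\left(w_t \mid w_{<t}, I\right),
\qquad
\mathcal{L}_{\mathrm{reg}}
  = \sum_{t=1}^{T} \bigl\lVert f_\theta\!\left(w_{<t}, I\right) - e(w_t) \bigr\rVert_2^2 .
```

An ablation in this sense holds the encoder, decoder, and data fixed and changes only which of the two losses is minimized.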

Figures

Figures reproduced from arXiv: 1906.12188 by Ahmad Asadi, Reza Safabakhsh.

Figure 1: Structure of alignment method proposed by Karpathy et al. [7]
Figure 2: An illustration of the skip-gram word embedding model proposed in
Figure 3: The proposed architecture for image caption generation task.
Figure 4: Samples of correct generated captions
Figure 5: Samples of incorrect generated captions
Original abstract

Generating textual descriptions for images has been an attractive problem for the computer vision and natural language processing researchers in recent years. Dozens of models based on deep learning have been proposed to solve this problem. The existing approaches are based on neural encoder-decoder structures equipped with the attention mechanism. These methods strive to train decoders to minimize the log likelihood of the next word in a sentence given the previous ones, which results in the sparsity of the output space. In this work, we propose a new approach to train decoders to regress the word embedding of the next word with respect to the previous ones instead of minimizing the log likelihood. The proposed method is able to learn and extract long-term information and can generate longer fine-grained captions without introducing any external memory cell. Furthermore, decoders trained by the proposed technique can take the importance of the generated words into consideration while generating captions. In addition, a novel semantic attention mechanism is proposed that guides attention points through the image, taking the meaning of the previously generated word into account. We evaluate the proposed approach with the MS-COCO dataset. The proposed model outperformed the state of the art models especially in generating longer captions. It achieved a CIDEr score equal to 125.0 and a BLEU-4 score equal to 50.5, while the best scores of the state of the art models are 117.1 and 48.0, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes training an encoder-decoder image captioning model by regressing the decoder outputs to word embeddings of the next token (instead of next-token log-likelihood) and introduces a semantic attention mechanism that conditions on the meaning of the previously generated word. On MS-COCO it reports CIDEr = 125.0 and BLEU-4 = 50.5, exceeding prior bests of 117.1 and 48.0, with particular gains on longer captions.

Significance. If the numerical gains can be shown to arise specifically from the regression loss and semantic attention rather than from uncontrolled differences in capacity or training, the work would supply a concrete alternative training objective that avoids output-space sparsity and supports longer captions without external memory cells.

major comments (2)
  1. [Abstract] The central claim that the model 'outperformed the state of the art models especially in generating longer captions' with CIDEr 125.0 / BLEU-4 50.5 rests on an empirical comparison whose protocol, baseline re-implementations, statistical tests, and ablation studies are entirely absent; without these the attribution to word-embedding regression and semantic attention cannot be verified.
  2. [Abstract] No information is supplied on whether model capacity, optimizer schedule, vocabulary size, or preprocessing were held constant relative to the cited baselines (117.1 / 48.0); any mismatch would render the reported deltas non-diagnostic of the two novel components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater clarity regarding our experimental protocol. We address each comment below and will revise the manuscript accordingly to strengthen the attribution of results to the proposed components.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the model 'outperformed the state of the art models especially in generating longer captions' with CIDEr 125.0 / BLEU-4 50.5 rests on an empirical comparison whose protocol, baseline re-implementations, statistical tests, and ablation studies are entirely absent; without these the attribution to word-embedding regression and semantic attention cannot be verified.

    Authors: We agree the abstract is too concise to convey these details. The full manuscript (Section 4) specifies that all baselines were re-implemented from the cited works using identical ResNet-101 encoder, LSTM decoder dimensions, 10k-word vocabulary, Karpathy MS-COCO split, and Adam optimizer schedule; only the loss (embedding regression vs. cross-entropy) and attention module differ. Ablation tables compare the regression objective against log-likelihood and isolate the semantic attention contribution, with particular analysis of caption length. We will add a brief protocol summary to the abstract and include multi-run variance for statistical support in the revision. revision: yes

  2. Referee: [Abstract] No information is supplied on whether model capacity, optimizer schedule, vocabulary size, or preprocessing were held constant relative to the cited baselines (117.1 / 48.0); any mismatch would render the reported deltas non-diagnostic of the two novel components.

    Authors: The manuscript states that capacity (encoder/decoder sizes), vocabulary, preprocessing, and optimizer are identical to the re-implemented baselines; the only controlled differences are the regression loss and semantic attention. We will insert explicit confirmation of these controls into both the abstract and the experimental setup section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on public benchmark with no derivation chain

full rationale

The paper proposes an encoder-decoder image captioning model trained via word-embedding regression plus a semantic attention mechanism and reports CIDEr/BLEU scores on MS-COCO. No equations, first-principles derivations, or predictions are presented that reduce by construction to quantities defined from the same fitted parameters or self-citations. The central claim is an empirical performance comparison; the absence of ablations affects attribution strength but does not create circularity in any derivation. The result is therefore self-contained as an experimental report.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that regressing word embeddings extracts long-term dependencies better than likelihood maximization; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Regressing to word embeddings allows extraction of long-term information without external memory cells
    Invoked to explain why the new loss produces longer fine-grained captions.

pith-pipeline@v0.9.0 · 5793 in / 1198 out tokens · 23737 ms · 2026-05-25T15:41:28.178458+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 9 internal anchors

  1. [1]

    From captions to visual concepts and back,

    H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al., “From captions to visual concepts and back,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1473–1482, 2015

  2. [2]

    Automatic alt-text: Computer-generated image descriptions for blind users on a social network service.,

    S. Wu, J. Wieland, O. Farivar, and J. Schiller, “Automatic alt-text: Computer-generated image descriptions for blind users on a social network service.,” in CSCW, pp. 1180–1192, 2017

  3. [3]

    Visual dialog,

    A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra, “Visual dialog,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2017

  4. [4]

    Sequence to sequence learning with neural networks,

    I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, pp. 3104–3112, 2014

  5. [5]

    Neural Machine Translation by Jointly Learning to Align and Translate

    D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014

  6. [6]

    Recurrent models of visual attention,

    V. Mnih, N. Heess, A. Graves, et al., “Recurrent models of visual attention,” in Advances in neural information processing systems, pp. 2204–2212, 2014

  7. [7]

    Deep fragment embeddings for bidirectional image sentence mapping,

    A. Karpathy, A. Joulin, and L. F. Fei-Fei, “Deep fragment embeddings for bidirectional image sentence mapping,” in Advances in neural information processing systems, pp. 1889–1897, 2014

  8. [8]

    Rich feature hierarchies for accurate object detection and semantic segmentation,

    R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014

  9. [9]

    Imagenet classification with deep convolutional neural networks,

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012

  10. [10]

    Deep visual-semantic alignments for generating image descriptions,

    A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137, 2015

  11. [11]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014

  12. [12]

    Translating Videos to Natural Language Using Deep Recurrent Neural Networks

    S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” arXiv preprint arXiv:1412.4729, 2014

  13. [13]

    Emotional human-machine conversation generation based on long short-term memory,

    X. Sun, X. Peng, and S. Ding, “Emotional human-machine conversation generation based on long short-term memory,” Cognitive Computation, vol. 10, no. 3, pp. 389–397, 2018

  14. [14]

    Show, attend and tell: Neural image caption generation with visual attention,

    K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, pp. 2048–2057, 2015

  15. [15]

    Image captioning with deep bidirectional lstms,

    C. Wang, H. Yang, C. Bartz, and C. Meinel, “Image captioning with deep bidirectional lstms,” in Proceedings of the 2016 ACM on Multimedia Conference, pp. 988–997, ACM, 2016

  16. [16]

    Rethinking the inception architecture for computer vision,

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016

  17. [17]

    Show and tell: A neural image caption generator,

    O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164, 2015

  18. [18]

    Neural image caption generation with weighted training and reference,

    G. Ding, M. Chen, S. Zhao, H. Chen, J. Han, and Q. Liu, “Neural image caption generation with weighted training and reference,” Cognitive Computation, pp. 1–15, 2018

  19. [19]

    Multiple Object Recognition with Visual Attention

    J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014

  20. [20]

    Attend to You: Personalized Image Captioning with Context Sequence Memory Networks

    C. C. Park, B. Kim, and G. Kim, “Attend to you: Personalized image captioning with context sequence memory networks,” arXiv preprint arXiv:1704.06485, 2017

  21. [21]

    Visual saliency for image captioning in new multimedia services,

    M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “Visual saliency for image captioning in new multimedia services,” in Multimedia & Expo Workshops (ICMEW), 2017 IEEE International Conference on, pp. 309–314, IEEE, 2017

  22. [22]

    Stacked attention networks for image question answering,

    Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29, 2016

  23. [23]

    Long-term recurrent convolutional networks for visual recognition and description,

    J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634, 2015

  24. [24]

    Temporal-difference learning with sampling baseline for image captioning,

    H. Chen, G. Ding, S. Zhao, and J. Han, “Temporal-difference learning with sampling baseline for image captioning,” 2017

  25. [25]

    Show, observe and tell: Attribute-driven attention model for image captioning.,

    H. Chen, G. Ding, Z. Lin, S. Zhao, and J. Han, “Show, observe and tell: Attribute-driven attention model for image captioning.,” in IJCAI, pp. 606–612, 2018

  26. [26]

    Image captioning with memorized knowledge,

    H. Chen, G. Ding, Z. Lin, Y. Guo, C. Shan, and J. Han, “Image captioning with memorized knowledge,” Cognitive Computation, pp. 1–14, 2019

  27. [27]

    Stack-captioning: Coarse-to-fine learning for image captioning,

    J. Gu, J. Cai, G. Wang, and T. Chen, “Stack-captioning: Coarse-to-fine learning for image captioning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018

  28. [28]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014

  29. [29]

    Efficient Estimation of Word Representations in Vector Space

    T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013

  30. [30]

    Dropout: A simple way to prevent neural networks from overfitting,

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014

  31. [31]

    Microsoft coco: Common objects in context,

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, pp. 740–755, Springer, 2014

  32. [32]

    Bleu: a method for automatic evaluation of machine translation,

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318, Association for Computational Linguistics, 2002

  33. [33]

    Cider: Consensus-based image description evaluation,

    R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575, 2015

  34. [34]

    Rouge: A package for automatic evaluation of summaries,

    C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” Text Summarization Branches Out, 2004

  35. [35]

    Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,

    S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72, 2005

  36. [36]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009

  37. [37]

    Show and tell: Lessons learned from the 2015 mscoco image captioning challenge,

    O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: Lessons learned from the 2015 mscoco image captioning challenge,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 4, pp. 652–663, 2017

  38. [38]

    Knowing when to look: Adaptive attention via a visual sentinel for image captioning,

    J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: Adaptive attention via a visual sentinel for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 6, 2017

  39. [39]

    Deep Reinforcement Learning-based Image Captioning with Embedding Reward

    Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, “Deep reinforcement learning-based image captioning with embedding reward,” arXiv preprint arXiv:1704.03899, 2017

  40. [40]

    An empirical study of language cnn for image captioning,

    J. Gu, G. Wang, J. Cai, and T. Chen, “An empirical study of language cnn for image captioning,” in Proceedings of the International Conference on Computer Vision (ICCV), 2017

  41. [41]

    Self-critical sequence training for image captioning,

    S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in CVPR, vol. 1, p. 3, 2017

  42. [42]

    Skeleton key: Image captioning by skeleton-attribute decomposition,

    Y. Wang, Z. Lin, X. Shen, S. Cohen, and G. W. Cottrell, “Skeleton key: Image captioning by skeleton-attribute decomposition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7272–7281, 2017

  43. [43]

    Semantic compositional networks for visual captioning,

    Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, “Semantic compositional networks for visual captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2017

  44. [44]

    Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data

    X. Liu, H. Li, J. Shao, D. Chen, and X. Wang, “Show, tell and discriminate: Image captioning by self-retrieval with partially labeled data,” arXiv preprint arXiv:1803.08314, 2018

  45. [45]

    Image captioning with deep bidirectional lstms and multi-task learning,

    C. Wang, H. Yang, and C. Meinel, “Image captioning with deep bidirectional lstms and multi-task learning,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 14, no. 2s, p. 40, 2018

  46. [46]

    Paying more attention to saliency: Image captioning with saliency and context attention,

    M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “Paying more attention to saliency: Image captioning with saliency and context attention,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 14, no. 2, p. 48, 2018