Image Captioning via Compact Bidirectional Architecture

Daqing Liu; Huixia Ben; Meng Wang; Richang Hong; Yuanen Zhou; Zhenzhen Hu; Zijie Song

arxiv: 2201.01984 · v3 · submitted 2022-01-06 · 💻 cs.CV · cs.CL

Zijie Song, Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Richang Hong, Meng Wang

Pith reviewed 2026-05-24 12:16 UTC · model grok-4.3


keywords image captioning · bidirectional transformer · compact architecture · sentence-level ensemble · self-critical training · MSCOCO · left-to-right right-to-left flows


The pith

A compact model fuses left-to-right and right-to-left flows to generate image captions using bidirectional context while decoding in parallel.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a Compact Bidirectional Transformer that couples left-to-right (L2R) and right-to-left (R2L) generation flows inside one decoder. The coupling acts as a regularizer, so the model can draw on future context without running two separate networks in sequence. Ablations indicate that the compact coupling and the sentence-level choice between the two flows matter more than the optional explicit interaction mechanism. Adding word-level ensemble enlarges the gain from sentence-level ensemble, and extending self-critical training to both flows yields new state-of-the-art scores on MSCOCO among models that do not use vision-language pretraining. The same compact design also works when the backbone is switched to an LSTM.

Core claim

Tightly coupling L2R and R2L flows into a single compact model serves as effective regularization for implicitly exploiting bidirectional context; the final caption is then selected from either flow via sentence-level ensemble, and this architecture supports a two-flow version of self-critical training that reaches new state-of-the-art results on MSCOCO without vision-language pretraining.
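The sentence-level ensemble named in the core claim can be pictured as a simple selection step. The sketch below is illustrative only. It assumes each flow returns one caption with per-token log-probabilities, that the R2L caption is stored last word first, and that the winner is the caption with the higher mean log-probability. The selection criterion, the `select_caption` name, and the toy scores are assumptions, not details confirmed by the paper.

```python
from typing import List, Tuple

# (tokens, per-token log-probabilities) for one decoded caption
Candidate = Tuple[List[str], List[float]]

def select_caption(l2r: Candidate, r2l: Candidate) -> List[str]:
    """Pick one caption from the two flows (hypothetical criterion)."""
    l2r_tokens, l2r_logp = l2r
    r2l_tokens, r2l_logp = r2l
    # Assumption: compare mean log-probability so that caption length
    # alone does not decide the winner.
    l2r_score = sum(l2r_logp) / max(len(l2r_logp), 1)
    r2l_score = sum(r2l_logp) / max(len(r2l_logp), 1)
    if l2r_score >= r2l_score:
        return l2r_tokens
    # The R2L flow emits words last to first, so restore reading order.
    return list(reversed(r2l_tokens))

# Toy candidates with made-up scores, for illustration only.
l2r = (["a", "dog", "on", "grass"], [-0.2, -0.9, -0.4, -0.5])
r2l = (["grass", "the", "on", "running", "dog", "a"],
       [-0.3, -0.4, -0.2, -0.5, -0.3, -0.1])
print(" ".join(select_caption(l2r, r2l)))
```

Because the choice is made per sentence, the two flows can decode in parallel and only their final scores need to be compared.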

What carries the argument

Compact Bidirectional Transformer that tightly couples L2R and R2L flows into one model to regularize for bidirectional context while allowing parallel execution and sentence-level ensemble selection.
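One way to read "tightly coupling" is that a single shared decoder is trained on both reading orders of every reference caption. The following is a minimal sketch under that reading. The direction tokens `<l2r>` and `<r2l>`, the `<eos>` marker, and both helper names are hypothetical, chosen for illustration rather than taken from the released code.

```python
from typing import List, Tuple

# Hypothetical special tokens that tell the shared decoder which flow to run.
L2R, R2L, EOS = "<l2r>", "<r2l>", "<eos>"

def two_flow_targets(caption: List[str]) -> Tuple[List[str], List[str]]:
    """Return the L2R and R2L target sequences for one reference caption."""
    l2r = [L2R] + caption + [EOS]
    r2l = [R2L] + caption[::-1] + [EOS]  # same words, reversed order
    return l2r, r2l

def build_batch(captions: List[List[str]]) -> List[List[str]]:
    """Place both directions in one batch so one set of weights sees both."""
    batch: List[List[str]] = []
    for caption in captions:
        batch.extend(two_flow_targets(caption))
    return batch

for sequence in build_batch([["a", "cat", "sleeps"]]):
    print(sequence)
```

Under this assumption the two flows share every decoder parameter and differ only in the start token and the word order, which is what lets them run side by side in one forward pass.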

If this is right

The decoder runs in parallel instead of requiring sequential stages.
Sentence-level ensemble between the two flows improves final captions.
Word-level ensemble can be added on top to enlarge the ensemble gain.
Two-flow self-critical training yields higher scores than the conventional one-flow version.
The same coupling pattern transfers to an LSTM decoder backbone.
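The two-flow self-critical point above can be made concrete with a small sketch. Self-critical training weights the log-likelihood of a sampled caption by how much its reward (CIDEr, in the original self-critical formulation) exceeds the reward of a greedy-decoded baseline. The version below simply applies that loss to each flow and averages the results. The averaging rule, the per-flow baseline, and the numbers are assumptions for illustration, not the paper's confirmed formulation.

```python
from typing import Dict, List

def scst_loss(sample_logp: List[float], sample_reward: float,
              baseline_reward: float) -> float:
    """REINFORCE-style loss for one sampled caption with a greedy baseline."""
    advantage = sample_reward - baseline_reward
    # Minimizing this raises the likelihood of samples that beat the baseline
    # and lowers the likelihood of samples that fall short of it.
    return -advantage * sum(sample_logp)

def two_flow_scst_loss(l2r: Dict, r2l: Dict) -> float:
    """Average the per-flow losses (assumed combination rule)."""
    return 0.5 * (scst_loss(**l2r) + scst_loss(**r2l))

# Made-up rewards and log-probabilities, for illustration only.
l2r = {"sample_logp": [-0.5, -0.5], "sample_reward": 1.2, "baseline_reward": 1.0}
r2l = {"sample_logp": [-1.0], "sample_reward": 0.7, "baseline_reward": 1.0}
loss = two_flow_scst_loss(l2r, r2l)
print(round(loss, 4))
```

In this toy case the L2R sample beats its baseline and the R2L sample does not, so the two terms pull in opposite directions before they are averaged.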

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on other autoregressive tasks such as machine translation to check whether the same compact coupling reduces the usual cost of bidirectional decoding.
If the coupling mainly acts as regularization, performance gains should be largest in low-data captioning settings where overfitting is a concern.
One could measure whether the shared parameters force the two flows to learn complementary rather than redundant features by inspecting their attention patterns on the same image.

Load-bearing premise

Tightly coupling the two directional flows inside one shared model is what actually supplies useful bidirectional regularization rather than merely saving parameters.

What would settle it

Train two independent L2R and R2L models with the same total parameter count and compare their ensemble performance on the MSCOCO test set; if the separate models match or exceed the compact version, the regularization benefit of coupling would not hold.
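Setting up that control requires sizing the two independent decoders so that together they match the compact model's budget. The sketch below counts parameters for a standard Transformer decoder layer (self-attention, cross-attention, feed-forward, three layer norms, all with biases) and searches for the widest pair of decoders that fits. It ignores embeddings, the output projection, and the encoder, and the 512/2048/6-layer configuration is an assumed example rather than the paper's reported setting.

```python
def decoder_layer_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one standard Transformer decoder layer."""
    attn = 4 * (d_model * d_model + d_model)   # Q, K, V, output projections
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
    norms = 3 * 2 * d_model                    # three LayerNorms (gain + bias)
    return 2 * attn + ffn + norms              # self-attention + cross-attention

def decoder_params(d_model: int, d_ff: int, n_layers: int) -> int:
    return n_layers * decoder_layer_params(d_model, d_ff)

def matched_width(budget: int, n_layers: int, n_models: int = 2,
                  ff_mult: int = 4, step: int = 8):
    """Largest d_model (multiple of step) so n_models decoders fit the budget."""
    best, d = None, step
    while n_models * decoder_params(d, ff_mult * d, n_layers) <= budget:
        best, d = d, d + step
    return best

# Assumed compact configuration, for illustration only.
compact = decoder_params(d_model=512, d_ff=2048, n_layers=6)
width = matched_width(compact, n_layers=6)
print(compact, width)
```

With these assumptions, each of the two independent decoders has to be noticeably narrower than the compact one, which is why a matched-budget control is a stricter comparison than simply training two full-size models.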

Figures

Figures reproduced from arXiv: 2201.01984 by Daqing Liu, Huixia Ben, Meng Wang, Richang Hong, Yuanen Zhou, Zhenzhen Hu, Zijie Song.

**Figure 2.** Illustration of Compact Bidirectional Transformer for Image Captioning (CBTIC). CBTIC model composes of an [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Examples of captions generated by our CBTIC model, conventional unidirectional Transformer model and human [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]

Original abstract

Most current image captioning models typically generate captions from left-to-right. This unidirectional property makes them can only leverage past context but not future context. Though refinement-based models can exploit both past and future context by generating a new caption in the second stage based on pre-retrieved or pre-generated captions in the first stage, the decoder of these models generally consists of two networks (i.e. a retriever or captioner in the first stage and a captioner in the second stage), which can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly while the decoder can be executed parallelly. Specifically, it is implemented by tightly coupling left-to-right (L2R) and right-to-left (R2L) flows into a single compact model to serve as a regularization for implicitly exploiting bidirectional context and optionally allowing explicit interaction of the bidirectional flows, while the final caption is chosen from either L2R or R2L flow in a sentence-level ensemble manner. We conduct extensive ablation studies on MSCOCO benchmark and find that the compact bidirectional architecture and the sentence-level ensemble play more important roles than the explicit interaction mechanism. By combining with word-level ensemble seamlessly, the effect of sentence-level ensemble is further enlarged. We further extend the conventional one-flow self-critical training to the two-flows version under this architecture and achieve new state-of-the-art results in comparison with non-vision-language-pretraining models. Finally, we verify the generality of this compact bidirectional architecture by extending it to LSTM backbone. Source code is available at https://github.com/YuanEZhou/cbtic.
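The abstract mentions combining sentence-level ensemble with word-level ensemble. Word-level ensembling usually means averaging the next-word distributions of several models at every decoding step before choosing a word. The sketch below shows that averaging for a single step. The function name, the dictionary representation, and the probabilities are assumptions, and the paper's exact combination scheme is not reproduced here.

```python
from typing import Dict, List, Tuple

def word_level_ensemble(
    step_distributions: List[Dict[str, float]],
) -> Tuple[str, Dict[str, float]]:
    """Average next-word distributions from several models and take the argmax."""
    n = len(step_distributions)
    vocab = set().union(*step_distributions)
    # Words missing from a model's distribution count as probability 0.
    avg = {w: sum(d.get(w, 0.0) for d in step_distributions) / n for w in vocab}
    return max(avg, key=avg.get), avg

# Made-up next-word probabilities from two models, for illustration only.
model_a = {"dog": 0.6, "cat": 0.4}
model_b = {"dog": 0.3, "cat": 0.7}
word, averaged = word_level_ensemble([model_a, model_b])
print(word, averaged)
```

A plausible way to stack the two schemes, again as an assumption, is to apply this averaging inside each direction across several trained models and then let the sentence-level step choose between the resulting L2R and R2L captions.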

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper offers a compact bidirectional decoder for image captioning with sentence ensemble and two-flow training, but the regularization benefit of the coupling is not isolated in the experiments.


The one thing to take away is that this paper gives a compact way to do bidirectional image captioning in a single parallel decoder instead of the usual two-stage sequential setup. They couple the L2R and R2L flows tightly, pick the best sentence from either, and train with a two-flow version of self-critical sequence training. On MSCOCO they beat previous non-VLP models.

The architecture is the main novelty. By putting both directions in one compact model they avoid running two networks one after the other. The sentence-level ensemble is straightforward but effective according to their ablations. They also show the same idea works when the backbone is an LSTM rather than a transformer. The fact that they release the code at the GitHub link is helpful for anyone who wants to build on it.

The soft spot is around the claim that the tight coupling provides regularization that lets the model exploit bidirectional context implicitly. The ablations compare different versions of their own model but do not include a control with two separate unidirectional models that have the same total parameter count. Without that, it's hard to know whether the improvement comes from the bidirectional aspect or just from having more capacity or more training examples. The explicit interaction mechanism turns out to be less important than the compact architecture itself, which is an interesting finding but also suggests the bidirectional regularization story might be secondary.

Overall this is a solid incremental paper for the image captioning community. Readers who care about making decoders more efficient or about using both directions without extra stages will find the details useful. The experiments are on the standard benchmark with ablations, so the claims can be checked. I think it deserves to go to peer review. The core idea is clear and the results are presented with enough supporting experiments to make it worth a referee's time.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Compact Bidirectional Transformer for Image Captioning (CBTIC) that tightly couples L2R and R2L flows into one compact model to implicitly exploit bidirectional context (with optional explicit interaction), uses sentence-level ensemble for final caption selection, extends self-critical training to two flows, and reports new SOTA results on MSCOCO among non-VLP models. Extensive ablations indicate that the compact architecture and the sentence-level ensemble matter more than explicit interaction; the approach also generalizes to an LSTM backbone, and source code is released.

Significance. If the gains are shown to arise specifically from the bidirectional regularization effect of tight coupling (rather than capacity or ensemble alone), the work would provide an efficient parallelizable alternative to refinement-based or separate bidirectional models. The release of source code, the reported ablations, and the LSTM extension are positive elements that support reproducibility and generality.

major comments (2)

[Ablations (MSCOCO experiments)] Ablations section (around the MSCOCO experiments): the reported comparisons isolate the role of explicit interaction but do not include a control consisting of two independent unidirectional models whose total parameter count matches the compact bidirectional model. Without this baseline it is not possible to determine whether observed improvements derive from the claimed regularization effect of tight L2R-R2L coupling or from parameter sharing and doubled training signal.
[Architecture and Ablations] Architecture description and results: the central claim that 'tightly coupling L2R and R2L flows into a single compact model [serves] as a regularization for implicitly exploiting bidirectional context' is load-bearing for the interpretation of the SOTA numbers, yet the ablation tables do not quantify the implicit bidirectional exploitation separately from the ensemble and capacity effects.

minor comments (1)

[Abstract] The abstract contains minor grammatical issues (e.g., 'makes them can only leverage') that should be corrected for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and suggestions. We address the major comments point by point below.


Referee: [Ablations (MSCOCO experiments)] Ablations section (around the MSCOCO experiments): the reported comparisons isolate the role of explicit interaction but do not include a control consisting of two independent unidirectional models whose total parameter count matches the compact bidirectional model. Without this baseline it is not possible to determine whether observed improvements derive from the claimed regularization effect of tight L2R-R2L coupling or from parameter sharing and doubled training signal.

Authors: We agree that a control consisting of two independent unidirectional models with total parameter count matched to the compact bidirectional model would provide stronger evidence to isolate the regularization effect of tight L2R-R2L coupling from capacity and doubled training signal. Our existing ablations compare the compact model against standard single-flow baselines and vary explicit interaction, but do not include this exact matched-capacity control. We will add this baseline ablation in the revised manuscript. revision: yes
Referee: [Architecture and Ablations] Architecture description and results: the central claim that 'tightly coupling L2R and R2L flows into a single compact model [serves] as a regularization for implicitly exploiting bidirectional context' is load-bearing for the interpretation of the SOTA numbers, yet the ablation tables do not quantify the implicit bidirectional exploitation separately from the ensemble and capacity effects.

Authors: The ablation studies demonstrate that the compact architecture yields gains beyond explicit interaction alone and that the sentence-level ensemble is a major contributor. We acknowledge that the tables do not provide a direct, separate quantification of the implicit bidirectional exploitation effect independent of capacity and ensemble. We will revise the discussion to more clearly acknowledge this limitation in the current evidence and to temper the interpretation of the central claim accordingly. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical architecture validated by ablations on external benchmark


The paper proposes a compact bidirectional transformer architecture for image captioning and reports results from ablations on the MSCOCO benchmark, including comparisons of compact vs. non-compact variants, sentence-level ensemble, and two-flow self-critical training. All load-bearing claims (SOTA among non-VLP models, importance of compact coupling and ensemble) rest on direct experimental measurements rather than any derivation that reduces by construction to fitted parameters or self-citations. No mathematical predictions, uniqueness theorems, or ansatzes are invoked that loop back to the paper's own inputs; the work is self-contained against the external MSCOCO test set.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the standard transformer and LSTM sequence modeling assumptions plus the MSCOCO benchmark and self-critical sequence training framework; no new free parameters, axioms, or invented entities are extractable from the abstract alone.

axioms (1)

domain assumption: Standard transformer decoder assumptions for autoregressive sequence generation. Invoked as the backbone for the compact bidirectional model.


Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 5 internal anchors

[1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page
[3]

Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2016. Spice: Semantic propositional image caption evaluation. In European conference on computer vision, 382--398. Springer

work page 2016
[4]

Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6077--6086

work page 2018
[5]

Layer Normalization

Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450

work page internal anchor Pith review Pith/arXiv arXiv 2016
[6]

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473

work page internal anchor Pith review Pith/arXiv arXiv 2014
[7]

Banerjee, S.; and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 65--72

work page 2005
[8]

Caruana, R. 1997. Multitask learning. Machine learning, 28(1): 41--75

work page 1997
[9]

Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Doll \'a r, P.; and Zitnick, C. L. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325

work page internal anchor Pith review Pith/arXiv arXiv 2015
[10]

Chen, Y.-C.; Gan, Z.; Cheng, Y.; Liu, J.; and Liu, J. 2020. Distilling Knowledge Learned in BERT for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7893--7905

work page 2020
[11]

Cornia, M.; Stefanini, M.; Baraldi, L.; and Cucchiara, R. 2020. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10578--10587

work page 2020
[12]

G.; and Forsyth, D

Deshpande, A.; Aneja, J.; Wang, L.; Schwing, A. G.; and Forsyth, D. 2019. Fast, diverse and accurate image captioning guided by part-of-speech. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10695--10704

work page 2019
[13]

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Elliott, D.; Frank, S.; and Hasler, E. 2015. Multilingual image description with neural sequence models. arXiv preprint arXiv:1510.04709

work page internal anchor Pith review Pith/arXiv arXiv 2015
[15]

Gu, J.; Wang, G.; Cai, J.; and Chen, T. 2017. An empirical study of language cnn for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, 1222--1231

work page 2017
[16]

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770--778

work page 2016
[17]

Hou, J.; Wu, X.; Zhao, W.; Luo, J.; and Jia, Y. 2019. Joint syntax representation learning and visual cue translation for video captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8918--8927

work page 2019
[18]

Huang, L.; Wang, W.; Chen, J.; and Wei, X.-Y. 2019. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4634--4643

work page 2019
[19]

Ji, J.; Luo, Y.; Sun, X.; Chen, F.; Luo, G.; Wu, Y.; Gao, Y.; and Ji, R. 2021. Improving image captioning by leveraging intra-and inter-layer global representation in transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 1655--1663

work page 2021
[20]

Jiang, H.; Misra, I.; Rohrbach, M.; Learned-Miller, E.; and Chen, X. 2020. In defense of grid features for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10267--10276

work page 2020
[21]

Jiang, W.; Ma, L.; Jiang, Y.-G.; Liu, W.; and Zhang, T. 2018. Recurrent fusion network for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), 499--515

work page 2018
[22]

Karpathy, A.; and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3128--3137

work page 2015
[23]

Li, G.; Zhu, L.; Liu, P.; and Yang, Y. 2019. Entangled transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8928--8937

work page 2019
[24]

Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, 121--137. Springer

work page 2020
[25]

Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 74--81

work page 2004
[26]

Luo, R. 2020. A Better Variant of Self-Critical Sequence Training. arXiv preprint arXiv:2003.09971

work page arXiv 2020
[27]

Pan, Y.; Yao, T.; Li, Y.; and Mei, T. 2020. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10971--10980

work page 2020
[28]

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311--318

work page 2002
[29]

Qin, Y.; Du, J.; Zhang, Y.; and Lu, H. 2019. Look back and predict forward in image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8367--8375

work page 2019
[30]

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28: 91--99

work page 2015
[31]

J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V

Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7008--7024

work page 2017
[32]

Sammani, F.; and Elsayed, M. 2019. Look and modify: Modification networks for image captioning. arXiv preprint arXiv:1909.03169

work page arXiv 2019
[33]

Sammani, F.; and Melas-Kyriazi, L. 2020. Show, edit and tell: A framework for editing image captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4808--4816

work page 2020
[34]

Song, Z.; Zhou, X.; Mao, Z.; and Tan, J. 2021. Image Captioning with Context-Aware Auxiliary Guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2584--2592

work page 2021
[35]

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 3104--3112

work page 2014
[36]

N.; Kaiser, .; and Polosukhin, I

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, .; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998--6008

work page 2017
[37]

Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4566--4575

work page 2015
[38]

Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3156--3164

work page 2015
[39]

Wang, B.; Ma, L.; Zhang, W.; Jiang, W.; Wang, J.; and Liu, W. 2019 a . Controllable video captioning with pos sequence guidance based on gated fusion network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2641--2650

work page 2019
[40]

Wang, C.; Yang, H.; Bartz, C.; and Meinel, C. 2016. Image captioning with deep bidirectional LSTMs. In Proceedings of the 24th ACM international conference on Multimedia, 988--997

work page 2016
[41]

Wang, L.; Bai, Z.; Zhang, Y.; and Lu, H. 2020. Show, Recall, and Tell: Image Captioning with Recall Mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 12176--12183

work page 2020
[42]

Wang, X.; Wu, J.; Chen, J.; Li, L.; Wang, Y.-F.; and Wang, W. Y. 2019 b . Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4581--4591

work page 2019
[43]

Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, 2048--2057. PMLR

work page 2015
[44]

Yang, X.; Tang, K.; Zhang, H.; and Cai, J. 2019. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10685--10694

work page 2019
[45]

Yao, T.; Pan, Y.; Li, Y.; and Mei, T. 2018. Exploring visual relationship for image captioning. In Proceedings of the European conference on computer vision (ECCV), 684--699

work page 2018
[46]

Yao, T.; Pan, Y.; Li, Y.; and Mei, T. 2019. Hierarchy parsing for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2621--2629

work page 2019
[47]

Yao, T.; Pan, Y.; Li, Y.; Qiu, Z.; and Mei, T. 2017. Boosting image captioning with attributes. In Proceedings of the IEEE international conference on computer vision, 4894--4902

work page 2017
[48]

Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; and Gao, J. 2021 a . Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5579--5588

work page 2021
[49]

Zhang, X.; Su, J.; Qin, Y.; Liu, Y.; Ji, R.; and Wang, H. 2018. Asynchronous bidirectional decoding for neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32

work page 2018
[50]

Zhang, X.; Sun, X.; Luo, Y.; Ji, J.; Zhou, Y.; Wu, Y.; Huang, F.; and Ji, R. 2021 b . RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15465--15474

work page 2021
[51]

Zhang, Z.; Qi, Z.; Yuan, C.; Shan, Y.; Li, B.; Deng, Y.; and Hu, W. 2021 c . Open-book Video Captioning with Retrieve-Copy-Generate Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9837--9846

work page 2021
[52]

Zhang, Z.; Wu, S.; Liu, S.; Li, M.; Zhou, M.; and Xu, T. 2019. Regularizing neural machine translation by target-bidirectional agreement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 443--450

work page 2019
[53]

Zhao, W.; Wang, B.; Ye, J.; Yang, M.; Zhao, Z.; Luo, R.; and Qiao, Y. 2018. A Multi-task Learning Approach for Image Captioning. In IJCAI, 1205--1211

work page 2018
[54]

Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; and Gao, J. 2020 a . Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 13041--13049

work page 2020
[55]

Zhou, L.; Zhang, J.; and Zong, C. 2019. Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics, 7: 91--105

work page 2019
[56]

Zhou, Y.; Wang, M.; Liu, D.; Hu, Z.; and Zhang, H. 2020 b . More grounded image captioning by distilling image-text matching model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4777--4786

work page 2020
[57]

Zhou, Y.; Zhang, Y.; Hu, Z.; and Wang, M. 2021. Semi-Autoregressive Transformer for Image Captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 3139--3143

work page 2021

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2016. Spice: Semantic propositional image caption evaluation. In European conference on computer vision, 382--398. Springer

work page 2016

[4] [4]

Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6077--6086

work page 2018

[5] [5]

Layer Normalization

Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450

work page internal anchor Pith review Pith/arXiv arXiv 2016

[6] [6]

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473

work page internal anchor Pith review Pith/arXiv arXiv 2014

[7] [7]

Banerjee, S.; and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 65--72

work page 2005

[8] [8]

Caruana, R. 1997. Multitask learning. Machine learning, 28(1): 41--75

work page 1997

[9] [9]

Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Doll \'a r, P.; and Zitnick, C. L. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325

work page internal anchor Pith review Pith/arXiv arXiv 2015

[10] [10]

Chen, Y.-C.; Gan, Z.; Cheng, Y.; Liu, J.; and Liu, J. 2020. Distilling Knowledge Learned in BERT for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7893--7905

work page 2020

[11] [11]

Cornia, M.; Stefanini, M.; Baraldi, L.; and Cucchiara, R. 2020. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10578--10587

work page 2020

[12] [12]

G.; and Forsyth, D

Deshpande, A.; Aneja, J.; Wang, L.; Schwing, A. G.; and Forsyth, D. 2019. Fast, diverse and accurate image captioning guided by part-of-speech. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10695--10704

work page 2019

[13] [13]

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2018

[14] [14]

Elliott, D.; Frank, S.; and Hasler, E. 2015. Multilingual image description with neural sequence models. arXiv preprint arXiv:1510.04709

work page internal anchor Pith review Pith/arXiv arXiv 2015

[15] [15]

Gu, J.; Wang, G.; Cai, J.; and Chen, T. 2017. An empirical study of language cnn for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, 1222--1231

work page 2017

[16] [16]

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770--778

work page 2016

[17] [17]

Hou, J.; Wu, X.; Zhao, W.; Luo, J.; and Jia, Y. 2019. Joint syntax representation learning and visual cue translation for video captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8918--8927

[18]

Huang, L.; Wang, W.; Chen, J.; and Wei, X.-Y. 2019. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4634--4643

[19]

Ji, J.; Luo, Y.; Sun, X.; Chen, F.; Luo, G.; Wu, Y.; Gao, Y.; and Ji, R. 2021. Improving image captioning by leveraging intra-and inter-layer global representation in transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 1655--1663

[20]

Jiang, H.; Misra, I.; Rohrbach, M.; Learned-Miller, E.; and Chen, X. 2020. In defense of grid features for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10267--10276

[21]

Jiang, W.; Ma, L.; Jiang, Y.-G.; Liu, W.; and Zhang, T. 2018. Recurrent fusion network for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), 499--515

[22]

Karpathy, A.; and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3128--3137

[23]

Li, G.; Zhu, L.; Liu, P.; and Yang, Y. 2019. Entangled transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8928--8937

[24]

Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, 121--137. Springer

[25]

Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 74--81

[26]

Luo, R. 2020. A Better Variant of Self-Critical Sequence Training. arXiv preprint arXiv:2003.09971

[27]

Pan, Y.; Yao, T.; Li, Y.; and Mei, T. 2020. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10971--10980

[28]

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311--318

[29]

Qin, Y.; Du, J.; Zhang, Y.; and Lu, H. 2019. Look back and predict forward in image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8367--8375

[30]

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28: 91--99

[31]

Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7008--7024

[32]

Sammani, F.; and Elsayed, M. 2019. Look and modify: Modification networks for image captioning. arXiv preprint arXiv:1909.03169

[33]

Sammani, F.; and Melas-Kyriazi, L. 2020. Show, edit and tell: A framework for editing image captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4808--4816

[34]

Song, Z.; Zhou, X.; Mao, Z.; and Tan, J. 2021. Image Captioning with Context-Aware Auxiliary Guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2584--2592

[35]

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 3104--3112

[36]

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998--6008

[37]

Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4566--4575

[38]

Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3156--3164

[39]

Wang, B.; Ma, L.; Zhang, W.; Jiang, W.; Wang, J.; and Liu, W. 2019a. Controllable video captioning with pos sequence guidance based on gated fusion network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2641--2650

[40]

Wang, C.; Yang, H.; Bartz, C.; and Meinel, C. 2016. Image captioning with deep bidirectional LSTMs. In Proceedings of the 24th ACM international conference on Multimedia, 988--997

[41]

Wang, L.; Bai, Z.; Zhang, Y.; and Lu, H. 2020. Show, Recall, and Tell: Image Captioning with Recall Mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 12176--12183

[42]

Wang, X.; Wu, J.; Chen, J.; Li, L.; Wang, Y.-F.; and Wang, W. Y. 2019b. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4581--4591

[43]

Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, 2048--2057. PMLR

[44]

Yang, X.; Tang, K.; Zhang, H.; and Cai, J. 2019. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10685--10694

[45]

Yao, T.; Pan, Y.; Li, Y.; and Mei, T. 2018. Exploring visual relationship for image captioning. In Proceedings of the European conference on computer vision (ECCV), 684--699

[46]

Yao, T.; Pan, Y.; Li, Y.; and Mei, T. 2019. Hierarchy parsing for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2621--2629

[47]

Yao, T.; Pan, Y.; Li, Y.; Qiu, Z.; and Mei, T. 2017. Boosting image captioning with attributes. In Proceedings of the IEEE international conference on computer vision, 4894--4902

[48]

Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; and Gao, J. 2021a. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5579--5588

[49]

Zhang, X.; Su, J.; Qin, Y.; Liu, Y.; Ji, R.; and Wang, H. 2018. Asynchronous bidirectional decoding for neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32

[50]

Zhang, X.; Sun, X.; Luo, Y.; Ji, J.; Zhou, Y.; Wu, Y.; Huang, F.; and Ji, R. 2021b. RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15465--15474

[51]

Zhang, Z.; Qi, Z.; Yuan, C.; Shan, Y.; Li, B.; Deng, Y.; and Hu, W. 2021c. Open-book Video Captioning with Retrieve-Copy-Generate Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9837--9846

[52]

Zhang, Z.; Wu, S.; Liu, S.; Li, M.; Zhou, M.; and Xu, T. 2019. Regularizing neural machine translation by target-bidirectional agreement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 443--450

[53]

Zhao, W.; Wang, B.; Ye, J.; Yang, M.; Zhao, Z.; Luo, R.; and Qiao, Y. 2018. A Multi-task Learning Approach for Image Captioning. In IJCAI, 1205--1211

[54]

Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; and Gao, J. 2020a. Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 13041--13049

[55]

Zhou, L.; Zhang, J.; and Zong, C. 2019. Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics, 7: 91--105

[56]

Zhou, Y.; Wang, M.; Liu, D.; Hu, Z.; and Zhang, H. 2020b. More grounded image captioning by distilling image-text matching model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4777--4786

[57]

Zhou, Y.; Zhang, Y.; Hu, Z.; and Wang, M. 2021. Semi-Autoregressive Transformer for Image Captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 3139--3143
