Image Captioning via Compact Bidirectional Architecture
Pith reviewed 2026-05-24 12:16 UTC · model grok-4.3
The pith
A compact model fuses left-to-right and right-to-left flows to generate image captions using bidirectional context while decoding in parallel.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tightly coupling L2R and R2L flows into a single compact model serves as effective regularization for implicitly exploiting bidirectional context; the final caption is then selected from either flow via sentence-level ensemble, and this architecture supports a two-flow version of self-critical training that reaches new state-of-the-art results on MSCOCO without vision-language pretraining.
What carries the argument
Compact Bidirectional Transformer that tightly couples L2R and R2L flows into one model to regularize for bidirectional context while allowing parallel execution and sentence-level ensemble selection.
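The sentence-level ensemble step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each flow returns its tokens together with per-token log-probabilities, and that the sentence score is their plain sum (the summary does not state the exact scoring rule, so length normalization may or may not be used). The names `sequence_log_prob` and `select_caption` are hypothetical.

```python
from typing import List, Tuple

Flow = Tuple[List[str], List[float]]  # (tokens, per-token log-probabilities)


def sequence_log_prob(token_log_probs: List[float]) -> float:
    """Sentence-level score: summed token log-probabilities (an assumption)."""
    return sum(token_log_probs)


def select_caption(l2r: Flow, r2l: Flow) -> List[str]:
    """Pick the caption from whichever flow scores higher."""
    l2r_tokens, l2r_lp = l2r
    r2l_tokens, r2l_lp = r2l
    # The R2L flow emits words last-to-first, so restore reading order.
    r2l_restored = list(reversed(r2l_tokens))
    if sequence_log_prob(l2r_lp) >= sequence_log_prob(r2l_lp):
        return l2r_tokens
    return r2l_restored


# Toy inputs, invented for illustration only.
l2r = (["a", "dog", "runs"], [-0.2, -0.5, -0.9])
r2l = (["grass", "on", "runs", "dog", "a"], [-0.1, -0.2, -0.3, -0.2, -0.1])
print(" ".join(select_caption(l2r, r2l)))  # prints "a dog runs on grass"
```

Because selection happens once per sentence, both flows can be decoded side by side and compared only at the end.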
If this is right
- The decoder runs in parallel instead of requiring sequential stages.
- Sentence-level ensemble between the two flows improves final captions.
- Word-level ensemble can be added on top to enlarge the ensemble gain.
- Two-flow self-critical training yields higher scores than the conventional one-flow version.
- The same coupling pattern transfers to an LSTM decoder backbone.
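For the two-flow self-critical item above, a rough sketch may help. Standard self-critical sequence training weights the log-probability of a sampled caption by its reward minus the reward of the greedy caption. The version below applies that rule to each flow and averages the two losses; the averaging, and the use of a separate greedy baseline per flow, are assumptions made for illustration rather than details confirmed by this summary.

```python
from typing import List, Tuple

# (sampled-caption reward, greedy-caption reward, sampled token log-probs)
FlowSample = Tuple[float, float, List[float]]


def scst_loss(sample_reward: float, greedy_reward: float,
              sample_token_log_probs: List[float]) -> float:
    """One-flow self-critical loss: -(r_sample - r_greedy) * log p(sample)."""
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_token_log_probs)


def two_flow_scst_loss(l2r: FlowSample, r2l: FlowSample) -> float:
    """Average the self-critical loss over both flows (assumed combination)."""
    return 0.5 * (scst_loss(*l2r) + scst_loss(*r2l))


# Toy rewards (e.g. CIDEr-like scores) and log-probs, invented for illustration.
loss = two_flow_scst_loss((1.2, 1.0, [-0.5, -0.5]), (0.8, 1.0, [-1.0, -1.0]))
print(round(loss, 6))
```

In a real training loop the log-probabilities would be differentiable tensors and the rewards constants, so only the sampled captions receive gradient.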
Where Pith is reading between the lines
- The approach could be tested on other autoregressive tasks such as machine translation to check whether the same compact coupling reduces the usual cost of bidirectional decoding.
- If the coupling mainly acts as regularization, performance gains should be largest in low-data captioning settings where overfitting is a concern.
- One could measure whether the shared parameters force the two flows to learn complementary rather than redundant features by inspecting their attention patterns on the same image.
Load-bearing premise
Tightly coupling the two directional flows inside one shared model is what actually supplies useful bidirectional regularization rather than merely saving parameters.
What would settle it
Train two independent L2R and R2L models with the same total parameter count and compare their ensemble performance on the MSCOCO test set; if the separate models match or exceed the compact version, the regularization benefit of coupling would not hold.
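Setting up that control requires choosing a width for the two independent models so that their combined size matches the compact model. The sketch below uses a textbook estimate of transformer decoder parameters (self-attention, cross-attention, feed-forward, layer norms, token embeddings); the formula, the function names, and the example sizes (width 512, 6 layers, 10,000-word vocabulary) are illustrative assumptions, not figures from the paper.

```python
def decoder_params(d_model: int, n_layers: int, vocab_size: int,
                   ffn_mult: int = 4) -> int:
    """Rough parameter count for a standard transformer decoder."""
    d_ff = ffn_mult * d_model
    attn = 4 * d_model * d_model + 4 * d_model       # Q, K, V, output + biases
    ffn = 2 * d_model * d_ff + d_ff + d_model        # two linear layers + biases
    norms = 3 * 2 * d_model                          # three layer norms per layer
    per_layer = 2 * attn + ffn + norms               # self-attn + cross-attn
    return n_layers * per_layer + vocab_size * d_model


def matched_width(target_total: int, n_models: int, n_layers: int,
                  vocab_size: int, step: int = 8) -> int:
    """Largest width (multiple of `step`) whose n_models copies fit the budget.

    Falls back to `step` if even the smallest width exceeds the budget.
    """
    budget = target_total / n_models
    best, d = step, step
    while decoder_params(d, n_layers, vocab_size) <= budget:
        best = d
        d += step
    return best


compact = decoder_params(512, 6, 10_000)      # hypothetical compact model
width = matched_width(compact, 2, 6, 10_000)  # width for each separate model
print(compact, width, 2 * decoder_params(width, 6, 10_000))
```

Shrinking the width is only one way to match capacity; reducing depth instead would be an equally defensible control.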
Original abstract
Most current image captioning models typically generate captions from left-to-right. This unidirectional property makes them can only leverage past context but not future context. Though refinement-based models can exploit both past and future context by generating a new caption in the second stage based on pre-retrieved or pre-generated captions in the first stage, the decoder of these models generally consists of two networks (i.e., a retriever or captioner in the first stage and a captioner in the second stage), which can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly while the decoder can be executed parallelly. Specifically, it is implemented by tightly coupling left-to-right (L2R) and right-to-left (R2L) flows into a single compact model to serve as a regularization for implicitly exploiting bidirectional context and optionally allowing explicit interaction of the bidirectional flows, while the final caption is chosen from either L2R or R2L flow in a sentence-level ensemble manner. We conduct extensive ablation studies on MSCOCO benchmark and find that the compact bidirectional architecture and the sentence-level ensemble play more important roles than the explicit interaction mechanism. By combining with word-level ensemble seamlessly, the effect of sentence-level ensemble is further enlarged. We further extend the conventional one-flow self-critical training to the two-flows version under this architecture and achieve new state-of-the-art results in comparison with non-vision-language-pretraining models. Finally, we verify the generality of this compact bidirectional architecture by extending it to LSTM backbone. Source code is available at https://github.com/YuanEZhou/cbtic.
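One plausible way to feed both flows through a single shared decoder is to mark each sequence with a direction token and reverse the caption for the R2L flow. The sketch below builds such training pairs; the marker strings `<l2r>`, `<r2l>`, `<eos>` and the function `two_flow_examples` are hypothetical, since the abstract does not say how the flows are distinguished.

```python
from typing import List, Tuple

L2R, R2L, EOS = "<l2r>", "<r2l>", "<eos>"  # hypothetical special tokens

Pair = Tuple[List[str], List[str]]  # (decoder input, prediction target)


def two_flow_examples(caption_tokens: List[str]) -> List[Pair]:
    """Build teacher-forcing pairs for both flows from one caption."""
    reversed_tokens = caption_tokens[::-1]
    return [
        # L2R flow: predict the caption in reading order.
        ([L2R] + caption_tokens, caption_tokens + [EOS]),
        # R2L flow: predict the same caption from the last word backwards.
        ([R2L] + reversed_tokens, reversed_tokens + [EOS]),
    ]


for decoder_input, target in two_flow_examples(["a", "dog", "runs"]):
    print(decoder_input, "->", target)
```

Both pairs have the same length, so they can sit in one batch and be processed by the shared decoder in a single forward pass, which is consistent with the parallel execution the abstract describes.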
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Compact Bidirectional Transformer (CBT) for image captioning that tightly couples L2R and R2L flows into one compact model to implicitly exploit bidirectional context (with optional explicit interaction), uses sentence-level ensemble for final caption selection, extends self-critical training to two flows, and reports new SOTA results on MSCOCO among non-VLP models. Extensive ablations indicate the compact architecture and ensemble matter more than explicit interaction; the approach also generalizes to LSTM backbones, with source code released.
Significance. If the gains are shown to arise specifically from the bidirectional regularization effect of tight coupling (rather than capacity or ensemble alone), the work would provide an efficient parallelizable alternative to refinement-based or separate bidirectional models. The release of source code, the reported ablations, and the LSTM extension are positive elements that support reproducibility and generality.
major comments (2)
- [Ablations (MSCOCO experiments)] Ablations section (around the MSCOCO experiments): the reported comparisons isolate the role of explicit interaction but do not include a control consisting of two independent unidirectional models whose total parameter count matches the compact bidirectional model. Without this baseline it is not possible to determine whether observed improvements derive from the claimed regularization effect of tight L2R-R2L coupling or from parameter sharing and doubled training signal.
- [Architecture and Ablations] Architecture description and results: the central claim that 'tightly coupling L2R and R2L flows into a single compact model [serves] as a regularization for implicitly exploiting bidirectional context' is load-bearing for the interpretation of the SOTA numbers, yet the ablation tables do not quantify the implicit bidirectional exploitation separately from the ensemble and capacity effects.
minor comments (1)
- [Abstract] The abstract contains minor grammatical issues (e.g., 'makes them can only leverage') that should be corrected for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and suggestions. We address the major comments point by point below.
- Referee: [Ablations (MSCOCO experiments)] Ablations section (around the MSCOCO experiments): the reported comparisons isolate the role of explicit interaction but do not include a control consisting of two independent unidirectional models whose total parameter count matches the compact bidirectional model. Without this baseline it is not possible to determine whether observed improvements derive from the claimed regularization effect of tight L2R-R2L coupling or from parameter sharing and doubled training signal.
  Authors: We agree that a control consisting of two independent unidirectional models with total parameter count matched to the compact bidirectional model would provide stronger evidence to isolate the regularization effect of tight L2R-R2L coupling from capacity and doubled training signal. Our existing ablations compare the compact model against standard single-flow baselines and vary explicit interaction, but do not include this exact matched-capacity control. We will add this baseline ablation in the revised manuscript.
  Revision: yes
- Referee: [Architecture and Ablations] Architecture description and results: the central claim that 'tightly coupling L2R and R2L flows into a single compact model [serves] as a regularization for implicitly exploiting bidirectional context' is load-bearing for the interpretation of the SOTA numbers, yet the ablation tables do not quantify the implicit bidirectional exploitation separately from the ensemble and capacity effects.
  Authors: The ablation studies demonstrate that the compact architecture yields gains beyond explicit interaction alone and that the sentence-level ensemble is a major contributor. We acknowledge that the tables do not provide a direct, separate quantification of the implicit bidirectional exploitation effect independent of capacity and ensemble. We will revise the discussion to more clearly acknowledge this limitation in the current evidence and to temper the interpretation of the central claim accordingly.
  Revision: partial
Circularity Check
No circularity: empirical architecture validated by ablations on external benchmark
The paper proposes a compact bidirectional transformer architecture for image captioning and reports results from ablations on the MSCOCO benchmark, including comparisons of compact vs. non-compact variants, sentence-level ensemble, and two-flow self-critical training. All load-bearing claims (SOTA among non-VLP models, importance of compact coupling and ensemble) rest on direct experimental measurements rather than any derivation that reduces by construction to fitted parameters or self-citations. No mathematical predictions, uniqueness theorems, or ansatzes are invoked that loop back to the paper's own inputs; the work is self-contained against the external MSCOCO test set.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard transformer decoder assumptions for autoregressive sequence generation.