arxiv: 2201.01984 · v3 · submitted 2022-01-06 · 💻 cs.CV · cs.CL

Image Captioning via Compact Bidirectional Architecture

Pith reviewed 2026-05-24 12:16 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords image captioning · bidirectional transformer · compact architecture · sentence-level ensemble · self-critical training · MSCOCO · left-to-right right-to-left flows

The pith

A compact model fuses left-to-right and right-to-left flows to generate image captions using bidirectional context while decoding in parallel.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a Compact Bidirectional Transformer that couples left-to-right and right-to-left generation streams inside one decoder. This coupling acts as regularization so the model can draw on future context without running two separate networks in sequence. Ablation results show the compact coupling and sentence-level choice between the two streams matter more than any added explicit interaction layer. Extending self-critical training to both streams and combining it with word-level ensemble produces new state-of-the-art scores on MSCOCO among models that do not use vision-language pretraining. The same compact design also works when the backbone is switched to an LSTM.
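
As a rough illustration of the word-level ensemble mentioned above, the sketch below averages the next-word distributions of several models at one decoding step and picks the most likely word. The dict-based input format, the function names `word_level_ensemble` and `greedy_word`, and the uniform averaging rule are assumptions made for this example; they are not taken from the paper or its code release.

```python
def word_level_ensemble(step_distributions):
    """Average next-word distributions from several models at one decoding step.

    step_distributions: list of dicts mapping word -> probability, one dict per
    model (hypothetical input format). Words missing from a model count as 0.
    """
    vocab = set()
    for dist in step_distributions:
        vocab.update(dist)
    n = len(step_distributions)
    return {w: sum(d.get(w, 0.0) for d in step_distributions) / n for w in vocab}


def greedy_word(step_distributions):
    """Pick the word with the highest averaged probability."""
    avg = word_level_ensemble(step_distributions)
    return max(avg, key=avg.get)


# Two hypothetical models that disagree on the next word.
model_a = {"dog": 0.6, "cat": 0.4}
model_b = {"dog": 0.2, "cat": 0.8}
chosen = greedy_word([model_a, model_b])
```

In a two-flow setting one would presumably apply this averaging separately inside the L2R flow and inside the R2L flow, then let the sentence-level choice decide between the two resulting captions.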

Core claim

Tightly coupling L2R and R2L flows into a single compact model serves as effective regularization for implicitly exploiting bidirectional context; the final caption is then selected from either flow via sentence-level ensemble, and this architecture supports a two-flow version of self-critical training that reaches new state-of-the-art results on MSCOCO without vision-language pretraining.
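
The sentence-level choice between the two flows can be sketched as follows. This is a minimal sketch that assumes each flow returns its best caption together with a sequence log-probability, and that the higher-scoring caption wins; the scoring rule and the name `sentence_level_ensemble` are assumptions of this example rather than details confirmed by the text above.

```python
def sentence_level_ensemble(l2r_tokens, l2r_logprob, r2l_tokens, r2l_logprob):
    """Pick one caption from the two flows by sequence log-probability.

    r2l_tokens is assumed to be in generation order (last word first), so it
    is reversed before being returned in normal reading order.
    """
    if l2r_logprob >= r2l_logprob:
        return list(l2r_tokens)
    return list(reversed(r2l_tokens))


# Hypothetical outputs of the two flows for the same image.
caption = sentence_level_ensemble(
    ["a", "dog", "runs"], -3.2,
    ["grass", "on", "dog", "a"], -2.5,
)
```

Because the selection happens once per caption, it adds no cost to the decoding loop itself.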

What carries the argument

Compact Bidirectional Transformer that tightly couples L2R and R2L flows into one model to regularize for bidirectional context while allowing parallel execution and sentence-level ensemble selection.
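
One way to picture the coupling is that a single decoder is trained on both reading orders of every caption, with a direction token telling it which flow it is in. The sketch below builds the two teacher-forced training examples for one caption. The special tokens `<l2r>`, `<r2l>`, and `<eos>` and the helper `make_two_flow_examples` are hypothetical names chosen for illustration, and the exact input layout used by the authors may differ.

```python
L2R, R2L, EOS = "<l2r>", "<r2l>", "<eos>"  # hypothetical special tokens


def make_two_flow_examples(caption_tokens):
    """Return (decoder_input, target) pairs for both directions of one caption."""
    forward = list(caption_tokens)
    backward = forward[::-1]
    examples = []
    for start, seq in ((L2R, forward), (R2L, backward)):
        decoder_input = [start] + seq  # teacher-forced input, shifted right
        target = seq + [EOS]           # next-token targets
        examples.append((decoder_input, target))
    return examples


examples = make_two_flow_examples(["a", "dog", "runs"])
```

Both examples can sit in the same batch and share every decoder weight, which is what lets the two flows run in parallel rather than as sequential stages.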

If this is right

  • The decoder runs in parallel instead of requiring sequential stages.
  • Sentence-level ensemble between the two flows improves final captions.
  • Word-level ensemble can be added on top to enlarge the ensemble gain.
  • Two-flow self-critical training yields higher scores than the conventional one-flow version.
  • The same coupling pattern transfers to an LSTM decoder backbone.
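
The two-flow self-critical training named in the list above can be sketched in its simplest form. Self-critical training weights the log-probability of a sampled caption by its reward minus a baseline reward. The sketch assumes each flow has its own sampled caption and baseline (for example a greedy caption's score) and that the two losses are simply averaged; this combination rule and the function names are assumptions, since the text here does not spell out the authors' exact formulation.

```python
def scst_loss(sample_logprob, sample_reward, baseline_reward):
    """REINFORCE-with-baseline surrogate loss for one sampled caption."""
    return -(sample_reward - baseline_reward) * sample_logprob


def two_flow_scst_loss(l2r, r2l):
    """Average the self-critical loss over both flows.

    Each argument is a (sample_logprob, sample_reward, baseline_reward) tuple.
    """
    return 0.5 * (scst_loss(*l2r) + scst_loss(*r2l))


# Hypothetical values: the L2R sample beats its baseline, the R2L sample does not.
loss = two_flow_scst_loss((-4.0, 1.2, 1.0), (-5.0, 0.9, 1.0))
```

Minimizing this loss raises the probability of samples that beat their baseline and lowers it for samples that fall short, in both directions at once.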

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on other autoregressive tasks such as machine translation to check whether the same compact coupling reduces the usual cost of bidirectional decoding.
  • If the coupling mainly acts as regularization, performance gains should be largest in low-data captioning settings where overfitting is a concern.
  • One could measure whether the shared parameters force the two flows to learn complementary rather than redundant features by inspecting their attention patterns on the same image.

Load-bearing premise

Tightly coupling the two directional flows inside one shared model is what actually supplies useful bidirectional regularization rather than merely saving parameters.

What would settle it

Train two independent L2R and R2L models with the same total parameter count and compare their ensemble performance on the MSCOCO test set; if the separate models match or exceed the compact version, the regularization benefit of coupling would not hold.

Figures

Figures reproduced from arXiv: 2201.01984 by Daqing Liu, Huixia Ben, Meng Wang, Richang Hong, Yuanen Zhou, Zhenzhen Hu, Zijie Song.

Figure 1: A conceptual overview of (a) Uni-directional gen...
Figure 2: Illustration of Compact Bidirectional Transformer for Image Captioning (CBTIC). CBTIC model composes of an...
Figure 3: Examples of captions generated by our CBTIC model, conventional unidirectional Transformer model and human...

Original abstract

Most current image captioning models typically generate captions from left-to-right. This unidirectional property makes them can only leverage past context but not future context. Though refinement-based models can exploit both past and future context by generating a new caption in the second stage based on pre-retrieved or pre-generated captions in the first stage, the decoder of these models generally consists of two networks (i.e. a retriever or captioner in the first stage and a captioner in the second stage), which can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly while the decoder can be executed parallelly. Specifically, it is implemented by tightly coupling left-to-right (L2R) and right-to-left (R2L) flows into a single compact model to serve as a regularization for implicitly exploiting bidirectional context and optionally allowing explicit interaction of the bidirectional flows, while the final caption is chosen from either L2R or R2L flow in a sentence-level ensemble manner. We conduct extensive ablation studies on MSCOCO benchmark and find that the compact bidirectional architecture and the sentence-level ensemble play more important roles than the explicit interaction mechanism. By combining with word-level ensemble seamlessly, the effect of sentence-level ensemble is further enlarged. We further extend the conventional one-flow self-critical training to the two-flows version under this architecture and achieve new state-of-the-art results in comparison with non-vision-language-pretraining models. Finally, we verify the generality of this compact bidirectional architecture by extending it to LSTM backbone. Source code is available at https://github.com/YuanEZhou/cbtic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Compact Bidirectional Transformer (CBT) for image captioning that tightly couples L2R and R2L flows into one compact model to implicitly exploit bidirectional context (with optional explicit interaction), uses sentence-level ensemble for final caption selection, extends self-critical training to two flows, and reports new SOTA results on MSCOCO among non-VLP models. Extensive ablations indicate the compact architecture and ensemble matter more than explicit interaction; the approach also generalizes to LSTM backbones, with source code released.

Significance. If the gains are shown to arise specifically from the bidirectional regularization effect of tight coupling (rather than capacity or ensemble alone), the work would provide an efficient parallelizable alternative to refinement-based or separate bidirectional models. The release of source code, the reported ablations, and the LSTM extension are positive elements that support reproducibility and generality.

major comments (2)
  1. [Ablations (MSCOCO experiments)] Ablations section (around the MSCOCO experiments): the reported comparisons isolate the role of explicit interaction but do not include a control consisting of two independent unidirectional models whose total parameter count matches the compact bidirectional model. Without this baseline it is not possible to determine whether observed improvements derive from the claimed regularization effect of tight L2R-R2L coupling or from parameter sharing and doubled training signal.
  2. [Architecture and Ablations] Architecture description and results: the central claim that 'tightly coupling L2R and R2L flows into a single compact model [serves] as a regularization for implicitly exploiting bidirectional context' is load-bearing for the interpretation of the SOTA numbers, yet the ablation tables do not quantify the implicit bidirectional exploitation separately from the ensemble and capacity effects.
minor comments (1)
  1. [Abstract] The abstract contains minor grammatical issues (e.g., 'makes them can only leverage') that should be corrected for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and suggestions. We address the major comments point by point below.

  1. Referee: [Ablations (MSCOCO experiments)] Ablations section (around the MSCOCO experiments): the reported comparisons isolate the role of explicit interaction but do not include a control consisting of two independent unidirectional models whose total parameter count matches the compact bidirectional model. Without this baseline it is not possible to determine whether observed improvements derive from the claimed regularization effect of tight L2R-R2L coupling or from parameter sharing and doubled training signal.

    Authors: We agree that a control consisting of two independent unidirectional models with total parameter count matched to the compact bidirectional model would provide stronger evidence to isolate the regularization effect of tight L2R-R2L coupling from capacity and doubled training signal. Our existing ablations compare the compact model against standard single-flow baselines and vary explicit interaction, but do not include this exact matched-capacity control. We will add this baseline ablation in the revised manuscript. revision: yes

  2. Referee: [Architecture and Ablations] Architecture description and results: the central claim that 'tightly coupling L2R and R2L flows into a single compact model [serves] as a regularization for implicitly exploiting bidirectional context' is load-bearing for the interpretation of the SOTA numbers, yet the ablation tables do not quantify the implicit bidirectional exploitation separately from the ensemble and capacity effects.

    Authors: The ablation studies demonstrate that the compact architecture yields gains beyond explicit interaction alone and that the sentence-level ensemble is a major contributor. We acknowledge that the tables do not provide a direct, separate quantification of the implicit bidirectional exploitation effect independent of capacity and ensemble. We will revise the discussion to more clearly acknowledge this limitation in the current evidence and to temper the interpretation of the central claim accordingly. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical architecture validated by ablations on external benchmark

The paper proposes a compact bidirectional transformer architecture for image captioning and reports results from ablations on the MSCOCO benchmark, including comparisons of compact vs. non-compact variants, sentence-level ensemble, and two-flow self-critical training. All load-bearing claims (SOTA among non-VLP models, importance of compact coupling and ensemble) rest on direct experimental measurements rather than any derivation that reduces by construction to fitted parameters or self-citations. No mathematical predictions, uniqueness theorems, or ansatzes are invoked that loop back to the paper's own inputs; the work is self-contained against the external MSCOCO test set.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the standard transformer and LSTM sequence modeling assumptions plus the MSCOCO benchmark and self-critical sequence training framework; apart from the single domain assumption listed under axioms, no free parameters, further axioms, or invented entities are extractable from the abstract alone.

axioms (1)
  • Domain assumption: standard transformer decoder assumptions for autoregressive sequence generation.
    Invoked as the backbone for the compact bidirectional model.
