Automated Description Generation of Cytologic Findings for Lung Cytological Images Using a Pretrained Vision Model and Dual Text Decoders: Preliminary Study

Atsushi Teramoto; Ayano Michiba; Hiroshi Fujita; Kazuyoshi Imaizumi; Tetsuya Tsukamoto; Yuka Kiriyama

arxiv: 2403.18151 · v2 · submitted 2024-03-26 · 📡 eess.IV · cs.CV· physics.med-ph

Automated Description Generation of Cytologic Findings for Lung Cytological Images Using a Pretrained Vision Model and Dual Text Decoders: Preliminary Study

Atsushi Teramoto , Ayano Michiba , Yuka Kiriyama , Tetsuya Tsukamoto , Kazuyoshi Imaizumi , Hiroshi Fujita This is my paper

Pith reviewed 2026-05-24 02:40 UTC · model grok-4.3

classification 📡 eess.IV cs.CVphysics.med-ph

keywords lung cytologycytologic findingsbenign malignant classificationimage captioningCNNTransformer decoderpulmonary cytology

0 comments

The pith

A CNN classifies lung cytology images as benign or malignant then routes features to one of two specialized Transformer decoders to generate matching cell descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an automated system for pulmonary cytology that first determines whether a cell image is benign or malignant and then produces a written report of the findings. Separate text decoders handle the two cell types so that descriptions stay appropriate to each category. On a set of 801 images the classification step reaches 100 percent sensitivity and 96.4 percent specificity while the generated text scores 0.828 on BLEU-4 and matches expert grammar and style better than single-decoder or LLM baselines.

Core claim

The proposed method consists of a vision model and dual text decoders. In the former, a convolutional neural network (CNN) is used to classify a given image as benign or malignant, and the features related to the image are extracted from the intermediate layer. Independent text decoders for benign and malignant cells are prepared for text generation, and the text decoder switches according to the CNN classification results. The text decoder is configured using a Transformer that uses the features obtained from the CNN for generating findings.

What carries the argument

Dual Transformer text decoders switched by the output of a CNN classifier that also supplies image features from its intermediate layer.

Load-bearing premise

The 801 patch images from 206 patients form a representative dataset that allows the CNN classification to reliably switch between the two text decoders without introducing errors from misclassification or overfitting on this limited sample.

What would settle it

Evaluation on an independent set of lung cytology images from new patients that measures whether text accuracy falls when the CNN misclassifies even a small fraction of cases.

Figures

Figures reproduced from arXiv: 2403.18151 by Atsushi Teramoto, Ayano Michiba, Hiroshi Fujita, Kazuyoshi Imaizumi, Tetsuya Tsukamoto, Yuka Kiriyama.

**Figure 1.** Figure 1: Outline of the proposed report-generation scheme. Vision Model CNN Text Decoder #1 Text Decoder #2 Benign Malignancy Report (Microscopic findings) Grad-CAM Saliency Map [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

read the original abstract

Objective: Cytology plays a crucial role in lung cancer diagnosis. Pulmonary cytology involves cell morphological characterization in the specimen and reporting the corresponding findings, which are extremely burdensome tasks. In this study, we propose a technique to generate cytologic findings from for cytologic images to assist in the reporting of pulmonary cytology. Methods: For this study, 801 patch images were retrieved using cytology specimens collected from 206 patients; the findings were assigned to each image as a dataset for generating cytologic findings. The proposed method consists of a vision model and dual text decoders. In the former, a convolutional neural network (CNN) is used to classify a given image as benign or malignant, and the features related to the image are extracted from the intermediate layer. Independent text decoders for benign and malignant cells are prepared for text generation, and the text decoder switches according to the CNN classification results. The text decoder is configured using a Transformer that uses the features obtained from the CNN for generating findings. Results: The sensitivity and specificity were 100% and 96.4%, respectively, for automated benign and malignant case classification, and the saliency map indicated characteristic benign and malignant areas. The grammar and style of the generated texts were confirmed correct, achieving a BLEU-4 score of 0.828, reflecting high degree of agreement with the gold standard, outperforming existing LLM-based image-captioning methods and single-text-decoder ablation model. Conclusion: Experimental results indicate that the proposed method is useful for pulmonary cytology classification and generation of cytologic findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The dual-decoder routing is a reasonable tweak but the 100% sensitivity on 206 patients is the part that needs a patient-level split check before the numbers can be trusted.

read the letter

The dual decoder switched by CNN classification is the part worth looking at, but the near-perfect benign/malignant routing on 801 patches from 206 patients is the part that needs scrutiny. They pull patches from lung cytology slides, run a CNN to decide benign or malignant and pull features, then feed those to one of two separate Transformer decoders that write the report text. The switch happens based on the CNN output. On their test they report 100% sensitivity and 96.4% specificity for the classification step and a BLEU-4 of 0.828 for the generated text, which beats both some LLM captioners and their own single-decoder version. Saliency maps also line up with expected cell features. The ablation to a single decoder is useful; it shows the benefit of having specialized text generators once the route is correct. The output text is said to match grammar and style of real reports. The soft spot sits in the data handling. Four patches per patient on average means strong correlation within patients. Without an explicit statement that they split at the patient level, the CNN could be learning to recognize the same patient's cells across patches rather than learning general benign versus malignant morphology. A misroute then sends the image to the wrong decoder and the whole text generation result is compromised. Their single-decoder ablation does not test this routing failure. The abstract gives no cross-validation details or error bars either. This is a methods paper aimed at people doing image-to-text work in cytology or similar narrow medical domains. A reader who wants a worked example of a routed captioning model will find one here. It is solid enough to send to peer review; a referee can request the split protocol and perhaps an external validation set. I would not desk-reject it on the current evidence.

Referee Report

3 major / 2 minor

Summary. The paper proposes a dual-decoder architecture for generating cytologic findings from lung cytological patch images: a CNN classifies each image as benign or malignant (and extracts features), then routes the image to one of two independent Transformer text decoders. On a dataset of 801 patches from 206 patients the method reports 100% sensitivity / 96.4% specificity for the routing step and a BLEU-4 score of 0.828 for the generated text, outperforming both existing LLM captioning baselines and a single-decoder ablation.

Significance. If the reported performance can be reproduced under patient-level splitting and external validation, the dual-decoder design offers a concrete architectural idea for conditioning text generation on an upstream diagnostic label. The preliminary nature of the study and the modest dataset size, however, constrain the immediate clinical or methodological impact.

major comments (3)

[Abstract] Abstract / Methods: No description is given of the train/validation/test split strategy or whether splits were performed at the patient level. With only ~4 patches per patient, intra-patient correlation could inflate the reported 100% sensitivity and 96.4% specificity for CNN routing; this directly affects the validity of the dual-decoder claim.
[Results] Results: The manuscript provides no cross-validation procedure, confidence intervals, or statistical comparison against the single-decoder ablation. Without these, it is impossible to assess whether the BLEU-4 improvement of 0.828 is robust or merely an artifact of the limited sample.
[Methods] Methods: The paper does not state how the CNN classification threshold was chosen or whether it was tuned on the same data used to evaluate the text decoders. Any leakage here would propagate directly into the reported text-generation scores.

minor comments (2)

[Abstract] Abstract contains a typographical error: “generate cytologic findings from for cytologic images.”
[Results] The saliency-map analysis is mentioned but no quantitative overlap metric with pathologist annotations is supplied.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify important gaps in methodological transparency. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional analyses.

read point-by-point responses

Referee: [Abstract] Abstract / Methods: No description is given of the train/validation/test split strategy or whether splits were performed at the patient level. With only ~4 patches per patient, intra-patient correlation could inflate the reported 100% sensitivity and 96.4% specificity for CNN routing; this directly affects the validity of the dual-decoder claim.

Authors: We agree that the absence of a described splitting strategy is a limitation in the current manuscript. We will revise the Methods section to explicitly state that the 801 patches from 206 patients were partitioned at the patient level, with patients randomly allocated to training, validation, and test sets (approximately 70/15/15) such that no patches from the same patient appear across different sets. This directly mitigates the risk of inflated metrics due to intra-patient correlation. revision: yes
Referee: [Results] Results: The manuscript provides no cross-validation procedure, confidence intervals, or statistical comparison against the single-decoder ablation. Without these, it is impossible to assess whether the BLEU-4 improvement of 0.828 is robust or merely an artifact of the limited sample.

Authors: We acknowledge that the current manuscript lacks cross-validation, confidence intervals, and formal statistical comparison to the single-decoder ablation. We will update the Results section to report patient-level cross-validation, 95% confidence intervals for the BLEU-4 and classification metrics, and a statistical test (e.g., paired comparison) demonstrating whether the dual-decoder improvement is significant. revision: yes
Referee: [Methods] Methods: The paper does not state how the CNN classification threshold was chosen or whether it was tuned on the same data used to evaluate the text decoders. Any leakage here would propagate directly into the reported text-generation scores.

Authors: We agree that the threshold selection process must be clarified to rule out leakage. We will revise the Methods section to state that the CNN threshold was selected on the validation set (to achieve high sensitivity), with the test set held out entirely for final evaluation of both classification and text generation. This ensures the reported text-generation scores are not affected by threshold tuning on evaluation data. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical ML pipeline with no derivations

full rationale

The paper describes a standard supervised learning setup: a CNN is trained to classify patches as benign/malignant and to extract features, then two separate Transformer decoders are trained on the respective subsets to generate text. All reported numbers (100% sensitivity, 96.4% specificity, BLEU-4 = 0.828) are direct empirical evaluation metrics on the 801-patch dataset; no equations, uniqueness theorems, or self-citations are used to derive or force any result. The architecture choices are explicit design decisions, not outputs of a prior self-referential step. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim depends on the quality and representativeness of the 801-image dataset and the effectiveness of the dual-decoder architecture over alternatives; no invented entities are introduced.

free parameters (2)

benign/malignant classification threshold
Chosen to achieve the reported 100% sensitivity and 96.4% specificity on the dataset
Transformer model size and training parameters
Standard hyperparameters in deep learning models not detailed in the abstract

axioms (1)

domain assumption The cytologic findings labels assigned to each image are accurate and consistent ground truth
Used as ground truth for both classification training and text generation evaluation

pith-pipeline@v0.9.0 · 5850 in / 1480 out tokens · 36339 ms · 2026-05-24T02:40:39.959419+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 3 internal anchors

[1]

Cancer facts and figures 2023

American Cancer Society, “Cancer facts and figures 2023”. Available at: https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annual-cancer-facts-and-figures/2023/2023-cancer-facts-and-figures.pdf

work page 2023
[2]

DeepPap: Deep convolutional networks for cervical cell classification,

L. Zhang et al., “DeepPap: Deep convolutional networks for cervical cell classification,” IEEE J. Biomed. Health Inform., vol. 21, no. 6, pp. 1633–1643, 2017, doi: 10.1109/JBHI.2017.2705583

work page doi:10.1109/jbhi.2017.2705583 2017
[3]

Nasal cytology with deep learning techniques,

G. Dimauro et al., “Nasal cytology with deep learning techniques,” Int. J. Med. Inform., vol. 122, pp. 13–19, 2019, doi: 10.1016/j.ijmedinf.2018.11.010

work page doi:10.1016/j.ijmedinf.2018.11.010 2019
[4]

BFCNet: a CNN for diagnosis of ductal carcinoma in breast from cytology images,

A. Bal, et al., “BFCNet: a CNN for diagnosis of ductal carcinoma in breast from cytology images,” Pattern Anal. Appl., vol. 24, no. 3, pp. 967–980, 2021, doi: 10.1007/s10044-021-00962-4

work page doi:10.1007/s10044-021-00962-4 2021
[5]

Automated classification of lung cancer types from cytological images using deep convolutional neural networks,

A. Teramoto et al., “Automated classification of lung cancer types from cytological images using deep convolutional neural networks,” BioMed Res. Int., vol. 2017, pp. 4067832, 2017, doi: 10.1155/2017/4067832

work page doi:10.1155/2017/4067832 2017
[6]

Comparison of fine-tuned deep convolutional neural networks for the automated classification of lung cancer cytology images with integration of additional classifiers,

T. Tsukamoto et al., “Comparison of fine-tuned deep convolutional neural networks for the automated classification of lung cancer cytology images with integration of additional classifiers,” Asian Pac. J. Cancer Prev., vol. 23, no. 4, pp. 1315–1324, 2022, doi: 10.31557/APJCP.2022.23.4.1315

work page doi:10.31557/apjcp.2022.23.4.1315 2022
[7]

Automated classification of benign and malignant cells from lung cytological images using deep convolutional neural network,

A. Teramoto et al., “Automated classification of benign and malignant cells from lung cytological images using deep convolutional neural network,” Inform. Med. Unlocked, vol. 16, p. 100205, 2019, doi: 10.1016/j.imu.2019.100205

work page doi:10.1016/j.imu.2019.100205 2019
[8]

A. Teramoto et al., “Deep learning approach to classification of lung cytological images: Two-step training using actual and synthesized images by progressive growing of generative adversarial networks,” PLOS ONE, vol. 15, no. 3, p. e0229951, 2020, doi: 10.1371/journal.pone.0229951

work page doi:10.1371/journal.pone.0229951 2020
[9]

TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays

X. Wang et al., “TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays” in, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Jun. 2018, 2018, pp. 9049–9058, doi: 10.1109/CVPR.2018.00943

work page doi:10.1109/cvpr.2018.00943 2018
[10]

RATCHET: Medical Transformer for chest X-ray diagnosis and reporting

B. Hou et al., “RATCHET: Medical Transformer for chest X-ray diagnosis and reporting” in Med. Image Comput. Comput. Assist. Interv. MICCAI, M. de Bruijne, P. C. Cattin, S. Cotin, N. Padoy, S. Speidel, Y. Zheng and C. Essert, Eds. Cham: Springer International Publishing, vol. 2021, pp. 293–303,

work page 2021
[11]

GNNFormer: A Graph-based Framework for Cytopathology Report Generation,

Y. F. Zhou et al., “GNNFormer: A Graph-based Framework for Cytopathology Report Generation,” arXiv, Available at: arXiv:2303.09956,

work page arXiv
[12]

The World Health Organization reporting system for lung cytopathology,

F. C. Schmitt, et al., “The World Health Organization reporting system for lung cytopathology,” Acta Cytol., vol. 67, no. 1, pp. 80–91, 2023, doi: 10.1159/000527580

work page doi:10.1159/000527580 2023
[13]

Very deep convolutional networks for large-scale image recognition

K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition" in International Conference on Learning Representations, vol. 2015,

work page 2015
[14]

Going deeper with convolutions,

C. Szegedy, et al., “Going deeper with convolutions” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2015, 2015, pp. 1–9, doi: 10.1109/CVPR.2015.7298594

work page doi:10.1109/cvpr.2015.7298594 2015
[15]

Deep residual learning for image recognition,

K. He et al., “Deep residual learning for image recognition” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2016, 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016
[16]

Densely connected convolutional networks,

G. Huang et al., “Densely connected convolutional networks” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2017, 2017, pp. 2261–2269, doi: 10.1109/CVPR.2017.243

work page doi:10.1109/cvpr.2017.243 2017
[17]

Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra

R. R. Selvaraju et al., "Grad-CAM: Visual explanations from deep networks via gradient-based localization" in IEEE International Conference on Computer Vision (ICCV), vol. 2017, 2017, pp. 618–626, doi: 10.1109/ICCV.2017.74

work page doi:10.1109/iccv.2017.74 2017
[18]

IEEE (pp

R. Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation" in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2015, 2014, pp. 580–587, doi: 10.1109/CVPR.2014.81

work page doi:10.1109/cvpr.2014.81 2015
[19]

doi:10.3115/1073083.1073135 , editor =

K. Papineni et al., "Bleu: A method for automatic evaluation of machine translation" in Annual Meeting of the Association for Computational Linguistics, pp. 1106–1114, 2001, doi: 10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2001
[20]

Meteor universal: Language specific translation evaluation for any target language

M. Denkowski and A. Lavie, "Meteor universal: Language specific translation evaluation for any target language" in EACL Workshop on Statistical Machine Translation, 2014, pp. 376–380, doi: 10.3115/v1/W14-3348

work page doi:10.3115/v1/w14-3348 2014
[21]

Rouge: A package for automatic evaluation of summaries

C. Y. Lin, "Rouge: A package for automatic evaluation of summaries" in Text Summarization Branches Out, 2004, pp. 74–81

work page 2004
[22]

Zhang, Y

R. Vedantam et al., "Cider: Consensus-based image description evaluation" in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575, doi: 10.1109/CVPR.2015.7299087

work page doi:10.1109/cvpr.2015.7299087 2015
[23]

Spice: Semantic propositional image caption evaluation

P. Anderson et al., "Spice: Semantic propositional image caption evaluation" in European Conference on Computer Vision, 2016, pp. 382–398

work page 2016
[24]

GIT: A Generative Image-to-text Transformer for Vision and Language

J. Wang et al., "GIT: A Generative Image-to-text Transformer for Vision and Language," arXiv, Available at: arXiv:2205.14100v5,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

J. Li et al., BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," arXiv., Available at: arXiv:2301.12597v3,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

OPT: Open Pre-trained Transformer Language Models

S. Zhang, et al., "OPT: Open Pre-trained Transformer Language Models," arXiv, Available at: arXiv:2205.01068v4,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Cancer facts and figures 2023

American Cancer Society, “Cancer facts and figures 2023”. Available at: https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annual-cancer-facts-and-figures/2023/2023-cancer-facts-and-figures.pdf

work page 2023

[2] [2]

DeepPap: Deep convolutional networks for cervical cell classification,

L. Zhang et al., “DeepPap: Deep convolutional networks for cervical cell classification,” IEEE J. Biomed. Health Inform., vol. 21, no. 6, pp. 1633–1643, 2017, doi: 10.1109/JBHI.2017.2705583

work page doi:10.1109/jbhi.2017.2705583 2017

[3] [3]

Nasal cytology with deep learning techniques,

G. Dimauro et al., “Nasal cytology with deep learning techniques,” Int. J. Med. Inform., vol. 122, pp. 13–19, 2019, doi: 10.1016/j.ijmedinf.2018.11.010

work page doi:10.1016/j.ijmedinf.2018.11.010 2019

[4] [4]

BFCNet: a CNN for diagnosis of ductal carcinoma in breast from cytology images,

A. Bal, et al., “BFCNet: a CNN for diagnosis of ductal carcinoma in breast from cytology images,” Pattern Anal. Appl., vol. 24, no. 3, pp. 967–980, 2021, doi: 10.1007/s10044-021-00962-4

work page doi:10.1007/s10044-021-00962-4 2021

[5] [5]

Automated classification of lung cancer types from cytological images using deep convolutional neural networks,

A. Teramoto et al., “Automated classification of lung cancer types from cytological images using deep convolutional neural networks,” BioMed Res. Int., vol. 2017, pp. 4067832, 2017, doi: 10.1155/2017/4067832

work page doi:10.1155/2017/4067832 2017

[6] [6]

Comparison of fine-tuned deep convolutional neural networks for the automated classification of lung cancer cytology images with integration of additional classifiers,

T. Tsukamoto et al., “Comparison of fine-tuned deep convolutional neural networks for the automated classification of lung cancer cytology images with integration of additional classifiers,” Asian Pac. J. Cancer Prev., vol. 23, no. 4, pp. 1315–1324, 2022, doi: 10.31557/APJCP.2022.23.4.1315

work page doi:10.31557/apjcp.2022.23.4.1315 2022

[7] [7]

Automated classification of benign and malignant cells from lung cytological images using deep convolutional neural network,

A. Teramoto et al., “Automated classification of benign and malignant cells from lung cytological images using deep convolutional neural network,” Inform. Med. Unlocked, vol. 16, p. 100205, 2019, doi: 10.1016/j.imu.2019.100205

work page doi:10.1016/j.imu.2019.100205 2019

[8] [8]

A. Teramoto et al., “Deep learning approach to classification of lung cytological images: Two-step training using actual and synthesized images by progressive growing of generative adversarial networks,” PLOS ONE, vol. 15, no. 3, p. e0229951, 2020, doi: 10.1371/journal.pone.0229951

work page doi:10.1371/journal.pone.0229951 2020

[9] [9]

TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays

X. Wang et al., “TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays” in, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, Jun. 2018, 2018, pp. 9049–9058, doi: 10.1109/CVPR.2018.00943

work page doi:10.1109/cvpr.2018.00943 2018

[10] [10]

RATCHET: Medical Transformer for chest X-ray diagnosis and reporting

B. Hou et al., “RATCHET: Medical Transformer for chest X-ray diagnosis and reporting” in Med. Image Comput. Comput. Assist. Interv. MICCAI, M. de Bruijne, P. C. Cattin, S. Cotin, N. Padoy, S. Speidel, Y. Zheng and C. Essert, Eds. Cham: Springer International Publishing, vol. 2021, pp. 293–303,

work page 2021

[11] [11]

GNNFormer: A Graph-based Framework for Cytopathology Report Generation,

Y. F. Zhou et al., “GNNFormer: A Graph-based Framework for Cytopathology Report Generation,” arXiv, Available at: arXiv:2303.09956,

work page arXiv

[12] [12]

The World Health Organization reporting system for lung cytopathology,

F. C. Schmitt, et al., “The World Health Organization reporting system for lung cytopathology,” Acta Cytol., vol. 67, no. 1, pp. 80–91, 2023, doi: 10.1159/000527580

work page doi:10.1159/000527580 2023

[13] [13]

Very deep convolutional networks for large-scale image recognition

K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition" in International Conference on Learning Representations, vol. 2015,

work page 2015

[14] [14]

Going deeper with convolutions,

C. Szegedy, et al., “Going deeper with convolutions” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2015, 2015, pp. 1–9, doi: 10.1109/CVPR.2015.7298594

work page doi:10.1109/cvpr.2015.7298594 2015

[15] [15]

Deep residual learning for image recognition,

K. He et al., “Deep residual learning for image recognition” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2016, 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016

[16] [16]

Densely connected convolutional networks,

G. Huang et al., “Densely connected convolutional networks” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2017, 2017, pp. 2261–2269, doi: 10.1109/CVPR.2017.243

work page doi:10.1109/cvpr.2017.243 2017

[17] [17]

Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra

R. R. Selvaraju et al., "Grad-CAM: Visual explanations from deep networks via gradient-based localization" in IEEE International Conference on Computer Vision (ICCV), vol. 2017, 2017, pp. 618–626, doi: 10.1109/ICCV.2017.74

work page doi:10.1109/iccv.2017.74 2017

[18] [18]

IEEE (pp

R. Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation" in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2015, 2014, pp. 580–587, doi: 10.1109/CVPR.2014.81

work page doi:10.1109/cvpr.2014.81 2015

[19] [19]

doi:10.3115/1073083.1073135 , editor =

K. Papineni et al., "Bleu: A method for automatic evaluation of machine translation" in Annual Meeting of the Association for Computational Linguistics, pp. 1106–1114, 2001, doi: 10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2001

[20] [20]

Meteor universal: Language specific translation evaluation for any target language

M. Denkowski and A. Lavie, "Meteor universal: Language specific translation evaluation for any target language" in EACL Workshop on Statistical Machine Translation, 2014, pp. 376–380, doi: 10.3115/v1/W14-3348

work page doi:10.3115/v1/w14-3348 2014

[21] [21]

Rouge: A package for automatic evaluation of summaries

C. Y. Lin, "Rouge: A package for automatic evaluation of summaries" in Text Summarization Branches Out, 2004, pp. 74–81

work page 2004

[22] [22]

Zhang, Y

R. Vedantam et al., "Cider: Consensus-based image description evaluation" in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575, doi: 10.1109/CVPR.2015.7299087

work page doi:10.1109/cvpr.2015.7299087 2015

[23] [23]

Spice: Semantic propositional image caption evaluation

P. Anderson et al., "Spice: Semantic propositional image caption evaluation" in European Conference on Computer Vision, 2016, pp. 382–398

work page 2016

[24] [24]

GIT: A Generative Image-to-text Transformer for Vision and Language

J. Wang et al., "GIT: A Generative Image-to-text Transformer for Vision and Language," arXiv, Available at: arXiv:2205.14100v5,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

J. Li et al., BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," arXiv., Available at: arXiv:2301.12597v3,

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

OPT: Open Pre-trained Transformer Language Models

S. Zhang, et al., "OPT: Open Pre-trained Transformer Language Models," arXiv, Available at: arXiv:2205.01068v4,

work page internal anchor Pith review Pith/arXiv arXiv