BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

Bruno Martins; Chrysoula Zerva; Gon\c{c}alo Gomes

arxiv: 2605.21728 · v1 · pith:WKZXOXQUnew · submitted 2026-05-20 · 💻 cs.CV · cs.CL· cs.LG

BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

Gon\c{c}alo Gomes , Bruno Martins , Chrysoula Zerva This is my paper

Pith reviewed 2026-05-22 08:58 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.LG

keywords reference-free evaluationimage captioningcross-encoder modelvision-language modelsadversarial data augmentationefficient benchmarkingcaption quality assessmentVQA initialization

0 comments

The pith

A lightweight cross-encoder model trained on adversarial data achieves state-of-the-art reference-free evaluation of image captions at low computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEiTScore, a learned metric that scores how well an image caption matches its image without needing reference captions. It initializes a compact cross-encoder from a visual question-answering checkpoint and trains it on a mix of data that includes adversarial examples generated by large language models to catch subtle visual-linguistic mistakes. This design aims to overcome the token limits and coarse judgments of CLIP encoders while avoiding the high expense of using LLMs directly as judges. A new benchmark tests performance on detailed, long-form captions across varied scenarios. If the approach works, it supplies a practical tool for large-scale model comparison, guided decoding, and reinforcement learning signals in vision-language systems.

Core claim

The central claim is that a cross-encoder model, started from a visual question-answering checkpoint and trained on a carefully mixed dataset containing adversarial LLM augmentations, produces reference-free caption scores that reach state-of-the-art accuracy while remaining efficient enough for repeated use in benchmarking and training loops.

What carries the argument

The lightweight cross-encoder model, initialized from a visual question-answering checkpoint, that directly processes image-text pairs to output a quality score.

If this is right

Large-scale benchmarking of captioning models becomes feasible without prohibitive compute.
Quality-aware decoding during generation can use the metric as a direct signal.
Reinforcement learning or reward modeling for vision-language models gains a practical, efficient reward function.
Evaluation of long-form and context-rich captions improves over bag-of-words style encoders.
The introduced benchmark provides a standardized testbed for detailed caption assessment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar lightweight cross-encoder setups could be tested for evaluating other vision-language outputs such as visual question answers or generated images.
The adversarial augmentation strategy might transfer to training evaluators in adjacent domains like video captioning or multimodal dialogue.
If the efficiency holds, the metric could support on-device or repeated inference in production captioning pipelines.
Extending the initialization idea to other pretrained multimodal checkpoints could yield further gains without increasing model size.

Load-bearing premise

The mixture of training data that includes adversarial LLM-based augmentations is what gives the model its sensitivity to fine-grained visual-linguistic mismatches.

What would settle it

If the model assigns high scores to captions that contain clear fine-grained errors such as wrong object attributes, incorrect spatial relations, or missing context on the new benchmark, while lower-scoring correct captions exist, the performance advantage would be refuted.

Figures

Figures reproduced from arXiv: 2605.21728 by Bruno Martins, Chrysoula Zerva, Gon\c{c}alo Gomes.

**Figure 1.** Figure 1: BEiTScore versus state-of-the-art metrics on an instance from Nebula [19]. there have been significant advances in terms of learned evaluation metrics, the current encoder-based metrics still struggle with these requirements. Recent encoder-based approaches either rely on reference captions to assist evaluation within the textual domain [4, 19, 32] or use CLIP [9]-based encoders. Despite achieving strong c… view at source ↗

**Figure 2.** Figure 2: BEiTScore architecture, and adversarial data augmentation strategy. 3.1 Adversarial Data Augmentation Strategy Building on the aforementioned idea, we augmented existing datasets by explicitly focusing on pairing images with detailed textual descriptions [24, 27–29], and repurposing them to build a binary pairwise training dataset. Specifically, we prompt a LLM4 to generate factually incorrect but fluent … view at source ↗

**Figure 3.** Figure 3: Two qualitative examples from Winoground benchmark. ity and counting), correctly classifying named spatial relations between objects (relations), and distinguishing actions while identifying their participants (actions). These tasks were previously very challenging for encoder-based methods, as highlighted in the original benchmark and confirmed by our results. Results on the Winoground Benchmark: Winogro… view at source ↗

**Figure 4.** Figure 4: Qualitative example from LongCapVLCP, with an Image and original caption from DOCCI. The results in [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Caption length distributions (i.e., the number of words per caption) for the three datasets. Each histogram shows the number of instances containing captions with a number of words that falls within each bin interval values. The maximum caption lengths are as follows: 176 words for the training dataset, 270 words for the validation dataset, and 525 words for the LongCapVLCP benchmark [PITH_FULL_IMAGE:figu… view at source ↗

read the original abstract

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as ``bags-of-words.'' We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BEiTScore gives a workable, efficient cross-encoder metric for detailed captions by starting from VQA weights and adding adversarial LLM augmentations, and the full paper supplies reproducible details without internal contradictions.

read the letter

The main point is that this paper delivers a lighter learned metric for reference-free caption evaluation that tries to fix the usual problems with CLIP-style encoders while avoiding the cost of LLM-as-judge setups. It initializes a cross-encoder from a VQA checkpoint and trains on a data mix that includes adversarial examples generated by LLMs to improve sensitivity to fine-grained visual-linguistic mismatches. They also release a new benchmark aimed at long-form and context-rich descriptions. The full manuscript gives concrete implementation details on the training mixture, initialization, and efficiency measurements, which makes the setup look reproducible on its own terms. Reported results position it ahead of prior learned metrics on the new benchmark while keeping inference fast enough for large-scale use or reward modeling. That combination of initialization choice and augmentation strategy is the clearest incremental advance here. The efficiency numbers and baseline comparisons line up without obvious inconsistencies, and the paper does not appear to be fitting performance to quantities defined only inside the work itself. On the softer side, the new benchmark's construction details and diversity checks could still benefit from closer examination during review, since any new test set carries some risk of being tuned to the method. The gains are attributed to the adversarial component, but the experiments would be stronger with a fuller set of ablations showing exactly how much that piece moves the needle versus the VQA initialization alone. Nothing here looks like a central flaw that would collapse the claims. This is aimed at researchers who build or evaluate vision-language models and need an automatic metric that is more sensitive than standard encoders but cheaper than full LLM judging. A reader working on captioning systems, multimodal training loops, or evaluation protocols would find the metric and benchmark directly usable. I would send it out for peer review; the grounding and execution are solid enough to merit community feedback even if some sections need tightening.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces BEiTScore, a reference-free image captioning evaluation metric based on a lightweight cross-encoder initialized from a visual question-answering checkpoint. Training uses a data mixture that incorporates adversarial LLM-based augmentations to increase sensitivity to fine-grained visual-linguistic mismatches. A new benchmark for detailed caption evaluation is presented, and experiments report state-of-the-art results together with efficiency suitable for large-scale use, quality-aware decoding, and reward modeling.

Significance. If the performance claims hold under scrutiny, the work supplies a practical middle ground between expensive LLM judges and limited CLIP-style encoders. The VQA initialization combined with targeted adversarial training offers a concrete route to improved compositional sensitivity at modest compute cost. Credit is given for the reproducible training recipe and efficiency measurements that directly address downstream usability.

major comments (2)

[§4] §4 (Experimental Results): The central SOTA claim rests on comparisons to baselines; an ablation isolating the contribution of the adversarial LLM augmentations versus the base data mixture is required to confirm that the reported gains are attributable to the proposed training scheme rather than initialization or data scale alone.
[§5] §5 (New Benchmark): The benchmark construction details (scenario selection, annotation protocol, and diversity controls) are load-bearing for the claim that the metric generalizes across detailed captioning scenarios; without them, it is difficult to rule out benchmark-specific biases favoring the proposed model.

minor comments (3)

[Table 2] Table 2: Report standard deviations or confidence intervals alongside mean scores to allow readers to assess whether the observed margins over baselines are statistically meaningful.
[Figure 2] Figure 2: The efficiency plot would benefit from explicit hardware specifications and direct side-by-side timing against the strongest LLM baseline under identical conditions.
[§3.1] §3.1: Clarify the exact token-length handling and any truncation strategy used by the cross-encoder to address the token-limit limitation mentioned in the introduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation of minor revision. We address each major comment below and will revise the manuscript accordingly to incorporate the requested analyses and details.

read point-by-point responses

Referee: [§4] §4 (Experimental Results): The central SOTA claim rests on comparisons to baselines; an ablation isolating the contribution of the adversarial LLM augmentations versus the base data mixture is required to confirm that the reported gains are attributable to the proposed training scheme rather than initialization or data scale alone.

Authors: We agree that an explicit ablation is needed to isolate the contribution of the adversarial LLM augmentations. The original manuscript describes the full training mixture but does not report a controlled comparison against the base data mixture alone. In the revised version we will add this ablation, training an otherwise identical model on the base mixture without the adversarial augmentations and reporting the resulting drop in performance on the detailed caption benchmarks. This will directly attribute the observed gains to the proposed augmentation strategy rather than initialization or data scale. revision: yes
Referee: [§5] §5 (New Benchmark): The benchmark construction details (scenario selection, annotation protocol, and diversity controls) are load-bearing for the claim that the metric generalizes across detailed captioning scenarios; without them, it is difficult to rule out benchmark-specific biases favoring the proposed model.

Authors: We acknowledge that additional transparency on benchmark construction is required to support claims of generalizability. While §5 outlines the new benchmark, it does not fully specify the scenario selection process, annotation protocol, or diversity controls. In the revision we will expand this section with these details, including the criteria used to select scenarios, the protocol for obtaining reference judgments, and the steps taken to ensure diversity and reduce potential biases. This will strengthen the evidence that performance gains are not benchmark-specific. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical ML contribution: a lightweight cross-encoder metric trained via supervised learning on an assembled data mixture that includes adversarial LLM augmentations, initialized from a VQA checkpoint, and evaluated on a newly introduced benchmark. No derivation chain, equations, or first-principles results are present that reduce by construction to fitted parameters or self-citations. The reported SOTA performance and efficiency claims rest on external benchmark comparisons and reproducible training details rather than any self-referential definition or prediction that is statistically forced by the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to the explicit modeling choices stated there; the central performance claim rests on the effectiveness of the chosen initialization and data mixture.

axioms (1)

domain assumption Initialization from a visual question-answering model checkpoint provides a strong yet efficient starting point for the caption evaluation task.
The abstract states the model is initialized from a VQA checkpoint to balance strong weight initialization with computational efficiency.

pith-pipeline@v0.9.0 · 5729 in / 1400 out tokens · 45124 ms · 2026-05-22T08:58:08.791454+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

lightweight cross-encoder ... initialized from a visual question-answering model checkpoint ... adversarial LLM-based data augmentations ... binary cross-entropy (BCE) loss ... L1 loss

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

[1]

From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge

Aditya, S., Yang, Y., Baral, C., Fermuller, C., Aloimonos, Y.: From Images to Sentences Through Scene Description Graphs Using Commonsense Reasoning and Knowledge. arXiv preprint arXiv:1511.03292 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[2]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)

Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., Anderson, P.: Nocaps: Novel Object Captioning at Scale. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)

work page 2019
[3]

In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2023)

Chan, D., Petryk, S., Gonzalez, J., Darrell, T., Canny, J.: CLAIR: Evaluating Image Captions with Large Language Models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2023)

work page 2023
[4]

In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2025)

Chen, X., Salazar, I., Kementchedjhieva, Y.: SPECS: Specificity-Enhanced CLIP- Score for Long Image Caption Evaluation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2025)

work page 2025
[5]

In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference (2025)

Feng, Y., Wen, C., Peng, Z., Zhu, S., et al.: Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation. In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference (2025)

work page 2025
[6]

Garg, R., Burns, A., Karagol-Ayan, B., Bitton, Y., Montgomery, C., Onoe, Y., Bunner, A., Krishna, R., Baldridge, J.M., Soricut, R.: Imageinwords: Unlocking Hyper-DetailedImageDescriptions.In:ProceedingsoftheConferenceonEmpirical Methods in Natural Language Processing (2024)

work page 2024
[7]

Gomes, G.E.C., Zerva, C., Martins, B.: Evaluation of Multilingual Image Caption- ing: How Far Can We Get with CLIP Models? In: Findings of the Association for Computational Linguistics (2025)

work page 2025
[8]

arXiv preprint arXiv:2412.18150 (2024)

Han, S., Fan, H., Fu, J., Li, L., Li, T., Cui, J., Wang, Y., Tai, Y., Sun, J., Guo, C., et al.: Evalmuse-40k: A Reliable and Fine-Grained Benchmark with Comprehen- sive Human Annotations for Text-to-Image Generation Model Evaluation. arXiv preprint arXiv:2412.18150 (2024)

work page arXiv 2024
[9]

In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2021)

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: CLIPscore: A Reference-Free Evaluation Metric for Image Captioning. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2021)

work page 2021
[10]

Journal of Artificial Intelligence Research (2013)

Hodosh, M., Young, P., Hockenmaier, J.: Framing Image Description as a Rank- ing Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research (2013)

work page 2013
[11]

Advances in neural information processing systems (2023)

Hsieh, C.Y., Zhang, J., Ma, Z., Kembhavi, A., Krishna, R.: Sugarcrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality. Advances in neural information processing systems (2023)

work page 2023
[12]

In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2023)

Hu, A., Chen, S., Zhang, L., Jin, Q.: InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2023)

work page 2023
[13]

In: Findings of the Annual Meeting of the Association for Computational Linguistics (2025)

Kim, H., Kim, S., Jeong, J., Cho, Y., Cho, S.: EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations. In: Findings of the Annual Meeting of the Association for Computational Linguistics (2025)

work page 2025
[14]

In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2024)

Lee, Y., Park, I., Kang, M.: FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2024)

work page 2024
[15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 16 G

Li, B., Lin, Z., Pathak, D., Li, J., Fei, Y., Wu, K., Xia, X., Zhang, P., Neubig, G., Ramanan, D.: Evaluating and Improving Compositional Text-to-Visual Gen- eration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 16 G. Gomes et al

work page 2024
[16]

In: Proceedings of the European Conference on Computer Vision (2014)

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: Proceedings of the European Conference on Computer Vision (2014)

work page 2014
[17]

Springer (2024)

Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., Ramanan, D.: EvaluatingText-to-VisualGenerationwithImage-to-TextGeneration.In:Proceed- ings of the European Conference on Computer Vision. Springer (2024)

work page 2024
[18]

arXiv preprint arXiv:2402.11572 (2024)

Ma, Z., Wang, C., Ouyang, Y., Zhao, F., Zhang, J., Huang, S., Chen, J.: Cobra Ef- fect in Reference-Free Image Captioning Metrics. arXiv preprint arXiv:2402.11572 (2024)

work page arXiv 2024
[19]

In: Proceedings of the Asian Conference on Computer Vision (2024)

Matsuda, K., Wada, Y., Sugiura, K.: DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning. In: Proceedings of the Asian Conference on Computer Vision (2024)

work page 2024
[20]

Advances in Neural Information Processing Systems (2023)

Narins, L.D., Scott, A., Gautam, A., Kulkarni, A., Castanon, M., Kao, B., Ihorn, S., Siu, Y.T., Mason, J.M., Blum, A., et al.: Validated Image Caption Rating Dataset. Advances in Neural Information Processing Systems (2023)

work page 2023
[21]

In: Proceedings to the European Conference on Computer Vision (2024)

Onoe, Y., Rane, S., Berger, Z., Bitton, Y., Cho, J., Garg, R., Ku, A., Parekh, Z., Pont-Tuset, J., Tanzer, G., et al.: DOCCI: Descriptions of Connected and Con- trasting Images. In: Proceedings to the European Conference on Computer Vision (2024)

work page 2024
[22]

In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2022)

Parcalabescu, L., Cafagna, M., Muradjan, L., Frank, A., Calixto, I., Gatt, A.: VALSE: A Task-Independent Benchmark for Vision and Language Models Cen- tered on Linguistic Phenomena. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2022)

work page 2022
[23]

In: Pro- ceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (2024)

Petryk, S., Chan, D., Kachinthaya, A., Zou, H., Canny, J., Gonzalez, J., Darrell, T.: ALOHa: A Nnew Measure for Hallucination in Captioning Models. In: Pro- ceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (2024)

work page 2024
[24]

In: Proceedings of the European Conference on Computer Vision (2020)

Pont-Tuset, J., Uijlings, J., Changpinyo, S., Soricut, R., Ferrari, V.: Connecting Vision and Language with Localized Narratives. In: Proceedings of the European Conference on Computer Vision (2020)

work page 2020
[25]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

Sarto, S., Barraco, M., Cornia, M., Baraldi, L., Cucchiara, R.: Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

work page 2023
[26]

Interna- tional Journal of Computer Vision (2025)

Sarto,S.,Moratelli,N.,Cornia,M.,Baraldi,L.,Cucchiara,R.:Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training. Interna- tional Journal of Computer Vision (2025)

work page 2025
[27]

In: Proceedings of the Annual Meeting of the Association for Computational Lin- guistics (2017)

Shekhar, R., Pezzelle, S., Klimovich, Y., Herbelot, A., Nabi, M., Sangineto, E., Bernardi, R.: Foil It! Find One Mismatch Between Image and Language Caption. In: Proceedings of the Annual Meeting of the Association for Computational Lin- guistics (2017)

work page 2017
[28]

In: Proceedings of the European Conference on Computer Vision

Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: Textcaps: A Dataset for Image Cap- tioning with Reading Comprehension. In: Proceedings of the European Conference on Computer Vision. Springer (2020)

work page 2020
[29]

From pixels to prose: A large dataset of dense image cap- tions

Singla, V., Yue, K., Paul, S., Shirkavand, R., Jayawardhana, M., Ganjdanesh, A., Huang, H., Bhatele, A., Somepalli, G., Goldstein, T.: From Pixels to Prose: A Large Dataset of Dense Image Captions. arXiv preprint arXiv:2406.10328 (2024)

work page arXiv 2024
[30]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022) Title Suppressed Due to Excessive Length 17

Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., Ross, C.: Winoground: Probing vision and Language Models for Visio-Linguistic Composi- tionality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022) Title Suppressed Due to Excessive Length 17

work page 2022
[31]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

Urbanek, J., Bordes, F., Astolfi, P., Williamson, M., Sharma, V., Romero-Soriano, A.: A Picture is Worth More than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

work page 2024
[32]

Wada, Y., Kaneda, K., Saito, D., Sugiura, K.: POLOS: Multimodal Metric Learn- ingfromHumanFeedbackforImageCaptioning.In:ProceedingsoftheIEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

work page 2024
[33]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mo- hammed,O.K.,Singhal,S.,Som,S.,etal.:ImageasaForeignLanguage:BEiTPre- training for Vision and Vision-Language Tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

work page 2023
[34]

Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J., et al.: When and Why Vision-Language Models Behave Like Bags-of-Words, and What to Do About It? In: Proceedings of the International Conference on Learning Representations (2023)

work page 2023
[35]

In: Proceedings of the European Conference on Computer Vision (2024) 18 G

Zhang, B., Zhang, P., Dong, X., Zang, Y., Wang, J.: Long-CLIP: Unlocking the Long-Text Capability of CLIP. In: Proceedings of the European Conference on Computer Vision (2024) 18 G. Gomes et al. Supplementary Materials We prepared a set of supplementary materials that provide additional details supporting the methodology and results presented in the main ...

work page 2024

[1] [1]

From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge

Aditya, S., Yang, Y., Baral, C., Fermuller, C., Aloimonos, Y.: From Images to Sentences Through Scene Description Graphs Using Commonsense Reasoning and Knowledge. arXiv preprint arXiv:1511.03292 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[2] [2]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)

Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., Anderson, P.: Nocaps: Novel Object Captioning at Scale. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)

work page 2019

[3] [3]

In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2023)

Chan, D., Petryk, S., Gonzalez, J., Darrell, T., Canny, J.: CLAIR: Evaluating Image Captions with Large Language Models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2023)

work page 2023

[4] [4]

In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2025)

Chen, X., Salazar, I., Kementchedjhieva, Y.: SPECS: Specificity-Enhanced CLIP- Score for Long Image Caption Evaluation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2025)

work page 2025

[5] [5]

In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference (2025)

Feng, Y., Wen, C., Peng, Z., Zhu, S., et al.: Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation. In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference (2025)

work page 2025

[6] [6]

Garg, R., Burns, A., Karagol-Ayan, B., Bitton, Y., Montgomery, C., Onoe, Y., Bunner, A., Krishna, R., Baldridge, J.M., Soricut, R.: Imageinwords: Unlocking Hyper-DetailedImageDescriptions.In:ProceedingsoftheConferenceonEmpirical Methods in Natural Language Processing (2024)

work page 2024

[7] [7]

Gomes, G.E.C., Zerva, C., Martins, B.: Evaluation of Multilingual Image Caption- ing: How Far Can We Get with CLIP Models? In: Findings of the Association for Computational Linguistics (2025)

work page 2025

[8] [8]

arXiv preprint arXiv:2412.18150 (2024)

Han, S., Fan, H., Fu, J., Li, L., Li, T., Cui, J., Wang, Y., Tai, Y., Sun, J., Guo, C., et al.: Evalmuse-40k: A Reliable and Fine-Grained Benchmark with Comprehen- sive Human Annotations for Text-to-Image Generation Model Evaluation. arXiv preprint arXiv:2412.18150 (2024)

work page arXiv 2024

[9] [9]

In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2021)

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: CLIPscore: A Reference-Free Evaluation Metric for Image Captioning. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2021)

work page 2021

[10] [10]

Journal of Artificial Intelligence Research (2013)

Hodosh, M., Young, P., Hockenmaier, J.: Framing Image Description as a Rank- ing Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research (2013)

work page 2013

[11] [11]

Advances in neural information processing systems (2023)

Hsieh, C.Y., Zhang, J., Ma, Z., Kembhavi, A., Krishna, R.: Sugarcrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality. Advances in neural information processing systems (2023)

work page 2023

[12] [12]

In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2023)

Hu, A., Chen, S., Zhang, L., Jin, Q.: InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2023)

work page 2023

[13] [13]

In: Findings of the Annual Meeting of the Association for Computational Linguistics (2025)

Kim, H., Kim, S., Jeong, J., Cho, Y., Cho, S.: EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations. In: Findings of the Annual Meeting of the Association for Computational Linguistics (2025)

work page 2025

[14] [14]

In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2024)

Lee, Y., Park, I., Kang, M.: FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2024)

work page 2024

[15] [15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 16 G

Li, B., Lin, Z., Pathak, D., Li, J., Fei, Y., Wu, K., Xia, X., Zhang, P., Neubig, G., Ramanan, D.: Evaluating and Improving Compositional Text-to-Visual Gen- eration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 16 G. Gomes et al

work page 2024

[16] [16]

In: Proceedings of the European Conference on Computer Vision (2014)

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: Proceedings of the European Conference on Computer Vision (2014)

work page 2014

[17] [17]

Springer (2024)

Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., Ramanan, D.: EvaluatingText-to-VisualGenerationwithImage-to-TextGeneration.In:Proceed- ings of the European Conference on Computer Vision. Springer (2024)

work page 2024

[18] [18]

arXiv preprint arXiv:2402.11572 (2024)

Ma, Z., Wang, C., Ouyang, Y., Zhao, F., Zhang, J., Huang, S., Chen, J.: Cobra Ef- fect in Reference-Free Image Captioning Metrics. arXiv preprint arXiv:2402.11572 (2024)

work page arXiv 2024

[19] [19]

In: Proceedings of the Asian Conference on Computer Vision (2024)

Matsuda, K., Wada, Y., Sugiura, K.: DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning. In: Proceedings of the Asian Conference on Computer Vision (2024)

work page 2024

[20] [20]

Advances in Neural Information Processing Systems (2023)

Narins, L.D., Scott, A., Gautam, A., Kulkarni, A., Castanon, M., Kao, B., Ihorn, S., Siu, Y.T., Mason, J.M., Blum, A., et al.: Validated Image Caption Rating Dataset. Advances in Neural Information Processing Systems (2023)

work page 2023

[21] [21]

In: Proceedings to the European Conference on Computer Vision (2024)

Onoe, Y., Rane, S., Berger, Z., Bitton, Y., Cho, J., Garg, R., Ku, A., Parekh, Z., Pont-Tuset, J., Tanzer, G., et al.: DOCCI: Descriptions of Connected and Con- trasting Images. In: Proceedings to the European Conference on Computer Vision (2024)

work page 2024

[22] [22]

In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2022)

Parcalabescu, L., Cafagna, M., Muradjan, L., Frank, A., Calixto, I., Gatt, A.: VALSE: A Task-Independent Benchmark for Vision and Language Models Cen- tered on Linguistic Phenomena. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2022)

work page 2022

[23] [23]

In: Pro- ceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (2024)

Petryk, S., Chan, D., Kachinthaya, A., Zou, H., Canny, J., Gonzalez, J., Darrell, T.: ALOHa: A Nnew Measure for Hallucination in Captioning Models. In: Pro- ceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (2024)

work page 2024

[24] [24]

In: Proceedings of the European Conference on Computer Vision (2020)

Pont-Tuset, J., Uijlings, J., Changpinyo, S., Soricut, R., Ferrari, V.: Connecting Vision and Language with Localized Narratives. In: Proceedings of the European Conference on Computer Vision (2020)

work page 2020

[25] [25]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

Sarto, S., Barraco, M., Cornia, M., Baraldi, L., Cucchiara, R.: Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

work page 2023

[26] [26]

Interna- tional Journal of Computer Vision (2025)

Sarto,S.,Moratelli,N.,Cornia,M.,Baraldi,L.,Cucchiara,R.:Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training. Interna- tional Journal of Computer Vision (2025)

work page 2025

[27] [27]

In: Proceedings of the Annual Meeting of the Association for Computational Lin- guistics (2017)

Shekhar, R., Pezzelle, S., Klimovich, Y., Herbelot, A., Nabi, M., Sangineto, E., Bernardi, R.: Foil It! Find One Mismatch Between Image and Language Caption. In: Proceedings of the Annual Meeting of the Association for Computational Lin- guistics (2017)

work page 2017

[28] [28]

In: Proceedings of the European Conference on Computer Vision

Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: Textcaps: A Dataset for Image Cap- tioning with Reading Comprehension. In: Proceedings of the European Conference on Computer Vision. Springer (2020)

work page 2020

[29] [29]

From pixels to prose: A large dataset of dense image cap- tions

Singla, V., Yue, K., Paul, S., Shirkavand, R., Jayawardhana, M., Ganjdanesh, A., Huang, H., Bhatele, A., Somepalli, G., Goldstein, T.: From Pixels to Prose: A Large Dataset of Dense Image Captions. arXiv preprint arXiv:2406.10328 (2024)

work page arXiv 2024

[30] [30]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022) Title Suppressed Due to Excessive Length 17

Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., Ross, C.: Winoground: Probing vision and Language Models for Visio-Linguistic Composi- tionality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022) Title Suppressed Due to Excessive Length 17

work page 2022

[31] [31]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

Urbanek, J., Bordes, F., Astolfi, P., Williamson, M., Sharma, V., Romero-Soriano, A.: A Picture is Worth More than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

work page 2024

[32] [32]

Wada, Y., Kaneda, K., Saito, D., Sugiura, K.: POLOS: Multimodal Metric Learn- ingfromHumanFeedbackforImageCaptioning.In:ProceedingsoftheIEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

work page 2024

[33] [33]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mo- hammed,O.K.,Singhal,S.,Som,S.,etal.:ImageasaForeignLanguage:BEiTPre- training for Vision and Vision-Language Tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

work page 2023

[34] [34]

Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J., et al.: When and Why Vision-Language Models Behave Like Bags-of-Words, and What to Do About It? In: Proceedings of the International Conference on Learning Representations (2023)

work page 2023

[35] [35]

In: Proceedings of the European Conference on Computer Vision (2024) 18 G

Zhang, B., Zhang, P., Dong, X., Zang, Y., Wang, J.: Long-CLIP: Unlocking the Long-Text Capability of CLIP. In: Proceedings of the European Conference on Computer Vision (2024) 18 G. Gomes et al. Supplementary Materials We prepared a set of supplementary materials that provide additional details supporting the methodology and results presented in the main ...

work page 2024