pith. sign in

arxiv: 2605.21728 · v1 · pith:WKZXOXQUnew · submitted 2026-05-20 · 💻 cs.CV · cs.CL· cs.LG

BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

Pith reviewed 2026-05-22 08:58 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.LG
keywords reference-free evaluationimage captioningcross-encoder modelvision-language modelsadversarial data augmentationefficient benchmarkingcaption quality assessmentVQA initialization
0
0 comments X

The pith

A lightweight cross-encoder model trained on adversarial data achieves state-of-the-art reference-free evaluation of image captions at low computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEiTScore, a learned metric that scores how well an image caption matches its image without needing reference captions. It initializes a compact cross-encoder from a visual question-answering checkpoint and trains it on a mix of data that includes adversarial examples generated by large language models to catch subtle visual-linguistic mistakes. This design aims to overcome the token limits and coarse judgments of CLIP encoders while avoiding the high expense of using LLMs directly as judges. A new benchmark tests performance on detailed, long-form captions across varied scenarios. If the approach works, it supplies a practical tool for large-scale model comparison, guided decoding, and reinforcement learning signals in vision-language systems.

Core claim

The central claim is that a cross-encoder model, started from a visual question-answering checkpoint and trained on a carefully mixed dataset containing adversarial LLM augmentations, produces reference-free caption scores that reach state-of-the-art accuracy while remaining efficient enough for repeated use in benchmarking and training loops.

What carries the argument

The lightweight cross-encoder model, initialized from a visual question-answering checkpoint, that directly processes image-text pairs to output a quality score.

If this is right

  • Large-scale benchmarking of captioning models becomes feasible without prohibitive compute.
  • Quality-aware decoding during generation can use the metric as a direct signal.
  • Reinforcement learning or reward modeling for vision-language models gains a practical, efficient reward function.
  • Evaluation of long-form and context-rich captions improves over bag-of-words style encoders.
  • The introduced benchmark provides a standardized testbed for detailed caption assessment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar lightweight cross-encoder setups could be tested for evaluating other vision-language outputs such as visual question answers or generated images.
  • The adversarial augmentation strategy might transfer to training evaluators in adjacent domains like video captioning or multimodal dialogue.
  • If the efficiency holds, the metric could support on-device or repeated inference in production captioning pipelines.
  • Extending the initialization idea to other pretrained multimodal checkpoints could yield further gains without increasing model size.

Load-bearing premise

The mixture of training data that includes adversarial LLM-based augmentations is what gives the model its sensitivity to fine-grained visual-linguistic mismatches.

What would settle it

If the model assigns high scores to captions that contain clear fine-grained errors such as wrong object attributes, incorrect spatial relations, or missing context on the new benchmark, while lower-scoring correct captions exist, the performance advantage would be refuted.

Figures

Figures reproduced from arXiv: 2605.21728 by Bruno Martins, Chrysoula Zerva, Gon\c{c}alo Gomes.

Figure 1
Figure 1. Figure 1: BEiTScore versus state-of-the-art metrics on an instance from Nebula [19]. there have been significant advances in terms of learned evaluation metrics, the current encoder-based metrics still struggle with these requirements. Recent encoder-based approaches either rely on reference captions to assist evaluation within the textual domain [4, 19, 32] or use CLIP [9]-based encoders. Despite achieving strong c… view at source ↗
Figure 2
Figure 2. Figure 2: BEiTScore architecture, and adversarial data augmentation strategy. 3.1 Adversarial Data Augmentation Strategy Building on the aforementioned idea, we augmented existing datasets by explic￾itly focusing on pairing images with detailed textual descriptions [24, 27–29], and repurposing them to build a binary pairwise training dataset. Specifically, we prompt a LLM4 to generate factually incorrect but fluent … view at source ↗
Figure 3
Figure 3. Figure 3: Two qualitative examples from Winoground benchmark. ity and counting), correctly classifying named spatial relations between objects (relations), and distinguishing actions while identifying their participants (ac￾tions). These tasks were previously very challenging for encoder-based methods, as highlighted in the original benchmark and confirmed by our results. Results on the Winoground Benchmark: Winogro… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative example from LongCapVLCP, with an Image and original caption from DOCCI. The results in [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Caption length distributions (i.e., the number of words per caption) for the three datasets. Each histogram shows the number of instances containing captions with a number of words that falls within each bin interval values. The maximum caption lengths are as follows: 176 words for the training dataset, 270 words for the validation dataset, and 525 words for the LongCapVLCP benchmark [PITH_FULL_IMAGE:figu… view at source ↗
read the original abstract

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as ``bags-of-words.'' We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces BEiTScore, a reference-free image captioning evaluation metric based on a lightweight cross-encoder initialized from a visual question-answering checkpoint. Training uses a data mixture that incorporates adversarial LLM-based augmentations to increase sensitivity to fine-grained visual-linguistic mismatches. A new benchmark for detailed caption evaluation is presented, and experiments report state-of-the-art results together with efficiency suitable for large-scale use, quality-aware decoding, and reward modeling.

Significance. If the performance claims hold under scrutiny, the work supplies a practical middle ground between expensive LLM judges and limited CLIP-style encoders. The VQA initialization combined with targeted adversarial training offers a concrete route to improved compositional sensitivity at modest compute cost. Credit is given for the reproducible training recipe and efficiency measurements that directly address downstream usability.

major comments (2)
  1. [§4] §4 (Experimental Results): The central SOTA claim rests on comparisons to baselines; an ablation isolating the contribution of the adversarial LLM augmentations versus the base data mixture is required to confirm that the reported gains are attributable to the proposed training scheme rather than initialization or data scale alone.
  2. [§5] §5 (New Benchmark): The benchmark construction details (scenario selection, annotation protocol, and diversity controls) are load-bearing for the claim that the metric generalizes across detailed captioning scenarios; without them, it is difficult to rule out benchmark-specific biases favoring the proposed model.
minor comments (3)
  1. [Table 2] Table 2: Report standard deviations or confidence intervals alongside mean scores to allow readers to assess whether the observed margins over baselines are statistically meaningful.
  2. [Figure 2] Figure 2: The efficiency plot would benefit from explicit hardware specifications and direct side-by-side timing against the strongest LLM baseline under identical conditions.
  3. [§3.1] §3.1: Clarify the exact token-length handling and any truncation strategy used by the cross-encoder to address the token-limit limitation mentioned in the introduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation of minor revision. We address each major comment below and will revise the manuscript accordingly to incorporate the requested analyses and details.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Results): The central SOTA claim rests on comparisons to baselines; an ablation isolating the contribution of the adversarial LLM augmentations versus the base data mixture is required to confirm that the reported gains are attributable to the proposed training scheme rather than initialization or data scale alone.

    Authors: We agree that an explicit ablation is needed to isolate the contribution of the adversarial LLM augmentations. The original manuscript describes the full training mixture but does not report a controlled comparison against the base data mixture alone. In the revised version we will add this ablation, training an otherwise identical model on the base mixture without the adversarial augmentations and reporting the resulting drop in performance on the detailed caption benchmarks. This will directly attribute the observed gains to the proposed augmentation strategy rather than initialization or data scale. revision: yes

  2. Referee: [§5] §5 (New Benchmark): The benchmark construction details (scenario selection, annotation protocol, and diversity controls) are load-bearing for the claim that the metric generalizes across detailed captioning scenarios; without them, it is difficult to rule out benchmark-specific biases favoring the proposed model.

    Authors: We acknowledge that additional transparency on benchmark construction is required to support claims of generalizability. While §5 outlines the new benchmark, it does not fully specify the scenario selection process, annotation protocol, or diversity controls. In the revision we will expand this section with these details, including the criteria used to select scenarios, the protocol for obtaining reference judgments, and the steps taken to ensure diversity and reduce potential biases. This will strengthen the evidence that performance gains are not benchmark-specific. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical ML contribution: a lightweight cross-encoder metric trained via supervised learning on an assembled data mixture that includes adversarial LLM augmentations, initialized from a VQA checkpoint, and evaluated on a newly introduced benchmark. No derivation chain, equations, or first-principles results are present that reduce by construction to fitted parameters or self-citations. The reported SOTA performance and efficiency claims rest on external benchmark comparisons and reproducible training details rather than any self-referential definition or prediction that is statistically forced by the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to the explicit modeling choices stated there; the central performance claim rests on the effectiveness of the chosen initialization and data mixture.

axioms (1)
  • domain assumption Initialization from a visual question-answering model checkpoint provides a strong yet efficient starting point for the caption evaluation task.
    The abstract states the model is initialized from a VQA checkpoint to balance strong weight initialization with computational efficiency.

pith-pipeline@v0.9.0 · 5729 in / 1400 out tokens · 45124 ms · 2026-05-22T08:58:08.791454+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1]

    From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge

    Aditya, S., Yang, Y., Baral, C., Fermuller, C., Aloimonos, Y.: From Images to Sentences Through Scene Description Graphs Using Commonsense Reasoning and Knowledge. arXiv preprint arXiv:1511.03292 (2015)

  2. [2]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)

    Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., Anderson, P.: Nocaps: Novel Object Captioning at Scale. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)

  3. [3]

    In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2023)

    Chan, D., Petryk, S., Gonzalez, J., Darrell, T., Canny, J.: CLAIR: Evaluating Image Captions with Large Language Models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2023)

  4. [4]

    In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2025)

    Chen, X., Salazar, I., Kementchedjhieva, Y.: SPECS: Specificity-Enhanced CLIP- Score for Long Image Caption Evaluation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2025)

  5. [5]

    In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference (2025)

    Feng, Y., Wen, C., Peng, Z., Zhu, S., et al.: Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation. In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference (2025)

  6. [6]

    Garg, R., Burns, A., Karagol-Ayan, B., Bitton, Y., Montgomery, C., Onoe, Y., Bunner, A., Krishna, R., Baldridge, J.M., Soricut, R.: Imageinwords: Unlocking Hyper-DetailedImageDescriptions.In:ProceedingsoftheConferenceonEmpirical Methods in Natural Language Processing (2024)

  7. [7]

    Gomes, G.E.C., Zerva, C., Martins, B.: Evaluation of Multilingual Image Caption- ing: How Far Can We Get with CLIP Models? In: Findings of the Association for Computational Linguistics (2025)

  8. [8]

    arXiv preprint arXiv:2412.18150 (2024)

    Han, S., Fan, H., Fu, J., Li, L., Li, T., Cui, J., Wang, Y., Tai, Y., Sun, J., Guo, C., et al.: Evalmuse-40k: A Reliable and Fine-Grained Benchmark with Comprehen- sive Human Annotations for Text-to-Image Generation Model Evaluation. arXiv preprint arXiv:2412.18150 (2024)

  9. [9]

    In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2021)

    Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: CLIPscore: A Reference-Free Evaluation Metric for Image Captioning. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2021)

  10. [10]

    Journal of Artificial Intelligence Research (2013)

    Hodosh, M., Young, P., Hockenmaier, J.: Framing Image Description as a Rank- ing Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research (2013)

  11. [11]

    Advances in neural information processing systems (2023)

    Hsieh, C.Y., Zhang, J., Ma, Z., Kembhavi, A., Krishna, R.: Sugarcrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality. Advances in neural information processing systems (2023)

  12. [12]

    In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2023)

    Hu, A., Chen, S., Zhang, L., Jin, Q.: InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2023)

  13. [13]

    In: Findings of the Annual Meeting of the Association for Computational Linguistics (2025)

    Kim, H., Kim, S., Jeong, J., Cho, Y., Cho, S.: EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations. In: Findings of the Annual Meeting of the Association for Computational Linguistics (2025)

  14. [14]

    In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2024)

    Lee, Y., Park, I., Kang, M.: FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2024)

  15. [15]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 16 G

    Li, B., Lin, Z., Pathak, D., Li, J., Fei, Y., Wu, K., Xia, X., Zhang, P., Neubig, G., Ramanan, D.: Evaluating and Improving Compositional Text-to-Visual Gen- eration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 16 G. Gomes et al

  16. [16]

    In: Proceedings of the European Conference on Computer Vision (2014)

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: Proceedings of the European Conference on Computer Vision (2014)

  17. [17]

    Springer (2024)

    Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., Ramanan, D.: EvaluatingText-to-VisualGenerationwithImage-to-TextGeneration.In:Proceed- ings of the European Conference on Computer Vision. Springer (2024)

  18. [18]

    arXiv preprint arXiv:2402.11572 (2024)

    Ma, Z., Wang, C., Ouyang, Y., Zhao, F., Zhang, J., Huang, S., Chen, J.: Cobra Ef- fect in Reference-Free Image Captioning Metrics. arXiv preprint arXiv:2402.11572 (2024)

  19. [19]

    In: Proceedings of the Asian Conference on Computer Vision (2024)

    Matsuda, K., Wada, Y., Sugiura, K.: DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning. In: Proceedings of the Asian Conference on Computer Vision (2024)

  20. [20]

    Advances in Neural Information Processing Systems (2023)

    Narins, L.D., Scott, A., Gautam, A., Kulkarni, A., Castanon, M., Kao, B., Ihorn, S., Siu, Y.T., Mason, J.M., Blum, A., et al.: Validated Image Caption Rating Dataset. Advances in Neural Information Processing Systems (2023)

  21. [21]

    In: Proceedings to the European Conference on Computer Vision (2024)

    Onoe, Y., Rane, S., Berger, Z., Bitton, Y., Cho, J., Garg, R., Ku, A., Parekh, Z., Pont-Tuset, J., Tanzer, G., et al.: DOCCI: Descriptions of Connected and Con- trasting Images. In: Proceedings to the European Conference on Computer Vision (2024)

  22. [22]

    In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2022)

    Parcalabescu, L., Cafagna, M., Muradjan, L., Frank, A., Calixto, I., Gatt, A.: VALSE: A Task-Independent Benchmark for Vision and Language Models Cen- tered on Linguistic Phenomena. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (2022)

  23. [23]

    In: Pro- ceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (2024)

    Petryk, S., Chan, D., Kachinthaya, A., Zou, H., Canny, J., Gonzalez, J., Darrell, T.: ALOHa: A Nnew Measure for Hallucination in Captioning Models. In: Pro- ceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (2024)

  24. [24]

    In: Proceedings of the European Conference on Computer Vision (2020)

    Pont-Tuset, J., Uijlings, J., Changpinyo, S., Soricut, R., Ferrari, V.: Connecting Vision and Language with Localized Narratives. In: Proceedings of the European Conference on Computer Vision (2020)

  25. [25]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

    Sarto, S., Barraco, M., Cornia, M., Baraldi, L., Cucchiara, R.: Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

  26. [26]

    Interna- tional Journal of Computer Vision (2025)

    Sarto,S.,Moratelli,N.,Cornia,M.,Baraldi,L.,Cucchiara,R.:Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training. Interna- tional Journal of Computer Vision (2025)

  27. [27]

    In: Proceedings of the Annual Meeting of the Association for Computational Lin- guistics (2017)

    Shekhar, R., Pezzelle, S., Klimovich, Y., Herbelot, A., Nabi, M., Sangineto, E., Bernardi, R.: Foil It! Find One Mismatch Between Image and Language Caption. In: Proceedings of the Annual Meeting of the Association for Computational Lin- guistics (2017)

  28. [28]

    In: Proceedings of the European Conference on Computer Vision

    Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: Textcaps: A Dataset for Image Cap- tioning with Reading Comprehension. In: Proceedings of the European Conference on Computer Vision. Springer (2020)

  29. [29]

    From pixels to prose: A large dataset of dense image cap- tions

    Singla, V., Yue, K., Paul, S., Shirkavand, R., Jayawardhana, M., Ganjdanesh, A., Huang, H., Bhatele, A., Somepalli, G., Goldstein, T.: From Pixels to Prose: A Large Dataset of Dense Image Captions. arXiv preprint arXiv:2406.10328 (2024)

  30. [30]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022) Title Suppressed Due to Excessive Length 17

    Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., Ross, C.: Winoground: Probing vision and Language Models for Visio-Linguistic Composi- tionality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022) Title Suppressed Due to Excessive Length 17

  31. [31]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

    Urbanek, J., Bordes, F., Astolfi, P., Williamson, M., Sharma, V., Romero-Soriano, A.: A Picture is Worth More than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  32. [32]

    Wada, Y., Kaneda, K., Saito, D., Sugiura, K.: POLOS: Multimodal Metric Learn- ingfromHumanFeedbackforImageCaptioning.In:ProceedingsoftheIEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  33. [33]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

    Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mo- hammed,O.K.,Singhal,S.,Som,S.,etal.:ImageasaForeignLanguage:BEiTPre- training for Vision and Vision-Language Tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

  34. [34]

    Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J., et al.: When and Why Vision-Language Models Behave Like Bags-of-Words, and What to Do About It? In: Proceedings of the International Conference on Learning Representations (2023)

  35. [35]

    In: Proceedings of the European Conference on Computer Vision (2024) 18 G

    Zhang, B., Zhang, P., Dong, X., Zang, Y., Wang, J.: Long-CLIP: Unlocking the Long-Text Capability of CLIP. In: Proceedings of the European Conference on Computer Vision (2024) 18 G. Gomes et al. Supplementary Materials We prepared a set of supplementary materials that provide additional details supporting the methodology and results presented in the main ...