BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model
Pith reviewed 2026-05-22 08:58 UTC · model grok-4.3
The pith
A lightweight cross-encoder model trained on adversarial data achieves state-of-the-art reference-free evaluation of image captions at low computational cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a cross-encoder model, started from a visual question-answering checkpoint and trained on a carefully mixed dataset containing adversarial LLM augmentations, produces reference-free caption scores that reach state-of-the-art accuracy while remaining efficient enough for repeated use in benchmarking and training loops.
What carries the argument
The lightweight cross-encoder model, initialized from a visual question-answering checkpoint, that directly processes image-text pairs to output a quality score.
If this is right
- Large-scale benchmarking of captioning models becomes feasible without prohibitive compute.
- Quality-aware decoding during generation can use the metric as a direct signal.
- Reinforcement learning or reward modeling for vision-language models gains a practical, efficient reward function.
- Evaluation of long-form and context-rich captions improves over bag-of-words style encoders.
- The introduced benchmark provides a standardized testbed for detailed caption assessment.
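The quality-aware decoding point above can be sketched as a simple reranking step. This is a minimal illustration, not the paper's procedure: `rerank_captions` and `toy_score` are hypothetical names, and the toy scorer (word overlap with a set of image tags) only stands in for a learned reference-free metric.

```python
from __future__ import annotations

from typing import Callable, Sequence


def rerank_captions(
    image: object,
    candidates: Sequence[str],
    score_fn: Callable[[object, str], float],
) -> list[tuple[str, float]]:
    """Score every candidate caption against the image and sort best-first."""
    scored = [(caption, score_fn(image, caption)) for caption in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(image: object, caption: str) -> float:
    """Hypothetical stand-in scorer: fraction of caption words found in the image tags."""
    tags = set(image)  # in this toy setup the "image" is just a set of tags
    words = caption.lower().split()
    return sum(word in tags for word in words) / len(words) if words else 0.0


image_tags = {"brown", "dog", "red", "ball", "grass"}
candidates = [
    "a cat on a sofa",
    "brown dog with red ball on grass",
    "a dog outside",
]
ranking = rerank_captions(image_tags, candidates, toy_score)
best_caption = ranking[0][0]
print(best_caption)
```

Because the scorer is called once per candidate, its per-call cost bounds how many candidates can be reranked, which is why metric efficiency matters for this use.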
Where Pith is reading between the lines
- Similar lightweight cross-encoder setups could be tested for evaluating other vision-language outputs such as visual question answers or generated images.
- The adversarial augmentation strategy might transfer to training evaluators in adjacent domains like video captioning or multimodal dialogue.
- If the efficiency holds, the metric could support on-device or repeated inference in production captioning pipelines.
- Extending the initialization idea to other pretrained multimodal checkpoints could yield further gains without increasing model size.
Load-bearing premise
The mixture of training data that includes adversarial LLM-based augmentations is what gives the model its sensitivity to fine-grained visual-linguistic mismatches.
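To make the idea of fine-grained negatives concrete, here is a rough sketch that builds (caption, label) pairs by swapping a single attribute word. It is an assumption-laden stand-in: the paper uses LLM-based augmentations, whereas this example substitutes a small rule-based swap table, and `ATTRIBUTE_SWAPS`, `make_hard_negative`, and `build_training_pairs` are hypothetical names.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical swap table; the paper relies on an LLM rather than fixed rules.
ATTRIBUTE_SWAPS = {
    "red": "blue",
    "left": "right",
    "small": "large",
    "two": "three",
}


def make_hard_negative(caption: str, rng: random.Random) -> Optional[str]:
    """Return the caption with exactly one attribute word swapped, or None."""
    words = caption.split()
    swappable = [i for i, w in enumerate(words) if w.lower() in ATTRIBUTE_SWAPS]
    if not swappable:
        return None  # nothing to perturb in this caption
    index = rng.choice(swappable)
    words[index] = ATTRIBUTE_SWAPS[words[index].lower()]
    return " ".join(words)


def build_training_pairs(captions: List[str], seed: int = 0) -> List[Tuple[str, float]]:
    """Label original captions 1.0 and their perturbed versions 0.0."""
    rng = random.Random(seed)
    pairs: List[Tuple[str, float]] = []
    for caption in captions:
        pairs.append((caption, 1.0))
        negative = make_hard_negative(caption, rng)
        if negative is not None:
            pairs.append((negative, 0.0))
    return pairs


pairs = build_training_pairs(["a red kite above two small boats", "a dog sleeping"])
for text, label in pairs:
    print(label, text)
```

Each negative differs from its source caption by one word, so a model can only separate the two by attending to that specific detail in the image.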
What would settle it
If, on the new benchmark, the model scores captions with clear fine-grained errors (wrong object attributes, incorrect spatial relations, or missing context) higher than correct captions of the same images, the claimed performance advantage would be refuted.
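One way to run this check is a pairwise accuracy over (image, correct caption, corrupted caption) triples. The sketch below assumes that setup; `pairwise_accuracy` and `lookup_score` are hypothetical names, and the scores in `toy_scores` are invented for illustration, not results from the paper.

```python
from typing import Callable, Sequence, Tuple

Triple = Tuple[str, str, str]  # (image id, correct caption, corrupted caption)


def pairwise_accuracy(
    triples: Sequence[Triple],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of triples where the correct caption scores strictly higher."""
    if not triples:
        raise ValueError("need at least one triple")
    wins = [score_fn(img, good) > score_fn(img, bad) for img, good, bad in triples]
    return sum(wins) / len(wins)


# Invented scores standing in for a metric's outputs.
toy_scores = {
    ("img1", "a red cup on a table"): 0.91,
    ("img1", "a blue cup on a table"): 0.34,
    ("img2", "a man to the left of a car"): 0.62,
    ("img2", "a man to the right of a car"): 0.70,  # metric prefers the wrong caption
}


def lookup_score(image_id: str, caption: str) -> float:
    return toy_scores[(image_id, caption)]


triples = [
    ("img1", "a red cup on a table", "a blue cup on a table"),
    ("img2", "a man to the left of a car", "a man to the right of a car"),
]
accuracy = pairwise_accuracy(triples, lookup_score)
print(accuracy)  # 0.5: one attribute error caught, one spatial error missed
```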
Original abstract
Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as "bags-of-words." We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.
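The "bags-of-words" limitation named in the abstract can be illustrated in plain text. The sketch below is only an analogy under stated assumptions: it does not reproduce how CLIP-style encoders compute similarity, and `bag_of_words` and `bigrams` are hypothetical helper names.

```python
from collections import Counter
from typing import List, Tuple


def bag_of_words(caption: str) -> Counter:
    """Order-free representation: word counts only."""
    return Counter(caption.lower().split())


def bigrams(caption: str) -> List[Tuple[str, str]]:
    """Order-aware representation: adjacent word pairs."""
    words = caption.lower().split()
    return list(zip(words, words[1:]))


caption_a = "the dog chases the cat"
caption_b = "the cat chases the dog"

same_bag = bag_of_words(caption_a) == bag_of_words(caption_b)
same_bigrams = bigrams(caption_a) == bigrams(caption_b)
print(same_bag, same_bigrams)  # True False
```

The two captions describe opposite events, yet any scorer that sees only the word counts must assign them the same score.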
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BEiTScore, a reference-free image captioning evaluation metric based on a lightweight cross-encoder initialized from a visual question-answering checkpoint. Training uses a data mixture that incorporates adversarial LLM-based augmentations to increase sensitivity to fine-grained visual-linguistic mismatches. A new benchmark for detailed caption evaluation is presented, and experiments report state-of-the-art results together with efficiency suitable for large-scale use, quality-aware decoding, and reward modeling.
Significance. If the performance claims hold under scrutiny, the work supplies a practical middle ground between expensive LLM judges and limited CLIP-style encoders. The VQA initialization combined with targeted adversarial training offers a concrete route to improved compositional sensitivity at modest compute cost. Credit is given for the reproducible training recipe and efficiency measurements that directly address downstream usability.
major comments (2)
- §4 (Experimental Results): The central SOTA claim rests on comparisons to baselines; an ablation isolating the contribution of the adversarial LLM augmentations versus the base data mixture is required to confirm that the reported gains are attributable to the proposed training scheme rather than initialization or data scale alone.
- §5 (New Benchmark): The benchmark construction details (scenario selection, annotation protocol, and diversity controls) are load-bearing for the claim that the metric generalizes across detailed captioning scenarios; without them, it is difficult to rule out benchmark-specific biases favoring the proposed model.
minor comments (3)
- Table 2: Report standard deviations or confidence intervals alongside mean scores to allow readers to assess whether the observed margins over baselines are statistically meaningful.
- Figure 2: The efficiency plot would benefit from explicit hardware specifications and direct side-by-side timing against the strongest LLM baseline under identical conditions.
- §3.1: Clarify the exact token-length handling and any truncation strategy used by the cross-encoder to address the token-limit limitation mentioned in the introduction.
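For the Table 2 comment, one common way to attach an interval to a mean score is a percentile bootstrap over per-example results. The following is a minimal sketch under that assumption; the report does not prescribe a method, `bootstrap_ci` is a hypothetical helper, and the 0/1 outcomes are invented.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lower_index], means[upper_index]


# Invented per-example outcomes: 1.0 if the metric agreed with the human judgment.
outcomes = [1.0] * 70 + [0.0] * 30
mean_score = statistics.fmean(outcomes)
lower, upper = bootstrap_ci(outcomes)
print(f"mean={mean_score:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

If the intervals of two metrics overlap heavily, the margin between their mean scores may not be statistically meaningful.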
Simulated Author's Rebuttal
We thank the referee for the constructive review and recommendation of minor revision. We address each major comment below and will revise the manuscript accordingly to incorporate the requested analyses and details.
- Referee: §4 (Experimental Results): The central SOTA claim rests on comparisons to baselines; an ablation isolating the contribution of the adversarial LLM augmentations versus the base data mixture is required to confirm that the reported gains are attributable to the proposed training scheme rather than initialization or data scale alone.
  Authors: We agree that an explicit ablation is needed to isolate the contribution of the adversarial LLM augmentations. The original manuscript describes the full training mixture but does not report a controlled comparison against the base data mixture alone. In the revised version we will add this ablation, training an otherwise identical model on the base mixture without the adversarial augmentations and reporting the resulting drop in performance on the detailed caption benchmarks. This will directly attribute the observed gains to the proposed augmentation strategy rather than initialization or data scale.
  Revision: yes
- Referee: §5 (New Benchmark): The benchmark construction details (scenario selection, annotation protocol, and diversity controls) are load-bearing for the claim that the metric generalizes across detailed captioning scenarios; without them, it is difficult to rule out benchmark-specific biases favoring the proposed model.
  Authors: We acknowledge that additional transparency on benchmark construction is required to support claims of generalizability. While §5 outlines the new benchmark, it does not fully specify the scenario selection process, annotation protocol, or diversity controls. In the revision we will expand this section with these details, including the criteria used to select scenarios, the protocol for obtaining reference judgments, and the steps taken to ensure diversity and reduce potential biases. This will strengthen the evidence that performance gains are not benchmark-specific.
  Revision: yes
Circularity Check
No significant circularity detected
The paper presents an empirical ML contribution: a lightweight cross-encoder metric trained via supervised learning on an assembled data mixture that includes adversarial LLM augmentations, initialized from a VQA checkpoint, and evaluated on a newly introduced benchmark. No derivation chain, equations, or first-principles results are present that reduce by construction to fitted parameters or self-citations. The reported SOTA performance and efficiency claims rest on external benchmark comparisons and reproducible training details rather than any self-referential definition or prediction that is statistically forced by the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Initialization from a visual question-answering model checkpoint provides a strong yet efficient starting point for the caption evaluation task.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  lightweight cross-encoder ... initialized from a visual question-answering model checkpoint ... adversarial LLM-based data augmentations ... binary cross-entropy (BCE) loss ... L1 loss
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.