pith. machine review for the scientific record.

arxiv: 2605.06080 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords reference-free image caption evaluation · multi-scale distributional scoring · von Mises-Fisher mixtures · bi-directional KL divergence · vision-language metrics · hallucination detection

The pith

MSD-Score evaluates image captions without references by treating patch and token embeddings as von Mises-Fisher mixtures and scoring them distributionally at multiple scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MSD-Score as a way to judge how well a caption matches an image when no reference caption is available. Instead of comparing single embedding points, it models image patches and text tokens as mixtures of directional distributions on the sphere. It then measures mismatches with a weighted bidirectional KL divergence computed across several scales and adds this to a global similarity term. The goal is to pick up local problems such as invented objects, omitted attributes, or incorrect relations that simpler metrics miss. Experiments indicate the resulting scores line up more closely with human ratings than earlier reference-free approaches and can also break down which parts of the alignment went wrong.
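
As a concreteness aid, the sketch below walks through that recipe end to end: fit a small directional mixture per modality, score a bi-directional divergence between the mixtures, and fuse it with a global cosine term. Everything here is an illustrative assumption rather than the authors' implementation: the spherical k-means fitting shortcut, the component-matching approximation to the mixture KL, the fusion weights, and the collapse of the multi-scale aggregation to a single local scale.

```python
# Minimal, illustrative sketch of an MSD-style score. Names, the matching-based
# KL approximation, and the fusion weights are assumptions for illustration only.
import numpy as np
from scipy.special import ive
from scipy.cluster.vq import kmeans2

def log_bessel_iv(nu, kappa):
    # log I_nu(kappa) via the exponentially scaled Bessel function: iv(nu, x) = ive(nu, x) * exp(x).
    return np.log(ive(nu, kappa)) + kappa

def vmf_log_norm(d, kappa):
    # log C_d(kappa) for the vMF density f(x) = C_d(kappa) * exp(kappa * mu^T x).
    return (d / 2 - 1) * np.log(kappa) - (d / 2) * np.log(2 * np.pi) - log_bessel_iv(d / 2 - 1, kappa)

def fit_vmf_mixture(X, k, seed=0):
    # Crude mixture fit: spherical k-means for mean directions, plus the
    # Banerjee et al. closed-form kappa approximation per component.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    _, labels = kmeans2(X, k, seed=seed, minit="++")
    comps, d = [], X.shape[1]
    for c in range(k):
        Xc = X[labels == c]
        if len(Xc) == 0:
            continue
        r = Xc.mean(axis=0)
        r_norm = np.linalg.norm(r)
        kappa = (r_norm * d - r_norm ** 3) / (1 - r_norm ** 2 + 1e-8)
        comps.append((len(Xc) / len(X), r / r_norm, kappa))
    return comps

def vmf_kl(mu1, k1, mu2, k2, d):
    # Closed-form KL between two vMF distributions in dimension d.
    a1 = np.exp(log_bessel_iv(d / 2, k1) - log_bessel_iv(d / 2 - 1, k1))  # mean resultant length
    return (vmf_log_norm(d, k1) - vmf_log_norm(d, k2)) + a1 * (k1 - k2 * float(mu1 @ mu2))

def mixture_divergence(P, Q, d):
    # Matching-style approximation to KL(P || Q): each component of P is
    # scored against its closest component of Q, weighted by mixture weight.
    return sum(w1 * min(vmf_kl(mu1, k1, mu2, k2, d) for _, mu2, k2 in Q) for w1, mu1, k1 in P)

def msd_score(patches, tokens, k=3, alpha=0.5, lam=0.1):
    # Global term: cosine similarity of mean-pooled, re-normalized embeddings.
    d = patches.shape[1]
    g_img = patches.mean(axis=0); g_img /= np.linalg.norm(g_img)
    g_txt = tokens.mean(axis=0); g_txt /= np.linalg.norm(g_txt)
    global_sim = float(g_img @ g_txt)
    # Local term: weighted bi-directional divergence between the two mixtures
    # (the paper's multi-scale aggregation is collapsed to one scale here).
    P, Q = fit_vmf_mixture(patches, k), fit_vmf_mixture(tokens, k)
    local_div = alpha * mixture_divergence(P, Q, d) + (1 - alpha) * mixture_divergence(Q, P, d)
    return global_sim - lam * local_div  # higher is better

# Toy example with low-dimensional synthetic "embeddings"; real CLIP/LLaMA
# features are 512- to 4096-dimensional and need careful Bessel evaluation.
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 8))
tokens = rng.normal(size=(12, 8))
print(round(msd_score(patches, tokens), 4))
```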

Core claim

MSD-Score formulates reference-free image-text matching as a multi-scale distributional scoring task in which both image patch embeddings and text token embeddings are represented as von Mises-Fisher mixtures; semantic discrepancies are quantified by weighted bi-directional KL divergence and fused with global similarity to produce a score that correlates more strongly with human judgments while exposing local grounding errors.

What carries the argument

Multi-scale distributional scoring that represents embeddings as von Mises-Fisher mixtures and quantifies discrepancies with weighted bi-directional KL divergence.
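
One plausible concretization of these quantities is written out below for readability. The symbols, the weighting scheme, and the aggregation over scales are assumptions made for illustration; the paper's exact equations are not reproduced on this page.

```latex
% Illustrative forms only; the paper's exact parameterization may differ.
% vMF density and per-modality mixtures over unit-norm patch/token embeddings x on S^{d-1}:
f(x;\mu,\kappa) = C_d(\kappa)\, e^{\kappa \mu^{\top} x},
\qquad
C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)},
\qquad
P = \sum_{i=1}^{K} \pi_i\, f(\cdot;\mu_i,\kappa_i),
\quad
Q = \sum_{j=1}^{K} \omega_j\, f(\cdot;\nu_j,\tau_j).

% Weighted bi-directional KL discrepancy at scale s, fused with a global similarity term:
D^{(s)}_{\mathrm{bi}}(P,Q) = \alpha\,\mathrm{KL}\!\left(P^{(s)} \,\|\, Q^{(s)}\right)
                           + (1-\alpha)\,\mathrm{KL}\!\left(Q^{(s)} \,\|\, P^{(s)}\right),
\qquad
\mathrm{MSD}(I,T) = \mathrm{sim}_{\mathrm{global}}(I,T) - \sum_{s=1}^{S} \lambda_s\, D^{(s)}_{\mathrm{bi}}(P,Q).
```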

If this is right

  • Captions can be ranked or selected without any reference text while still tracking human preferences closely.
  • The same framework supports evaluation of one caption or a set of candidate captions.
  • Error signals become decomposable so specific mismatches such as missing relations can be localized.
  • The probabilistic scores supply a deterministic complement to purely holistic similarity measures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The mixture representation may prove useful for other tasks that require checking alignment at both coarse and fine levels, such as visual question answering.
  • Pairing the distributional scores with large-model judges could yield hybrid evaluators that combine transparency with broad coverage.
  • The multi-scale structure suggests a natural way to analyze consistency across different levels of semantic abstraction.

Load-bearing premise

Modeling embeddings as von Mises-Fisher mixtures and measuring their multi-scale KL discrepancies accurately reflects the fine-grained semantic mismatches that humans notice in captions.

What would settle it

A collection of image-caption pairs containing subtle errors such as wrong attributes or hallucinated objects where human raters assign low quality but global similarity metrics still score high; MSD-Score should assign correspondingly low scores if the claim holds.
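
A minimal harness for that test is sketched below: pairwise accuracy over matched (correct, corrupted) caption pairs, where a metric "wins" a pair if it scores the human-preferred caption higher. The image identifiers, captions, scores, and the `pairwise_accuracy` helper are hypothetical placeholders; any scorer, from mean-pooled cosine to MSD-Score, can be plugged in.

```python
# Pairwise-accuracy harness for contrast pairs with subtle caption errors.
# The pairs and scores below are synthetic placeholders, not the paper's benchmark.
from typing import Callable, Sequence, Tuple

def pairwise_accuracy(
    pairs: Sequence[Tuple[str, str, str]],   # (image_id, correct_caption, corrupted_caption)
    score: Callable[[str, str], float],      # score(image_id, caption): higher = better match
) -> float:
    # Fraction of pairs where the metric ranks the human-preferred caption above the corrupted one.
    wins = sum(score(img, good) > score(img, bad) for img, good, bad in pairs)
    return wins / len(pairs)

# Hypothetical contrast pairs (hallucinated object, wrong attribute).
pairs = [
    ("img_001", "a dog sleeping on a sofa", "a dog and a cat sleeping on a sofa"),
    ("img_002", "a red bus parked near a station", "a blue bus parked near a station"),
]
# Stand-in scorer; in practice plug in a global-similarity metric or MSD-Score.
toy_scores = {
    ("img_001", "a dog sleeping on a sofa"): 0.71,
    ("img_001", "a dog and a cat sleeping on a sofa"): 0.69,
    ("img_002", "a red bus parked near a station"): 0.66,
    ("img_002", "a blue bus parked near a station"): 0.67,
}
print(pairwise_accuracy(pairs, lambda img, cap: toy_scores[(img, cap)]))  # 0.5
```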

Figures

Figures reproduced from arXiv: 2605.06080 by Haojie Zhang, Jiazhi Xia, Lianlei Shan, Linna Zhang, Shichao Kan, Xuyang Zhang, Yigang Cen, Yixiong Liang, Zhe Qu, Zhe Zhu.

Figure 1: From pointwise similarity to distributional alignment. (a) Global similarity methods (e.g. CLIPScore [1]) encode an image and a caption into single vectors via mean pooling and compute their similarity. Hallucinated content (e.g., “a cat”) is diluted by dominant semantics, yielding similar scores to correct captions. (b) This is because mean pooling causes information collapse, discarding patch-token corre… view at source ↗
Figure 2: Interpretability and benchmark evaluation of MSD-Score across short and long captions. Panels (a) and (b) show KL-decomposition-based heatmaps for short- and long-form captions. MSD-Score localizes hallucinated tokens (red text, e.g., “a cup of milk”, “bright neon pink in color and have a fluffy, fur-covered texture”) and unsupported details; panels (c) and (d) summarize benchmark behavior on SugarCrepe [6… view at source ↗
Figure 3: Overview of the proposed MSD-Score framework for uncertainty-aware vision–language discrepancy modeling. We introduce a multi-scale scoring paradigm that unifies global alignment with local distributional verification. (A) A frozen CLIP encoder and LLaMA model provide patch- and token-level embeddings, while a learnable alignment module maps visual patches into a shared semantic space and is trained with c… view at source ↗
Figure 4: Effect of the reconstruction objective on local semantic alignment. We compare a contrastive-only aligner with our full contrastive–generative… view at source ↗
Figure 5: Seed stability of local token clustering (caption-level ARI). Each… view at source ↗
Figure 6: SugarCrepe evaluation on fine-grained compositional errors in… view at source ↗
Figure 7: COCO-CF performance by caption source. Pairwise accuracy… view at source ↗
Figure 8: Collapse of EM responsibility entropy under adaptive… view at source ↗
Figure 9: KL-decomposition-based interpretability of MSD showing se… view at source ↗
Figure 10: Qualitative examples. Failure modes including unsupported details and missing visual evidence, highlighted by token/region attribution. view at source ↗
Original abstract

Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric that models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere. Instead of treating each modality as a single point, MSD-Score formulates image-text matching as a multi-scale distributional scoring problem. Semantic discrepancies are quantified via a weighted bi-directional KL divergence and combined with global similarity in a multi-scale framework for both single- and multi-candidate evaluations. Extensive experiments show that MSD-Score achieves state-of-the-art correlation with human judgments among reference-free metrics. Beyond accuracy, its probabilistic formulation yields transparent and decomposable diagnostics of local grounding errors, providing a deterministic complementary signal to holistic similarity metrics and judge-based evaluators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MSD-Score, a reference-free image caption evaluation metric. It models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere, formulates matching as multi-scale distributional scoring, quantifies discrepancies via weighted bi-directional KL divergence, and combines this with global similarity for both single- and multi-candidate settings. The central claim is that extensive experiments demonstrate state-of-the-art correlation with human judgments among reference-free metrics, while the probabilistic formulation additionally yields transparent, decomposable diagnostics of local grounding errors such as hallucinations and incorrect relations.

Significance. If the empirical claims hold after addressing validation gaps, MSD-Score would represent a meaningful advance in reference-free caption evaluation by moving beyond single-point global similarity to explicit distributional discrepancy modeling. The vMF mixture plus weighted bi-KL construction could supply interpretable local signals that complement existing holistic metrics and LLM judges, particularly for fine-grained semantic mismatches.

major comments (3)
  1. [Abstract and §3 (method)] The SOTA correlation claim is presented without any equations, hyperparameter schedules, or experimental protocol details, so it is impossible to determine whether the reported gains are supported by the data or driven by post-hoc choices on the evaluation sets.
  2. [§3.2 and §4 (experiments)] The multi-scale weights, mixture component count, and divergence parameters are free parameters; the manuscript does not state whether they were tuned on the same human judgment data used to compute the final correlations, leaving the central performance claim vulnerable to circularity.
  3. [§4.3 (ablation and qualitative analysis)] No controlled experiment isolates the contribution of the vMF mixture + weighted bi-KL terms versus simpler baselines (e.g., mean-pooled cosine or single-scale global similarity), nor shows that local KL terms selectively increase on hallucinated objects, missing attributes, or incorrect relations; without this, the modeling choice is not demonstrated to be load-bearing for the reported result.
minor comments (2)
  1. [§3.1] Notation: define the precise form of the weighted bi-directional KL (including temperature and weighting scheme) and the multi-scale aggregation operator before the experimental section.
  2. [Table 1 and Figure 2] Report per-dataset standard deviations or confidence intervals alongside mean correlations to allow assessment of statistical significance of the SOTA margins.
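
For the second minor point, one standard way to produce such intervals is a nonparametric bootstrap over evaluation examples. The sketch below uses synthetic stand-in scores and an assumed helper name (`bootstrap_tau_ci`), not the paper's data or code.

```python
# Percentile-bootstrap confidence interval for the metric-human correlation.
# Scores below are synthetic placeholders standing in for per-example ratings.
import numpy as np
from scipy.stats import kendalltau

def bootstrap_tau_ci(metric_scores, human_scores, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    metric_scores = np.asarray(metric_scores)
    human_scores = np.asarray(human_scores)
    n = len(metric_scores)
    taus = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample examples with replacement
        tau_b, _ = kendalltau(metric_scores[idx], human_scores[idx])
        taus.append(tau_b)
    tau, _ = kendalltau(metric_scores, human_scores)
    lo, hi = np.quantile(taus, [alpha / 2, 1 - alpha / 2])
    return tau, (lo, hi)

# Synthetic stand-in: correlated metric and human scores for 200 captions.
rng = np.random.default_rng(1)
human = rng.normal(size=200)
metric = human + rng.normal(scale=0.8, size=200)
tau, (lo, hi) = bootstrap_tau_ci(metric, human)
print(f"Kendall tau = {tau:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```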

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications on our experimental design and indicate revisions to improve transparency.

Point-by-point responses
  1. Referee: [Abstract and §3 (method)] The SOTA correlation claim is presented without any equations, hyperparameter schedules, or experimental protocol details, so it is impossible to determine whether the reported gains are supported by the data or driven by post-hoc choices on the evaluation sets.

    Authors: Section 3 presents the full vMF mixture formulation, multi-scale scoring equations, and weighted bi-directional KL divergence with all mathematical details. Section 4 specifies the evaluation protocol, datasets, and correlation computation. To enhance accessibility, we will revise the abstract to reference these components explicitly and include a brief protocol summary. Hyperparameters were fixed prior to final testing using a separate validation split. revision: partial

  2. Referee: [§3.2 and §4 (experiments)] The multi-scale weights, mixture component count, and divergence parameters are free parameters; the manuscript does not state whether they were tuned on the same human judgment data used to compute the final correlations, leaving the central performance claim vulnerable to circularity.

    Authors: We agree this requires explicit clarification. In the revision we will add text in §4 stating that multi-scale weights, mixture count (K=5), and KL parameters were selected via grid search on a held-out validation portion of the human judgment data, disjoint from the test sets used for the reported SOTA correlations. revision: yes

  3. Referee: [§4.3 (ablation and qualitative analysis)] No controlled experiment isolates the contribution of the vMF mixture + weighted bi-KL terms versus simpler baselines (e.g., mean-pooled cosine or single-scale global similarity), nor shows that local KL terms selectively increase on hallucinated objects, missing attributes, or incorrect relations; without this, the modeling choice is not demonstrated to be load-bearing for the reported result.

    Authors: We will expand §4.3 with a controlled ablation comparing MSD-Score against mean-pooled cosine and single-scale global similarity baselines. We will also add quantitative analysis on annotated error subsets demonstrating elevated local KL values for hallucinations, missing attributes, and incorrect relations, confirming the distributional terms contribute to the observed gains. revision: yes

Circularity Check

0 steps flagged

No circularity: metric definition is independent of evaluation data

Full rationale

The paper defines MSD-Score via explicit modeling choices (vMF mixtures on the hypersphere, weighted bidirectional KL at multiple scales, plus global similarity) and then reports empirical correlations on human judgment benchmarks. No equations or text in the provided abstract or description show parameters being fitted to the correlation test sets, no self-citation chain that imports uniqueness, and no renaming of prior results as new derivations. The central claim remains a modeling proposal evaluated externally rather than a tautological restatement of inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on several modeling assumptions and tunable elements whose values are not derived from first principles or external benchmarks.

free parameters (2)
  • multi-scale weights
    Weights combining different scales and the global similarity term are introduced without derivation and are likely chosen or fitted.
  • mixture component count
    Number of vMF components per patch or token distribution is a modeling choice that must be set.
axioms (2)
  • domain assumption: Image patch and text token embeddings can be faithfully represented as von Mises-Fisher mixtures on the unit hypersphere
    Core modeling premise stated in the abstract.
  • domain assumption: Weighted bi-directional KL divergence between these mixtures quantifies semantic discrepancies relevant to human judgment
    The justification for using this particular divergence as the mismatch measure.
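
To make the first assumption and the mixture-count parameter concrete, here is a compact EM fit for a von Mises-Fisher mixture in the spirit of Banerjee et al. [35], showing exactly where the component count K and the concentrations kappa enter. It is an illustration on toy, low-dimensional data, not the paper's fitting procedure.

```python
# Compact EM for a mixture of von Mises-Fisher distributions (after Banerjee et al. [35]).
# Illustrative only; real CLIP/LLaMA embeddings are high-dimensional and need
# numerically careful Bessel-function evaluation.
import numpy as np
from scipy.special import ive, logsumexp

def log_vmf(X, mu, kappa):
    d = X.shape[1]
    # log C_d(kappa) + kappa * mu^T x, with log I_v(k) = log ive(v, k) + k.
    log_c = (d / 2 - 1) * np.log(kappa) - (d / 2) * np.log(2 * np.pi) \
            - (np.log(ive(d / 2 - 1, kappa)) + kappa)
    return log_c + kappa * (X @ mu)

def fit_movmf(X, K=3, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    n, d = X.shape
    mus = X[rng.choice(n, K, replace=False)]      # initialize mean directions from data points
    kappas = np.full(K, 10.0)
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] proportional to weight_k * vMF_k(x_n).
        log_r = np.stack([np.log(weights[k]) + log_vmf(X, mus[k], kappas[k]) for k in range(K)], axis=1)
        log_r -= logsumexp(log_r, axis=1, keepdims=True)
        r = np.exp(log_r)
        # M-step: update weights, mean directions, and concentrations.
        weights = r.mean(axis=0)
        for k in range(K):
            s = r[:, k] @ X                        # weighted resultant vector
            r_norm = np.linalg.norm(s) / r[:, k].sum()
            mus[k] = s / np.linalg.norm(s)
            kappas[k] = (r_norm * d - r_norm ** 3) / (1 - r_norm ** 2 + 1e-8)  # Banerjee approximation
    return weights, mus, kappas

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))                      # toy stand-in for patch/token embeddings
w, mus, kappas = fit_movmf(X, K=3)
print(np.round(w, 3), np.round(kappas, 1))
```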

pith-pipeline@v0.9.0 · 5473 in / 1432 out tokens · 56330 ms · 2026-05-08T14:06:28.787703+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 16 canonical work pages · 8 internal anchors

  1. [1] J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, and Y. Choi, “CLIPScore: A reference-free evaluation metric for image captioning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 7514–7528.
  2. [2] P. Kaul, Z. Li, H. Yang, Y. Dukler, A. Swaminathan, C. Taylor, and S. Soatto, “Throne: An object-based hallucination benchmark for the free-form generations of large vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27228–27238.
  3. [3] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, “Microsoft COCO captions: Data collection and evaluation server,” arXiv preprint arXiv:1504.00325, 2015.
  4. [4] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
  5. [5] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual Genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017.
  6. [6] C.-Y. Hsieh, J. Zhang, Z. Ma, A. Kembhavi, and R. Krishna, “SugarCrepe: Fixing hackable benchmarks for vision-language compositionality,” Advances in Neural Information Processing Systems, vol. 36, pp. 31096–31116, 2023.
  7. [7] K. Cheng, W. Song, J. Fan, Z. Ma, Q. Sun, F. Xu, C. Yan, N. Chen, J. Zhang, and J. Chen, “CapArena: Benchmarking and analyzing detailed image captioning in the LLM era,” arXiv preprint arXiv:2503.12329, 2025.
  8. [8] B. Zhang, P. Zhang, X. Dong, Y. Zang, and J. Wang, “Long-CLIP: Unlocking the long-text capability of CLIP,” in European Conference on Computer Vision, 2024, pp. 310–325.
  9. [9] Y. Lee, I. Park, and M. Kang, “FLEUR: An explainable reference-free evaluation metric for image captioning using a large multimodal model,” arXiv preprint arXiv:2406.06004, 2024.
  10. [10] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024.
  11. [11] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
  12. [12] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
  13. [13] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out, 2004, pp. 74–81.
  14. [14] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
  15. [15] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic propositional image caption evaluation,” in European Conference on Computer Vision, 2016, pp. 382–398.
  16. [16] H. Dong, J. Li, B. Wu, J. Wang, Y. Zhang, and H. Guo, “Benchmarking and improving detail image caption,” arXiv preprint arXiv:2405.19092, 2024.
  17. [17] W. Chai, E. Song, Y. Du, C. Meng, V. Madhavan, O. Bar-Tal, J.-N. Hwang, S. Xie, and C. D. Manning, “AuroraCap: Efficient, performant video detailed captioning and a new benchmark,” arXiv preprint arXiv:2410.03051, 2024.
  18. [18] H. Lee, S. Yoon, F. Dernoncourt, T. Bui, and K. Jung, “UMIC: An unreferenced metric for image captioning via contrastive learning,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2021, pp. 220–226.
  19. [19] S. Sarto, M. Barraco, M. Cornia, L. Baraldi, and R. Cucchiara, “Positive-augmented contrastive learning for image and video captioning evaluation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6914–6924.
  20. [20] A. Hu, S. Chen, L. Zhang, and Q. Jin, “Infometic: An informative metric for reference-free image caption evaluation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 3171–3185.
  21. [21] S. Sarto, M. Cornia, L. Baraldi, and R. Cucchiara, “Bridge: Bridging gaps in image captioning evaluation with stronger visual cues,” in European Conference on Computer Vision, 2024.
  22. [22] Z. Zeng, J. Sun, H. Zhang, T. Wen, B. Chen et al., “Hicescore: A hierarchical metric for image captioning evaluation,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024.
  23. [23] H. Kim, S. Kim, J. Jeong, Y. Cho, and S. Cho, “Expert: An explainable image captioning evaluation metric with structured explanations,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 26642–26657.
  24. [24] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings to document distances,” in International Conference on Machine Learning, 2015, pp. 957–966.
  25. [25] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in Neural Information Processing Systems, vol. 26, 2013.
  26. [26] T. Defard, A. Setkov, A. Loesch, and R. Audigier, “Padim: a patch distribution modeling framework for anomaly detection and localization,” in International Conference on Pattern Recognition, 2021, pp. 475–489.
  27. [27] C. Liang, W. Wang, J. Miao, and Y. Yang, “Gmmseg: Gaussian mixture based generative semantic segmentation models,” Advances in Neural Information Processing Systems, vol. 35, pp. 31360–31375, 2022.
  28. [28] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024.
  29. [29] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu et al., “LLaVA-OneVision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024.
  30. [30] T. Xiong, X. Wang, D. Guo, Q. Ye, H. Fan, Q. Gu, H. Huang, and C. Li, “LLaVA-Critic: Learning to evaluate multimodal models,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 13618–13628.
  31. [31] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-Eval: NLG evaluation using GPT-4 with better human alignment,” arXiv preprint arXiv:2303.16634, 2023.
  32. [32] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
  33. [33] M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa et al., “SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,” arXiv preprint arXiv:2502.14786, 2025.
  34. [34] A. Ananthram, E. Stengel-Eskin, L. A. Bradford, J. Demarest, A. Purvis, K. Krut, R. Stein, R. E. Pantalony, M. Bansal, and K. McKeown, “Posh: Using scene graphs to guide LLMs-as-a-judge for detailed image descriptions,” arXiv preprint arXiv:2510.19060, 2025.
  35. [35] A. Banerjee, I. S. Dhillon, J. Ghosh, S. Sra, and G. Ridgeway, “Clustering on the unit hypersphere using von Mises-Fisher distributions,” Journal of Machine Learning Research, vol. 6, no. 9, 2005.
  36. [36] M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013.
  37. [37] S. Aditya, Y. Yang, C. Baral, C. Fermuller, and Y. Aloimonos, “From images to sentences through scene description graphs using commonsense reasoning and knowledge,” arXiv preprint arXiv:1511.03292, 2015.
  38. [38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision, 2014, pp. 740–755.
  39. [39] Y. Wada, K. Kaneda, D. Saito, and K. Sugiura, “Polos: Multimodal metric learning from human feedback for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13559–13568.
  40. [40] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in Advances in Neural Information Processing Systems, 2023.
  41. [41] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-VL technical report,” arXiv preprint arXiv:2502.13923, 2025.
  42. [42] J. Rückert, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, S. Koitka, O. Pelka, A. Ben Abacha, A. G. Seco de Herrera, H. Müller, P. A. Horn, F. Nensa, and C. M. Friedrich, “ROCOv2: Radiology objects in context version 2, an updated multimodal image dataset,” arXiv preprint arXiv:2405.10004, 2024.
  43. [43] X. Lu, B. Wang, X. Zheng, and X. Li, “Exploring models and data for remote sensing image caption generation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 4, pp. 2183–2195, 2018.
  44. [44] S. Zhang, Y. Xu, N. Usuyama, H. Xu, J. Bagga, R. Tinn, S. Preston, R. Rao, M. Wei, N. Valluri, C. Wong, A. Tupini, Y. Wang, M. Mazzola, S. Shukla, L. Liden, J. Gao, A. Crabtree, B. Piening, C. Bifulco, M. P. Lungren, T. Naumann, S. Wang, and H. Poon, “BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs,” …
  45. [45] F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, Q. Ye, L. Fu, and J. Zhou, “RemoteCLIP: A vision language foundation model for remote sensing,” arXiv preprint arXiv:2306.11029, 2023.
  46. [46] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al., “InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 24185–24198.