Recognition: unknown
MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation
Pith reviewed 2026-05-08 14:06 UTC · model grok-4.3
The pith
MSD-Score evaluates image captions without references by treating patch and token embeddings as von Mises-Fisher mixtures and scoring them distributionally at multiple scales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MSD-Score formulates reference-free image-text matching as a multi-scale distributional scoring task in which both image patch embeddings and text token embeddings are represented as von Mises-Fisher mixtures; semantic discrepancies are quantified by weighted bi-directional KL divergence and fused with global similarity to produce a score that correlates more strongly with human judgments while exposing local grounding errors.
What carries the argument
Multi-scale distributional scoring that represents embeddings as von Mises-Fisher mixtures and quantifies discrepancies with weighted bi-directional KL divergence.
If this is right
- Captions can be ranked or selected without any reference text while still tracking human preferences closely.
- The same framework supports evaluation of one caption or a set of candidate captions.
- Error signals become decomposable so specific mismatches such as missing relations can be localized.
- The probabilistic scores supply a deterministic complement to purely holistic similarity measures.
Where Pith is reading between the lines
- The mixture representation may prove useful for other tasks that require checking alignment at both coarse and fine levels, such as visual question answering.
- Pairing the distributional scores with large-model judges could yield hybrid evaluators that combine transparency with broad coverage.
- The multi-scale structure suggests a natural way to analyze consistency across different levels of semantic abstraction.
Load-bearing premise
Modeling embeddings as von Mises-Fisher mixtures and measuring their multi-scale KL discrepancies accurately reflects the fine-grained semantic mismatches that humans notice in captions.
What would settle it
A collection of image-caption pairs containing subtle errors such as wrong attributes or hallucinated objects where human raters assign low quality but global similarity metrics still score high; MSD-Score should assign correspondingly low scores if the claim holds.
Figures
read the original abstract
Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric that models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere. Instead of treating each modality as a single point, MSD-Score formulates image-text matching as a multi-scale distributional scoring problem. Semantic discrepancies are quantified via a weighted bi-directional KL divergence and combined with global similarity in a multi-scale framework for both single- and multi-candidate evaluations. Extensive experiments show that MSD-Score achieves state-of-the-art correlation with human judgments among reference-free metrics. Beyond accuracy, its probabilistic formulation yields transparent and decomposable diagnostics of local grounding errors, providing a deterministic complementary signal to holistic similarity metrics and judge-based evaluators.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MSD-Score, a reference-free image caption evaluation metric. It models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere, formulates matching as multi-scale distributional scoring, quantifies discrepancies via weighted bi-directional KL divergence, and combines this with global similarity for both single- and multi-candidate settings. The central claim is that extensive experiments demonstrate state-of-the-art correlation with human judgments among reference-free metrics, while the probabilistic formulation additionally yields transparent, decomposable diagnostics of local grounding errors such as hallucinations and incorrect relations.
Significance. If the empirical claims hold after addressing validation gaps, MSD-Score would represent a meaningful advance in reference-free caption evaluation by moving beyond single-point global similarity to explicit distributional discrepancy modeling. The vMF mixture plus weighted bi-KL construction could supply interpretable local signals that complement existing holistic metrics and LLM judges, particularly for fine-grained semantic mismatches.
major comments (3)
- [Abstract and §3] Abstract and §3 (method): the SOTA correlation claim is presented without any equations, hyperparameter schedules, or experimental protocol details, so it is impossible to determine whether the reported gains are supported by the data or driven by post-hoc choices on the evaluation sets.
- [§3.2 and §4] §3.2 and §4 (experiments): the multi-scale weights, mixture component count, and divergence parameters are free parameters; the manuscript does not state whether they were tuned on the same human judgment data used to compute the final correlations, leaving the central performance claim vulnerable to circularity.
- [§4.3] §4.3 (ablation and qualitative analysis): no controlled experiment isolates the contribution of the vMF mixture + weighted bi-KL terms versus simpler baselines (e.g., mean-pooled cosine or single-scale global similarity), nor shows that local KL terms selectively increase on hallucinated objects, missing attributes, or incorrect relations; without this, the modeling choice is not demonstrated to be load-bearing for the reported result.
minor comments (2)
- [§3.1] Notation in §3.1: define the precise form of the weighted bi-directional KL (including temperature and weighting scheme) and the multi-scale aggregation operator before the experimental section.
- [Table 1 and Figure 2] Table 1 and Figure 2: report per-dataset standard deviations or confidence intervals alongside mean correlations to allow assessment of statistical significance of the SOTA margins.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications on our experimental design and indicate revisions to improve transparency.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (method): the SOTA correlation claim is presented without any equations, hyperparameter schedules, or experimental protocol details, so it is impossible to determine whether the reported gains are supported by the data or driven by post-hoc choices on the evaluation sets.
Authors: Section 3 presents the full vMF mixture formulation, multi-scale scoring equations, and weighted bi-directional KL divergence with all mathematical details. Section 4 specifies the evaluation protocol, datasets, and correlation computation. To enhance accessibility, we will revise the abstract to reference these components explicitly and include a brief protocol summary. Hyperparameters were fixed prior to final testing using a separate validation split. revision: partial
-
Referee: [§3.2 and §4] §3.2 and §4 (experiments): the multi-scale weights, mixture component count, and divergence parameters are free parameters; the manuscript does not state whether they were tuned on the same human judgment data used to compute the final correlations, leaving the central performance claim vulnerable to circularity.
Authors: We agree this requires explicit clarification. In the revision we will add text in §4 stating that multi-scale weights, mixture count (K=5), and KL parameters were selected via grid search on a held-out validation portion of the human judgment data, disjoint from the test sets used for the reported SOTA correlations. revision: yes
-
Referee: [§4.3] §4.3 (ablation and qualitative analysis): no controlled experiment isolates the contribution of the vMF mixture + weighted bi-KL terms versus simpler baselines (e.g., mean-pooled cosine or single-scale global similarity), nor shows that local KL terms selectively increase on hallucinated objects, missing attributes, or incorrect relations; without this, the modeling choice is not demonstrated to be load-bearing for the reported result.
Authors: We will expand §4.3 with a controlled ablation comparing MSD-Score against mean-pooled cosine and single-scale global similarity baselines. We will also add quantitative analysis on annotated error subsets demonstrating elevated local KL values for hallucinations, missing attributes, and incorrect relations, confirming the distributional terms contribute to the observed gains. revision: yes
Circularity Check
No circularity: metric definition is independent of evaluation data
full rationale
The paper defines MSD-Score via explicit modeling choices (vMF mixtures on the hypersphere, weighted bidirectional KL at multiple scales, plus global similarity) and then reports empirical correlations on human judgment benchmarks. No equations or text in the provided abstract or description show parameters being fitted to the correlation test sets, no self-citation chain that imports uniqueness, and no renaming of prior results as new derivations. The central claim remains a modeling proposal evaluated externally rather than a tautological restatement of inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- multi-scale weights
- mixture component count
axioms (2)
- domain assumption Image patch and text token embeddings can be faithfully represented as von Mises-Fisher mixtures on the unit hypersphere
- domain assumption Weighted bi-directional KL divergence between these mixtures quantifies semantic discrepancies relevant to human judgment
Reference graph
Works this paper leans on
-
[1]
Clip- score: A reference-free evaluation metric for image captioning,
J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, and Y. Choi, “Clip- score: A reference-free evaluation metric for image captioning,” inProceedings of the 2021 conference on empirical methods in natural language processing, 2021, pp. 7514–7528
2021
-
[2]
Throne: An object-based hallucination benchmark for the free-form generations of large vision-language models,
P . Kaul, Z. Li, H. Yang, Y. Dukler, A. Swaminathan, C. Taylor, and S. Soatto, “Throne: An object-based hallucination benchmark for the free-form generations of large vision-language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27 228–27 238
2024
-
[3]
Microsoft COCO Captions: Data Collection and Evaluation Server
X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P . Doll ´ar, and C. L. Zitnick, “Microsoft coco captions: Data collection and evaluation server,”arXiv preprint arXiv:1504.00325, 2015
work page internal anchor Pith review arXiv 2015
-
[4]
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,
P . Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,”Transactions of the association for computational linguistics, vol. 2, pp. 67–78, 2014
2014
-
[5]
Visual genome: Connecting language and vision using crowdsourced dense image annotations,
R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shammaet al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,”International journal of computer vision, vol. 123, no. 1, pp. 32–73, 2017
2017
-
[6]
Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality,
C.-Y. Hsieh, J. Zhang, Z. Ma, A. Kembhavi, and R. Krishna, “Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality,”Advances in neural information processing systems, vol. 36, pp. 31 096–31 116, 2023. 17
2023
-
[7]
Caparena: Benchmarking and analyzing detailed image captioning in the llm era,
K. Cheng, W. Song, J. Fan, Z. Ma, Q. Sun, F. Xu, C. Yan, N. Chen, J. Zhang, and J. Chen, “Caparena: Benchmarking and analyzing detailed image captioning in the llm era,”arXiv preprint arXiv:2503.12329, 2025
-
[8]
Long-clip: Unlocking the long-text capability of clip,
B. Zhang, P . Zhang, X. Dong, Y. Zang, and J. Wang, “Long-clip: Unlocking the long-text capability of clip,” inEuropean conference on computer vision, 2024, pp. 310–325
2024
-
[9]
Y. Lee, I. Park, and M. Kang, “Fleur: An explainable reference-free evaluation metric for image captioning using a large multimodal model,”arXiv preprint arXiv:2406.06004, 2024
-
[10]
A. Hurst, A. Lerer, A. P . Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radfordet al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review arXiv 2024
-
[11]
Bleu: a method for automatic evaluation of machine translation,
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” inProceedings of the 40th annual meeting of the Association for Computational Linguis- tics, 2002, pp. 311–318
2002
-
[12]
Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,
S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65– 72
2005
-
[13]
Rouge: A package for automatic evaluation of sum- maries,
C.-Y. Lin, “Rouge: A package for automatic evaluation of sum- maries,” inText summarization branches out, 2004, pp. 74–81
2004
-
[14]
Cider: Consensus-based image description evaluation,
R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575
2015
-
[15]
Spice: Semantic propositional image caption evaluation,
P . Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic propositional image caption evaluation,” inEuropean conference on computer vision, 2016, pp. 382–398
2016
-
[16]
Benchmarking and improving detail image caption
H. Dong, J. Li, B. Wu, J. Wang, Y. Zhang, and H. Guo, “Bench- marking and improving detail image caption,”arXiv preprint arXiv:2405.19092, 2024
-
[17]
Auroracap: Efficient, performant video detailed captioning and a new benchmark
W. Chai, E. Song, Y. Du, C. Meng, V . Madhavan, O. Bar-Tal, J.- N. Hwang, S. Xie, and C. D. Manning, “Auroracap: Efficient, performant video detailed captioning and a new benchmark,” arXiv preprint arXiv:2410.03051, 2024
-
[18]
Umic: An unreferenced metric for image captioning via contrastive learn- ing,
H. Lee, S. Yoon, F. Dernoncourt, T. Bui, and K. Jung, “Umic: An unreferenced metric for image captioning via contrastive learn- ing,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2021, pp. 220–226
2021
-
[19]
Positive-augmented contrastive learning for image and video captioning evaluation,
S. Sarto, M. Barraco, M. Cornia, L. Baraldi, and R. Cucchiara, “Positive-augmented contrastive learning for image and video captioning evaluation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6914–6924
2023
-
[20]
Infometic: An informative metric for reference-free image caption evaluation,
A. Hu, S. Chen, L. Zhang, and Q. Jin, “Infometic: An informative metric for reference-free image caption evaluation,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 3171–3185
2023
-
[21]
Bridge: Bridging gaps in image captioning evaluation with stronger visual cues,
S. Sarto, M. Cornia, L. Baraldi, and R. Cucchiara, “Bridge: Bridging gaps in image captioning evaluation with stronger visual cues,” in European Conference on Computer Vision, 2024
2024
-
[22]
Hicescore: A hi- erarchical metric for image captioning evaluation,
Z. Zeng, J. Sun, H. Zhang, T. Wen, B. Chenet al., “Hicescore: A hi- erarchical metric for image captioning evaluation,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024
2024
-
[23]
Expert: An explain- able image captioning evaluation metric with structured explana- tions,
H. Kim, S. Kim, J. Jeong, Y. Cho, and S. Cho, “Expert: An explain- able image captioning evaluation metric with structured explana- tions,” inFindings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 26 642–26 657
2025
-
[24]
From word embeddings to document distances,
M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings to document distances,” inInternational conference on machine learning, 2015, pp. 957–966
2015
-
[25]
Sinkhorn distances: Lightspeed computation of opti- mal transport,
M. Cuturi, “Sinkhorn distances: Lightspeed computation of opti- mal transport,”Advances in neural information processing systems, vol. 26, 2013
2013
-
[26]
Padim: a patch distribution modeling framework for anomaly detection and localization,
T. Defard, A. Setkov, A. Loesch, and R. Audigier, “Padim: a patch distribution modeling framework for anomaly detection and localization,” inInternational conference on pattern recognition, 2021, pp. 475–489
2021
-
[27]
Gmmseg: Gaussian mix- ture based generative semantic segmentation models,
C. Liang, W. Wang, J. Miao, and Y. Yang, “Gmmseg: Gaussian mix- ture based generative semantic segmentation models,”Advances in Neural Information Processing Systems, vol. 35, pp. 31 360–31 375, 2022
2022
-
[28]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
P . Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Geet al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review arXiv 2024
-
[29]
LLaVA-OneVision: Easy Visual Task Transfer
B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P . Zhang, Y. Li, Z. Liuet al., “Llava-onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review arXiv 2024
-
[30]
Llava-critic: Learning to evaluate multimodal models,
T. Xiong, X. Wang, D. Guo, Q. Ye, H. Fan, Q. Gu, H. Huang, and C. Li, “Llava-critic: Learning to evaluate multimodal models,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 13 618–13 628
2025
-
[31]
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-eval: Nlg evaluation using gpt-4 with better human alignment,”arXiv preprint arXiv:2303.16634, 2023
work page internal anchor Pith review arXiv 2023
-
[32]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar- wal, G. Sastry, A. Askell, P . Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 8748–8763
2021
-
[33]
M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdul- mohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa et al., “Siglip 2: Multilingual vision-language encoders with im- proved semantic understanding, localization, and dense features,” arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review arXiv 2025
-
[34]
Posh: Using scene graphs to guide llms-as-a-judge for detailed image descriptions,
A. Ananthram, E. Stengel-Eskin, L. A. Bradford, J. Demarest, A. Purvis, K. Krut, R. Stein, R. E. Pantalony, M. Bansal, and K. McKeown, “Posh: Using scene graphs to guide llms-as-a-judge for detailed image descriptions,”arXiv preprint arXiv:2510.19060, 2025
-
[35]
Clustering on the unit hypersphere using von mises-fisher dis- tributions
A. Banerjee, I. S. Dhillon, J. Ghosh, S. Sra, and G. Ridgeway, “Clustering on the unit hypersphere using von mises-fisher dis- tributions.”Journal of Machine Learning Research, vol. 6, no. 9, 2005
2005
-
[36]
Framing image de- scription as a ranking task: Data, models and evaluation metrics,
M. Hodosh, P . Young, and J. Hockenmaier, “Framing image de- scription as a ranking task: Data, models and evaluation metrics,” Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013
2013
-
[37]
From images to sentences through scene description graphs using commonsense reasoning and knowledge,
S. Aditya, Y. Yang, C. Baral, C. Fermuller, and Y. Aloimonos, “From images to sentences through scene description graphs using commonsense reasoning and knowledge,”arXiv preprint arXiv:1511.03292, 2015
-
[38]
Microsoft coco: Common objects in context,
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P . Perona, D. Ramanan, P . Doll´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inEuropean conference on computer vision, 2014, pp. 740– 755
2014
-
[39]
Polos: Multimodal metric learning from human feedback for image captioning,
Y. Wada, K. Kaneda, D. Saito, and K. Sugiura, “Polos: Multimodal metric learning from human feedback for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 559–13 568
2024
-
[40]
Visual instruction tuning,
H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in Advances in Neural Information Processing Systems, 2023
2023
-
[41]
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P . Wang, S. Wang, J. Tanget al., “Qwen2.5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review arXiv 2025
-
[42]
Rocov2: Radiology objects in context version 2, an updated multimodal image dataset,
J. R ¨uckert, L. Bloch, R. Br ¨ungel, A. Idrissi-Yaghir, H. Sch¨afer, C. S. Schmidt, S. Koitka, O. Pelka, A. Ben Abacha, A. G. Seco de Her- rera, H. M¨uller, P . A. Horn, F. Nensa, and C. M. Friedrich, “Rocov2: Radiology objects in context version 2, an updated multimodal image dataset,”arXiv preprint arXiv:2405.10004, 2024
-
[43]
Exploring models and data for remote sensing image caption generation,
X. Lu, B. Wang, X. Zheng, and X. Li, “Exploring models and data for remote sensing image caption generation,”IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 4, pp. 2183–2195, 2018
2018
-
[44]
S. Zhang, Y. Xu, N. Usuyama, H. Xu, J. Bagga, R. Tinn, S. Preston, R. Rao, M. Wei, N. Valluri, C. Wong, A. Tupini, Y. Wang, M. Maz- zola, S. Shukla, L. Liden, J. Gao, A. Crabtree, B. Piening, C. Bifulco, M. P . Lungren, T. Naumann, S. Wang, and H. Poon, “Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-...
work page internal anchor Pith review arXiv 2023
-
[45]
Remoteclip: A vision language foundation model for remote sensing, 2024
F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, Q. Ye, L. Fu, and J. Zhou, “Remoteclip: A vision language foundation model for remote sensing,”arXiv preprint arXiv:2306.11029, 2023
-
[46]
Internvl: Scaling up vision foun- dation models and aligning for generic visual-linguistic tasks,
Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Luet al., “Internvl: Scaling up vision foun- dation models and aligning for generic visual-linguistic tasks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 24 185–24 198
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.