Re-evaluating Automatic Metrics for Image Captioning

2016 · cs.CL · arXiv 1612.07600

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

The task of generating natural language descriptions from images has received a lot of attention in recent years. Consequently, it is becoming increasingly important to evaluate such image captioning approaches in an automatic manner. In this paper, we provide an in-depth evaluation of the existing image captioning metrics through a series of carefully designed experiments. Moreover, we explore the utilization of the recently proposed Word Mover's Distance (WMD) document metric for the purpose of image captioning. Our findings outline the differences and/or similarities between metrics and their relative robustness by means of extensive correlation, accuracy and distraction based evaluations. Our results also demonstrate that WMD provides strong advantages over other metrics.

representative citing papers

CaptionQA: Is Your Caption as Useful as the Image Itself?

cs.CV · 2025-11-26 · conditional · novelty 7.0

CaptionQA is a new benchmark with 33,027 questions across natural, document, e-commerce, and embodied AI domains that measures how much utility model-generated captions retain compared to original images when used by LLMs for downstream tasks.

Search-based Testing of Vision Language Models for In-Car Scene Understanding

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

ISU-Test combines rendering-based scene generation with search-based testing to produce up to 10x higher failure rates and 3.6x higher failure coverage in VLMs for in-car scene understanding compared to random generation.
