Re-evaluating Automatic Metrics for Image Captioning

Aykut Erdem; Erkut Erdem; Mert Kilickaya; Nazli Ikizler-Cinbis

arXiv:1612.07600v1 · pith:2PKIBQGA · submitted 2016-12-22 · cs.CL, cs.CV

Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, Erkut Erdem

classification cs.CL cs.CV

keywords captioning, image, metrics, automatic, accuracy, advantages, approaches, attention

Abstract

The task of generating natural language descriptions from images has received a lot of attention in recent years. Consequently, it is becoming increasingly important to evaluate such image captioning approaches in an automatic manner. In this paper, we provide an in-depth evaluation of the existing image captioning metrics through a series of carefully designed experiments. Moreover, we explore the utilization of the recently proposed Word Mover's Distance (WMD) document metric for the purpose of image captioning. Our findings outline the differences and/or similarities between metrics and their relative robustness by means of extensive correlation, accuracy and distraction based evaluations. Our results also demonstrate that WMD provides strong advantages over other metrics.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CaptionQA: Is Your Caption as Useful as the Image Itself?
cs.CV 2025-11 conditional novelty 7.0

CaptionQA is a new benchmark with 33,027 questions across natural, document, e-commerce, and embodied AI domains that measures how much utility model-generated captions retain compared to original images when used by ...

Search-based Testing of Vision Language Models for In-Car Scene Understanding
cs.CV 2026-07 unverdicted novelty 6.0

ISU-Test combines rendering-based scene generation with search-based testing to produce up to 10× higher failure rates and 3.6× higher failure coverage in VLMs for in-car scene understanding compared to random generation.

Search-based Testing of Vision Language Models for In-Car Scene Understanding
cs.CV 2026-07 conditional novelty 6.0

Search-based optimization over rendered in-cabin scenes finds up to 10× more VLM failures and up to 3.6× higher failure-cluster coverage than random generation for question answering and captioning.