CaptionQA: Is Your Caption as Useful as the Image Itself?

· 2025 · cs.CV · arXiv 2511.21025

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand-in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks lower by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on LLaVA and Qwen models.

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

cs.LG · 2026-05-19

citing papers explorer

Showing 2 of 2 citing papers.

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning cs.CV · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on LLaVA and Qwen models.
ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison cs.LG · 2026-05-19 · unreviewed · ref 33 · internal anchor

CaptionQA: Is Your Caption as Useful as the Image Itself?

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer