When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation

Elisa Kreiss; Rhea Kapur; Robert Hawkins

arxiv: 2601.04609 · v2 · submitted 2026-01-08 · 💻 cs.CL

When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation

Rhea Kapur , Robert Hawkins , Elisa Kreiss This is my paper

Pith reviewed 2026-05-16 17:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords image description evaluationspecificitylength biasvision-language modelscontrast setshuman preferencescaptioning metricsinformation density

0 comments

The pith

Specificity in image descriptions cannot be reduced to length, as how detail is allocated within a fixed word budget determines how well it identifies the target image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that vision-language models often equate longer text with better descriptions, yet the two are separable. Specificity is defined as how effectively a description distinguishes one image from a set of alternatives. The authors built a controlled dataset holding length constant while changing the density of distinguishing information. Human raters consistently favored the versions that used their length budget more informatively. The results indicate that evaluation methods should target specificity directly rather than relying on length as a proxy.

Core claim

The central claim is that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. A description is more specific to the degree that it picks out the target image better than other possible images in a contrast set. When length is fixed, people reliably prefer descriptions that allocate words to unique visual features over those that remain vague or generic.

What carries the argument

A contrast set of images used to define specificity as the degree to which a description uniquely identifies the target image among alternatives.

If this is right

Evaluation benchmarks for image captioning should incorporate contrastive measures that test distinguishing power rather than word count.
Systems trained only to maximize length or fluency may still produce descriptions that fail to convey unique visual content.
Dataset construction that decouples length from information content can expose where current models allocate their description budget inefficiently.
Direct optimization for specificity offers a clearer path to accessible visual descriptions than length-based incentives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Automatic metrics could adopt contrast-set evaluation to reward descriptions for their ability to rule out similar images.
The same length-versus-content distinction may apply to other text-generation tasks where verbosity can obscure lack of substance.
Training objectives that reward unique identification in image sets could reduce the production of generic yet lengthy captions.

Load-bearing premise

Human preference judgments in the contrast-set task reliably track true specificity and are not swayed by other factors such as fluency or style.

What would settle it

A study in which people show no consistent preference for higher-information descriptions when length is controlled, or in which length-based scores predict human choices better than contrast-set specificity scores.

read the original abstract

Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, description specificity is often conflated with their length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Length and specificity in VLM image descriptions are distinct, and the paper shows that how the length budget is allocated affects specificity beyond raw count.

read the letter

The main thing to know is that this paper separates length from specificity in image descriptions for vision-language models. They define specificity via contrast sets, where a description scores higher if it better distinguishes the target image from alternatives. They then build a dataset that holds length fixed while varying information content and run human preference tests showing people favor the more specific versions even at matched lengths. The result is that simply controlling for length does not explain the differences; allocation matters.

Referee Report

2 major / 2 minor

Summary. The paper argues that specificity in VLM-generated image descriptions must be decoupled from length. Specificity is operationalized via contrast sets, where a description is more specific if it better distinguishes the target image from alternatives. The authors construct a length-controlled dataset that varies information density, then use human preference judgments to show that length alone does not explain specificity differences and that allocation of the length budget matters. The results support evaluation metrics that directly reward specificity rather than verbosity.

Significance. If the central claims hold after methodological clarification, the work offers a concrete way to improve VLM evaluation for accessibility and captioning tasks by separating informative density from mere length. The contrast-set approach and human validation provide a falsifiable basis for prioritizing specificity, which could influence both automatic metrics and training objectives in vision-language research.

major comments (2)

[§3] §3 (Dataset Construction): the description of contrast-set sampling, exact length-control procedure, and dataset statistics (number of images, descriptions per set, how vacuous vs. dense variants were generated) is insufficient. These details are load-bearing for the claim that length control alone cannot account for specificity differences, as the independence of the two factors rests on the construction.
[§4] §4 (Human Validation): no statistical tests, participant counts, effect sizes, or controls for length bias in the preference task are reported. This directly affects the reliability of the finding that humans prefer more specific descriptions regardless of length and that allocation within the budget matters.

minor comments (2)

[Abstract] Abstract: add one sentence with quantitative scale (e.g., number of contrast sets or participants) to give readers an immediate sense of study size.
[§2] Notation: the definition of specificity via contrast sets could be formalized with a short equation or pseudocode for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight areas where additional methodological detail will strengthen the paper. We address each major point below and will incorporate the requested clarifications in the revised manuscript.

read point-by-point responses

Referee: [§3] §3 (Dataset Construction): the description of contrast-set sampling, exact length-control procedure, and dataset statistics (number of images, descriptions per set, how vacuous vs. dense variants were generated) is insufficient. These details are load-bearing for the claim that length control alone cannot account for specificity differences, as the independence of the two factors rests on the construction.

Authors: We agree that §3 requires expansion for full reproducibility and to support the decoupling claim. In the revision we will add a detailed account of contrast-set sampling (including image selection criteria and how alternatives were chosen), the exact length-control procedure (word-count matching with explicit variance bounds and information-density manipulation), and complete dataset statistics (total images, descriptions per set, and the generation process distinguishing vacuous from dense variants). These additions will make the independence of length and specificity explicit. revision: yes
Referee: [§4] §4 (Human Validation): no statistical tests, participant counts, effect sizes, or controls for length bias in the preference task are reported. This directly affects the reliability of the finding that humans prefer more specific descriptions regardless of length and that allocation within the budget matters.

Authors: We acknowledge the absence of these statistical details in §4. The revised version will report participant numbers, the specific statistical tests performed (e.g., paired comparisons with p-values), effect sizes, and length-bias controls (identical word counts across compared descriptions, randomized presentation order, and explicit instructions to judges). These additions will provide quantitative support for the human-preference results. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper operationalizes specificity independently as discrimination within a contrast set, constructs a length-controlled dataset that varies information content, and validates via human preference judgments showing allocation effects. No equations, self-citations, or fitted parameters reduce the central claim to its inputs by construction; the result follows directly from the empirical contrast between length-matched descriptions differing in specificity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the central definition of specificity rests on one domain assumption with no free parameters or invented entities identified.

axioms (1)

domain assumption Specificity can be defined relative to a contrast set where a description picks out the target image better than other possible images.
Core definition stated in the abstract that underpins the dataset and human validation.

pith-pipeline@v0.9.0 · 5429 in / 1222 out tokens · 192468 ms · 2026-05-16T17:04:30.167228+00:00 · methodology

When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)