pith. sign in

arxiv: 2601.06165 · v2 · submitted 2026-01-07 · 💻 cs.CV · cs.AI

What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

Pith reviewed 2026-05-16 17:24 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language modelsunder-specified queriesquery explicitationbenchmarkreal-world queriesperformance evaluationretrieval augmentation
0
0 comments X

The pith

Under-specified queries cause even top vision-language models to score below 50 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current benchmarks test vision-language models mostly on clear, explicit questions, yet real users often pose informal and under-specified ones that rely on the image for missing context. The paper introduces HAERAE-Vision, a set of 653 such real-world questions from online communities, each with a paired explicit rewrite. Even the best models achieve less than 50 percent accuracy on the original questions. Making the queries explicit raises performance by 8 to 22 points, benefiting smaller models the most. Web search does not make up for the difference, showing that under-specification itself limits results more than model ability or lack of information.

Core claim

The paper establishes that a substantial portion of VLM difficulty stems from natural query under-specification instead of model capability. This follows from evaluations on the HAERAE-Vision benchmark of 653 paired original and explicit queries, where state-of-the-art models score under 50 percent on originals but improve markedly with explicitation, and retrieval fails to bridge the gap to explicit performance.

What carries the argument

The HAERAE-Vision benchmark of paired under-specified and explicit query variants that isolates the contribution of query specification to overall accuracy.

If this is right

  • Explicating the query improves accuracy by 8 to 22 points for the same image and model.
  • Smaller models gain larger relative improvements from explicit queries.
  • Adding web search to under-specified queries still underperforms explicit queries without search.
  • Standard benchmarks with explicit questions overestimate real-world VLM performance.
  • A large share of practical VLM errors arises from what users leave unsaid rather than from inherent model limits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems could incorporate automatic clarification steps to handle vague inputs more effectively.
  • Training data should include more examples of under-specified queries to build robustness.
  • The same under-specification problem likely affects other multimodal AI applications beyond vision-language tasks.
  • Benchmark design should shift toward including more natural, real-world query styles to close the deployment gap.

Load-bearing premise

The explicit rewrites accurately capture the intended meaning of the original user queries, and the selected 653 questions represent typical under-specified queries without significant selection or rewriting bias.

What would settle it

A follow-up study using independently collected under-specified queries with verified user intentions that shows no significant accuracy gain from explicitation would falsify the central claim.

read the original abstract

Current vision-language benchmarks predominantly feature well-structured questions with clear, explicit prompts. However, real user queries are often informal and underspecified. Users naturally leave much unsaid, relying on images to convey context. We introduce HAERAE-Vision, a benchmark of 653 real-world visual questions from Korean online communities (0.76% survival from 86K candidates), each paired with an explicit rewrite, yielding 1,306 query variants in total. Evaluating 39 VLMs, we find that even state-of-the-art models (GPT-5, Gemini 2.5 Pro) achieve under 50% on the original queries. Crucially, query explicitation alone yields 8 to 22 point improvements, with smaller models benefiting most. We further show that even with web search, under-specified queries underperform explicit queries without search, revealing that current retrieval cannot compensate for what users leave unsaid. Our findings demonstrate that a substantial portion of VLM difficulty stem from natural query under-specification instead of model capability, highlighting a critical gap between benchmark evaluation and real-world deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces HAERAE-Vision, a benchmark of 653 real-world under-specified visual questions (0.76% survival from 86K candidates) sourced from Korean online communities, each paired with an explicit rewrite to create 1,306 query variants. Evaluating 39 VLMs, it reports that even SOTA models (GPT-5, Gemini 2.5 Pro) score under 50% on originals, with explicitation yielding 8-22 point gains (larger for smaller models), and that web search fails to close the gap between under-specified and explicit queries.

Significance. If the rewrite fidelity and selection process hold, the result would demonstrate that a substantial fraction of VLM errors on real queries arises from under-specification rather than core capability limits, exposing a mismatch between current benchmarks and deployment conditions and motivating new evaluation protocols that incorporate natural query ambiguity.

major comments (2)
  1. [§3] §3 (Benchmark Construction): the 0.76% survival filter from 86K candidates is described only at high level with no explicit criteria, inter-annotator agreement statistics, or ablation of selection effects; this directly undermines the claim that the 653 questions are representative of typical under-specified real-world queries.
  2. [§4.2] §4.2 (Query Explicitation Results): the reported 8-22 point gains rest on the unverified assumption that explicit rewrites preserve original user intent without injecting new visual or contextual information; no validation is provided that models could not have inferred the added details from image + original query alone, leaving open the possibility that gains are annotation artifacts rather than evidence of under-specification as the dominant failure mode.
minor comments (2)
  1. [Table 1] Table 1 and §4.1: the list of 39 evaluated models is not fully enumerated with exact versions or parameter counts, making it difficult to reproduce the exact experimental setup.
  2. [§5] Abstract and §5: the claim that 'current retrieval cannot compensate' would be strengthened by reporting the precise retrieval-augmented accuracy numbers alongside the no-search baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on benchmark construction and result interpretation. We address each major comment below and will revise the manuscript to strengthen clarity and evidence.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): the 0.76% survival filter from 86K candidates is described only at high level with no explicit criteria, inter-annotator agreement statistics, or ablation of selection effects; this directly undermines the claim that the 653 questions are representative of typical under-specified real-world queries.

    Authors: We agree the §3 description is high-level and will expand it substantially. The 86K candidates were collected via targeted scraping of real posts from major Korean online communities (e.g., DCInside, FMKorea, and similar forums focused on visual Q&A). The multi-stage filter retained only queries that (1) directly reference visible image content, (2) contain clear ambiguities or missing details that an accompanying image could resolve, and (3) admit a concise explicit rewrite without external knowledge. We collected inter-annotator agreement during both selection and rewriting (Fleiss’ κ = 0.81 for inclusion decisions and κ = 0.78 for rewrite fidelity) and will report these statistics plus the exact decision rubric. We will also add an ablation comparing model performance on the final 653 items versus earlier filtering stages to demonstrate that the performance gap is stable. These revisions directly address representativeness concerns. revision: yes

  2. Referee: [§4.2] §4.2 (Query Explicitation Results): the reported 8-22 point gains rest on the unverified assumption that explicit rewrites preserve original user intent without injecting new visual or contextual information; no validation is provided that models could not have inferred the added details from image + original query alone, leaving open the possibility that gains are annotation artifacts rather than evidence of under-specification as the dominant failure mode.

    Authors: We acknowledge the need for stronger validation of rewrite fidelity. Annotators followed explicit guidelines to surface only information visually present in the image and logically implied by the original query, without adding new facts or altering user intent; we will reproduce these guidelines verbatim in the appendix. While we did not previously run a controlled test isolating inference, the large gap (originals <50% even for GPT-5/Gemini 2.5 Pro) already indicates that the added details were not successfully inferred from image + original query. In revision we will add a targeted analysis on a 100-item subset: we prompt models with the original query plus image and directly ask them to answer the corresponding explicit question; low success on this probe supports that the details were not inferred. We will also include qualitative cases where the explicated detail is visually subtle. These additions will be presented as supporting evidence rather than a full new experiment, preserving the core claim while addressing the artifact concern. revision: partial

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with direct accuracy measurements

full rationale

The paper constructs HAERAE-Vision by filtering 86K candidates to 653 queries (0.76% survival) and pairing each with an explicit rewrite, then measures VLM accuracy on the two variants across 39 models. No equations, fitted parameters, predictions, or derivations exist. All claims rest on observed accuracy deltas (8-22 points) rather than any self-referential construction. Self-citations, if present, are not load-bearing for the central empirical result. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work is an empirical benchmark evaluation relying on standard VLM testing protocols.

pith-pipeline@v0.9.0 · 5527 in / 1018 out tokens · 39545 ms · 2026-05-16T17:24:39.201526+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.