Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests

Hendrik Buschmeier; Judith Sieker; Manar Ali; Sina Zarrieß

arxiv: 2601.07820 · v2 · pith:V66ZROCGnew · submitted 2026-01-12 · 💻 cs.CL

Manar Ali, Judith Sieker, Sina Zarrieß, Hendrik Buschmeier

Pith reviewed 2026-05-21 15:42 UTC · model grok-4.3

keywords: reference games, clarification requests, model uncertainty, vision-language models, alignment, human-AI interaction

The pith

Vision-language models struggle to recognize their uncertainty and request clarification in reference games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reference games offer a controlled testbed for checking if models can recognize their own uncertainty and express it by requesting clarification, much like humans do in conversation. This matters for developing models that can maintain mutual understanding in interactions rather than just responding passively. The authors compare a baseline reference resolution task with one where models are explicitly instructed to clarify when uncertain, using three vision-language models. Results indicate persistent struggles in aligning uncertainty with clarification actions, highlighting the testbed's utility for probing interaction qualities.

Core claim

The central claim is that reference games are a suitable testbed for investigating whether language models can assume a listener role by recognizing their own uncertainty and expressing it through clarification requests. Evaluating models in a baseline condition and an instructed clarification condition shows that they often struggle to recognize internal uncertainty and translate it into adequate clarification behavior.

What carries the argument

Reference games carry the argument: they are controlled, self-contained tasks that make clarification needs explicit and measurable, and so serve as the mechanism for testing uncertainty recognition in models.
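
To illustrate what "explicit and measurable" could mean here, the following is a minimal sketch, not the paper's stimuli or code. It assumes a toy encoding in which each candidate object is a set of attribute words and a referring expression is the set of attributes the speaker mentions; the names `Trial`, `matching_candidates`, and `needs_clarification` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One toy reference-game trial (hypothetical encoding, not the paper's)."""
    candidates: tuple      # each candidate is a frozenset of attribute words
    expression: frozenset  # attributes mentioned by the speaker


def matching_candidates(trial):
    """Indices of candidates that have every attribute the speaker mentioned."""
    return [i for i, c in enumerate(trial.candidates) if trial.expression <= c]


def needs_clarification(trial):
    """True when the expression does not single out exactly one candidate."""
    return len(matching_candidates(trial)) != 1


objects = (frozenset({"red", "circle"}), frozenset({"blue", "circle"}))
clear = Trial(candidates=objects, expression=frozenset({"red", "circle"}))
ambiguous = Trial(candidates=objects, expression=frozenset({"circle"}))

print(needs_clarification(clear), needs_clarification(ambiguous))
```

Under this assumed encoding, whether a trial calls for clarification is a property of the stimulus itself, so a model's decision to ask or not ask can be scored against it.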

If this is right

Reference games allow for explicit measurement of clarification needs in model interactions.
Even simple tasks reveal models' difficulties in uncertainty recognition.
Models do not consistently assume an active listener role in maintaining understanding.
This setup demonstrates the value for testing interaction qualities of vision-language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The testbed could extend to evaluating uncertainty in more open-ended dialogues.
Training might benefit from explicit uncertainty modeling to improve clarification.
This aligns with broader goals of making AI more robust in collaborative tasks.

Load-bearing premise

That the instruction to request clarification when uncertain measures genuine internal uncertainty recognition rather than just instruction following.

What would settle it

A direct comparison showing no correlation between model error rates on referent identification and the rate or appropriateness of clarification requests would falsify the alignment claim.
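
One way such a comparison might be computed is sketched below. It assumes per-item binary records, which the page does not describe: whether the model chose the wrong referent in the baseline condition, and whether it requested clarification on the same item in the instructed condition. The phi coefficient is used here as one possible association measure; the function name and the example data are illustrative only.

```python
import math


def phi_coefficient(xs, ys):
    """Phi coefficient, i.e. Pearson correlation for two binary sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n11 = sum(1 for x, y in zip(xs, ys) if x and y)
    n10 = sum(1 for x, y in zip(xs, ys) if x and not y)
    n01 = sum(1 for x, y in zip(xs, ys) if not x and y)
    n00 = sum(1 for x, y in zip(xs, ys) if not x and not y)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # undefined when either variable is constant
    return (n11 * n00 - n10 * n01) / denom


# Made-up example data: 1 = baseline error / clarification requested.
baseline_error = [1, 1, 0, 0, 1, 0]
clarified = [1, 1, 0, 0, 0, 0]
print(round(phi_coefficient(baseline_error, clarified), 3))
```

A value near zero would indicate that clarification requests are unrelated to where the model actually errs; a clearly positive value would indicate that requests track difficulty.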

Original abstract

In human conversation, both interlocutors play an active role in maintaining mutual understanding. When listeners are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar listener role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a suitable testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Reference games are pitched as a testbed for model uncertainty and clarification but the comparison leaves open whether low clarification rates show detection failure or just weak instruction following.

Reference games make sense as a testbed for how models handle uncertainty through clarification requests, but this paper's results don't clearly show whether models fail to recognize uncertainty or simply don't follow the clarification instruction. The authors evaluate three vision-language models in a reference game setup. They compare a standard reference resolution condition to one where models are instructed to ask for clarification when uncertain. The finding is that models don't clarify much, pointing to struggles with uncertainty.

This is new because it specifically uses reference games to test alignment between internal uncertainty and clarification behavior, building on separate lines of work in reference resolution and dialogue. The controlled environment is a plus since it makes clarification needs explicit and measurable.

A soft spot is the missing independent measure of uncertainty. Low clarification in the instructed condition could reflect poor instruction following rather than undetected uncertainty. The paper would be stronger with something like probability entropy or multiple sampling to confirm uncertainty levels. Details on exact metrics and sample sizes are also thin based on what's reported.

This paper is for researchers in conversational AI and model alignment who want practical testbeds for dialogue qualities. Readers focused on improving reliability in vision-language models could get ideas from the setup. It deserves peer review because the testbed concept is useful and the basic comparison is doable, even if the current evidence needs bolstering on the uncertainty measurement.

Referee Report

2 major / 2 minor

Summary. The paper proposes reference games as a controlled testbed for assessing whether vision-language models can recognize their own uncertainty and express it via clarification requests. It evaluates three VLMs by comparing a baseline reference-resolution task against a condition in which models are instructed to request clarification when uncertain, concluding that models often fail to align internal uncertainty with clarification behavior.

Significance. If the central empirical pattern holds under improved measurement, the work supplies a simple, falsifiable protocol for probing interactive uncertainty handling in VLMs and demonstrates the utility of reference games for studying alignment between model confidence and communicative acts.

major comments (2)

[Abstract] Abstract: the reported comparative results on three models supply no metrics, sample sizes, statistical tests, or controls, rendering the evidential basis for the claim that models 'struggle to recognize internal uncertainty' preliminary and difficult to evaluate.
[Results] Results section: the design contrasts a baseline condition with an instructed clarification condition but provides no independent, prompt-independent measure of uncertainty (e.g., entropy over referent distributions, cross-sample consistency, or token-level confidence). Consequently, low clarification rates cannot be unambiguously attributed to failure to recognize uncertainty rather than to incomplete instruction following.
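
As a rough illustration of the first proxy the referee names, the sketch below computes Shannon entropy over a distribution across candidate referents. It assumes such a distribution is available (for example from answer-token probabilities or from the relative frequency of sampled answers), which the page does not confirm for the paper; the function names are hypothetical.

```python
import math


def referent_entropy(probs):
    """Shannon entropy in bits of a (possibly unnormalised) referent distribution."""
    total = sum(probs)
    if total <= 0 or any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative and sum to > 0")
    normalised = [p / total for p in probs]
    # Terms with p == 0 contribute nothing, by the usual 0 * log 0 = 0 convention.
    return sum(-p * math.log2(p) for p in normalised if p > 0)


def normalised_entropy(probs):
    """Entropy divided by its maximum, log2(k), so values lie in [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    return referent_entropy(probs) / math.log2(k)


print(referent_entropy([0.5, 0.5]))       # two equally likely referents
print(normalised_entropy([0.7, 0.2, 0.1]))
```

Normalising by the number of candidates makes scores comparable across trials with different numbers of objects, which matters if the stimulus sets vary in size.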

minor comments (2)

[Methods] Methods: provide the exact wording of the clarification instruction prompt and the full set of reference-game stimuli so that the instructed condition can be replicated.
[Results] Table 1 or equivalent results table: report raw clarification rates alongside any uncertainty proxies and include confidence intervals or significance tests for the baseline-versus-instructed contrast.
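
The kind of reporting requested in the last comment might look like the sketch below: a Wilson score interval for a single rate and a pooled two-proportion z-test for a contrast between two conditions. Both are standard textbook choices rather than anything the paper is known to use, and the counts are invented. The z-test treats the two samples as independent; if both conditions use the same items, a paired test such as McNemar's would likely be more appropriate.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def two_proportion_z(s1, n1, s2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # both rates are 0 or both are 1: no difference to test
    z = (p1 - p2) / se
    return z, math.erfc(abs(z) / math.sqrt(2))


# Invented counts: 30 of 100 vs 10 of 100 trials.
low, high = wilson_interval(30, 100)
z, p_value = two_proportion_z(30, 100, 10, 100)
print(round(low, 3), round(high, 3), round(z, 2), p_value < 0.05)
```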

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will implement to improve the empirical rigor of the manuscript.

Referee: [Abstract] Abstract: the reported comparative results on three models supply no metrics, sample sizes, statistical tests, or controls, rendering the evidential basis for the claim that models 'struggle to recognize internal uncertainty' preliminary and difficult to evaluate.

Authors: We agree that the abstract would be strengthened by including key quantitative details. In the revised version, we will update the abstract to report clarification request rates for each of the three models in both conditions, along with sample sizes and a brief reference to the statistical comparisons performed. The full metrics, tests, and controls are already detailed in the Results section; the abstract revision will make this evidential basis more immediately apparent without exceeding length constraints.
Revision: yes

Referee: [Results] Results section: the design contrasts a baseline condition with an instructed clarification condition but provides no independent, prompt-independent measure of uncertainty (e.g., entropy over referent distributions, cross-sample consistency, or token-level confidence). Consequently, low clarification rates cannot be unambiguously attributed to failure to recognize uncertainty rather than to incomplete instruction following.

Authors: This is a fair and important observation. Our current analysis relies on the performance contrast between conditions to infer uncertainty alignment, but we recognize that an explicit, prompt-independent uncertainty metric would strengthen causal attribution. We will revise the Results section to include such measures, for instance entropy over the model's referent probability distribution and cross-sample consistency scores computed on the baseline task. These additions will help distinguish failures of uncertainty recognition from potential instruction-following limitations.
Revision: yes
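
A cross-sample consistency score of the kind mentioned in this response could be sketched as follows. The sketch assumes the baseline task is sampled several times per item and that each sample yields a referent label; the 0.8 threshold and both function names are arbitrary illustrations, not values or code from the paper.

```python
from collections import Counter


def cross_sample_consistency(choices):
    """Share of sampled answers that agree with the most frequent answer."""
    if not choices:
        raise ValueError("need at least one sampled answer")
    return Counter(choices).most_common(1)[0][1] / len(choices)


def alignment_table(consistencies, clarified, threshold=0.8):
    """Count items by (uncertain?, clarified?); threshold is an assumed cut-off."""
    table = Counter()
    for score, asked in zip(consistencies, clarified):
        table[(score < threshold, bool(asked))] += 1
    return table


# Invented samples for three items, plus whether the model asked in the instructed run.
samples = [["A", "A", "B", "A"], ["C", "C", "C", "C"], ["A", "B", "A", "B"]]
scores = [cross_sample_consistency(s) for s in samples]
table = alignment_table(scores, clarified=[True, False, False])
print(scores, dict(table))
```

In the resulting table, the `(True, False)` cell counts items where the model looks uncertain by this proxy but did not ask. Those are the cases that would point toward a failure to act on uncertainty, while a model that rarely asks even when instructed, regardless of consistency, would look more like an instruction-following problem.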

Circularity Check

0 steps flagged

Empirical comparison of instructed vs baseline conditions with no definitional or fitted reduction

The paper conducts a direct empirical evaluation of three vision-language models on reference games, contrasting a baseline reference-resolution task against an instructed condition that explicitly tells models to request clarification when uncertain. No equations, parameter fitting, self-citations, or ansatzes are invoked as load-bearing steps in the provided text; the results are reported as observed differences in clarification behavior between the two experimental arms. This setup is self-contained against the external benchmark of the reference game itself and does not reduce any claimed prediction or result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that reference games make clarification needs explicit and measurable, plus standard practices for evaluating vision-language models on reference resolution.

axioms (1)

domain assumption Reference games are controlled, self-contained, and make clarification needs explicit and measurable.
Invoked to establish the testbed's suitability for studying uncertainty alignment.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

LLMs perform substantially better as pragmatic listeners judging language than as speakers generating it, revealing weak alignment between the two roles.