LVLMs and Humans Ground Differently in Referential Communication

Amie J. Paige; Dimitris Samaras; Gregory Zelinsky; Owen Rambow; Panagiotis Kaliosis; Peter Zeng; Susan E. Brennan; Weiling Li; Zhengxiang Wang

arXiv: 2601.19792 · v5 · pith:ZYNQQBUHnew · submitted 2026-01-27 · 💻 cs.CL · cs.AI · cs.HC

Peter Zeng, Weiling Li, Amie J. Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan E. Brennan, Owen Rambow

Pith reviewed 2026-05-16 10:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.HC

keywords referential communication · common ground · large vision-language models · referring expressions · human-AI interaction · dialogue coordination · grounding

The pith

Large vision-language models cannot interactively generate and resolve referring expressions to build common ground with humans or each other.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a controlled referential communication task where pairs of agents take turns describing and matching pictures of objects that lack standard names. Human-human pairs improve accuracy and efficiency across repeated rounds by negotiating shared descriptions, while any pair that includes an LVLM shows persistent gaps in coordination and lexical alignment. The results indicate that current LVLMs lack the pragmatic ability to track and update common ground during multi-turn dialogue, a skill the authors treat as foundational to effective language use. The study releases the full set of 356 dialogues together with analysis tools for accuracy, efficiency, and word overlap.

Core claim

In a factorial design with human-human, human-AI, AI-human, and AI-AI director-matcher pairs interacting over four rounds on non-lexicalized objects, LVLMs cannot generate and resolve referring expressions interactively in a manner that supports smooth, improving communication.

What carries the argument

The repeated-round director-matcher referential communication game using pictures of objects without obvious lexical labels, which forces participants to negotiate descriptions from scratch.

If this is right

Human-AI and AI-AI pairs will continue to exhibit lower accuracy and slower convergence than human-human pairs on tasks requiring on-the-fly reference negotiation.
LVLMs will produce referring expressions with lower lexical overlap and less adaptive reuse of prior descriptions across dialogue rounds.
Effective human-AI collaboration will remain limited until models can maintain and update representations of shared knowledge during extended interaction.
The released dialogue corpus and analysis pipeline will enable direct measurement of grounding deficits in future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training regimes that emphasize next-token prediction on static text may be insufficient to instill pragmatic grounding skills that emerge in live interaction.
Adding explicit mechanisms for tracking partner knowledge state could close the observed gap without requiring larger model scale.
The same task design could be used to test whether multimodal models improve when given feedback on whether their descriptions successfully identified the target object.

Load-bearing premise

That shortfalls in this particular matching task with novel objects reveal a general deficit in LVLMs' capacity to model common ground across other communication settings.

What would settle it

An experiment in which LVLM agents reach human levels of accuracy and coordination speed after the same number of rounds on the identical non-lexicalized object set.

Figures

Figures reproduced from arXiv: 2601.19792 by Amie J. Paige, Dimitris Samaras, Gregory Zelinsky, Owen Rambow, Panagiotis Kaliosis, Peter Zeng, Susan E. Brennan, Weiling Li, Zhengxiang Wang.

**Figure 1.** Repeated referring to two baskets (nonlexicalized objects) by a human-human pair in Rounds 1-4 of our experiment, with lexical overlap highlighted in blue. Entrainment on more concise language (a conceptual pact) occurs by Round 3, after they consider multiple proposals in Rounds 1-2.

**Figure 2.** Trends over four rounds for (from left to right) …

**Figure 3.** Complete stimulus set used in the task. (a) The 12 target baskets viewed by both the director and the …

**Figure 4.** Interface for the two-player collaborative game. The Director (left) sees the target order, while the Matcher …

**Figure 5.** Human-human partners explicitly acknowledging common ground, as they try to distinguish two similar …

**Figure 6.** Dialogue from an AI–AI pair. Unlike human pairs, both AI partners fail to exhibit lexical entrainment.

**Figure 7.** Dialogue from a human-AI pair in which the human director appears to try valiantly to entrain (flexibly …

**Figure 8.** In this dialogue from an AI-human pair, the human matcher struggles mightily to distinguish the same …

**Figure 9.** Prompt for extracting referring expressions for each target basket from a round transcript in our corpus.

**Figure 10.** Shared task instructions, prepended at the beginning of both the director and matcher’s system messages.

**Figure 11.** The base system message for the director, which contains core responsibilities as well as describing the …

**Figure 12.** Director: The pragmatically informed system message, in addition to the base prompt. This includes communication rules motivated by cognitive science theory, as well as scaffolding and structured output to support state updates during the task.

**Figure 13.** Matcher: The base system message for the matcher, which contains core responsibilities, as well as describing the visual context that’s provided to the LVLM. AUTHORITATIVE CURRENT MATCHER SEQUENCE STATE (for this turn): - There are 12 positions total. - `sequence_candidate_indices` is a length-12 array aligned to positions 1..12. - A value of null means that position is EMPTY/unfilled right now. - Default…

**Figure 14.** Matcher: Sequence state system message to track the current selected sequence per turn.

**Figure 15.** Matcher: The pragmatically informed system message, in addition to the base prompt.

**Figure 16.** Matcher: Scaffolding and structured output in order to handle state updates in the task, as well as using zero-shot chain-of-thought prompting (Kojima et al., 2022).
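
The Figure 13 caption quotes part of the matcher's sequence-state message: 12 positions in total, a length-12 `sequence_candidate_indices` array aligned to positions 1..12, and null marking a position that is currently empty. A minimal Python sketch of that representation is given here for orientation. Only the field name and the null-means-empty convention come from the quoted text; the helper names `new_state`, `place`, and `empty_positions` are hypothetical, and the paper's pipeline may well represent the state differently.

```python
from typing import List, Optional

NUM_POSITIONS = 12  # "There are 12 positions total" (from the quoted prompt text)

# Assumed in-memory form of `sequence_candidate_indices`:
# one entry per position, with None standing in for JSON null (empty).
SequenceState = List[Optional[int]]


def new_state() -> SequenceState:
    """Return an all-empty state: every position is unfilled (None)."""
    return [None] * NUM_POSITIONS


def place(state: SequenceState, position: int, candidate_index: int) -> SequenceState:
    """Return a copy of the state with a candidate placed at a 1-based position."""
    if not 1 <= position <= NUM_POSITIONS:
        raise ValueError(f"position must be in 1..{NUM_POSITIONS}, got {position}")
    updated = list(state)
    updated[position - 1] = candidate_index  # positions are 1..12, the list is 0-based
    return updated


def empty_positions(state: SequenceState) -> List[int]:
    """List the 1-based positions that are still empty."""
    return [i + 1 for i, value in enumerate(state) if value is None]


# Hypothetical usage: the matcher places candidate 7 at position 3.
state = place(new_state(), position=3, candidate_index=7)
print(state)
print(empty_positions(state))
```

Returning a copy from `place` rather than mutating in place is a design choice of this sketch: it keeps each turn's state separate, which makes it easy to log the sequence per turn.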

Original abstract

For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. We present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We show that LVLMs cannot interactively generate and resolve referring expressions in a way that enables smooth communication, a crucial skill that underlies human language use. We release our corpus of 356 dialogues (89 pairs over 4 rounds each) along with the online pipeline for data collection and the tools for analyzing accuracy, efficiency, and lexical overlap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

LVLMs lag in interactive referential grounding on novel objects, but the released corpus is the real deliverable while the causal link to common ground stays loose.

The main point is that this experiment finds LVLMs in human-AI and AI-AI pairs do worse than human-human pairs at matching non-lexicalized objects over repeated rounds, which the authors tie to weaker common-ground modeling. The factorial design across all four pair types and the repeated-round structure are straightforward ways to surface those differences. Releasing the 356-dialogue corpus plus the collection pipeline and analysis tools is the clearest contribution; anyone studying multimodal reference can actually use the data. The focus on accuracy, efficiency, and lexical overlap gives multiple measurable angles on the same interactions.

The evidence remains thin on the numbers. The abstract states the conclusion without reporting accuracy rates, efficiency metrics, statistical tests, or error bars, so the size and reliability of the gaps are hard to judge. The stress-test concern holds up: because the stimuli are only novel shapes, the performance drop could stem from weaker visual feature handling or multi-turn prompt sensitivity rather than a specific failure to track common ground. No lexicalized control condition or vision-only baseline is described that would separate those factors.

This work is for researchers building collaborative multimodal agents or studying human-AI dialogue. A reader who needs a new referential corpus will get immediate value even if the interpretation needs tightening. It deserves peer review because the setup is clean and the data release is useful, though the draft will need quantitative results and tighter controls before the central claim lands solidly.

Referee Report

3 major / 1 minor

Summary. The paper presents a referential communication experiment with a factorial design comparing director-matcher pairs across human-human, human-AI, AI-human, and AI-AI conditions. Participants interact over multiple turns in repeated rounds to match pictures of non-lexicalized objects. The central claim is that LVLMs cannot interactively generate and resolve referring expressions to enable smooth communication in the way humans do, and the authors release a corpus of 356 dialogues along with analysis tools for accuracy, efficiency, and lexical overlap.

Significance. If the performance gaps are robustly demonstrated and attributable to common-ground modeling deficits, the work identifies a practically important limitation for deploying LVLMs in collaborative human-AI settings. The public release of the dialogue corpus and pipeline is a clear strength that supports reproducibility and follow-up studies on grounded interaction.

major comments (3)

[Abstract] The central claim that LVLMs 'cannot interactively generate and resolve referring expressions' is presented without any quantitative results, accuracy/efficiency metrics, statistical tests, or error bars, leaving the empirical support for the conclusion difficult to evaluate from the provided summary.
[Methods/Experimental Design] The attribution of AI-AI and human-AI gaps specifically to failures in modeling common ground is under-supported because the task uses only non-lexicalized objects; without lexicalized control conditions or visual-only baselines, gaps could instead arise from weaker visual feature extraction for novel shapes or prompt sensitivity in multi-turn dialogue.
[Results] No details are given on how accuracy and efficiency were operationalized or on the magnitude of differences across the four pair types, making it impossible to assess whether the observed patterns are large enough to support the strong claim about LVLMs' inability to model common ground.

minor comments (1)

[Abstract] The abstract mentions 'lexical overlap' as an analysis dimension but does not define the metric or how it relates to common-ground use.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we will make to improve the clarity and robustness of our claims.

Referee: [Abstract] The central claim that LVLMs 'cannot interactively generate and resolve referring expressions' is presented without any quantitative results, accuracy/efficiency metrics, statistical tests, or error bars, leaving the empirical support for the conclusion difficult to evaluate from the provided summary.

Authors: We agree that the abstract would benefit from including key quantitative findings to support the central claim. In the revised version, we will update the abstract to include specific accuracy and efficiency metrics from our experiments, along with notes on statistical significance, while maintaining the abstract's brevity.
revision: yes

Referee: [Methods/Experimental Design] The attribution of AI-AI and human-AI gaps specifically to failures in modeling common ground is under-supported because the task uses only non-lexicalized objects; without lexicalized control conditions or visual-only baselines, gaps could instead arise from weaker visual feature extraction for novel shapes or prompt sensitivity in multi-turn dialogue.

Authors: The choice of non-lexicalized objects is central to the experimental design, as it requires participants to interactively establish referring expressions without relying on conventional labels, directly testing common ground formation. This follows established paradigms in referential communication research. That said, we recognize the potential for confounds related to visual processing or prompt handling. We will revise the methods and discussion sections to explicitly address these possibilities and include additional analysis or caveats regarding visual feature extraction.
revision: partial

Referee: [Results] No details are given on how accuracy and efficiency were operationalized or on the magnitude of differences across the four pair types, making it impossible to assess whether the observed patterns are large enough to support the strong claim about LVLMs' inability to model common ground.

Authors: We apologize if the operationalization was not sufficiently clear in the results section. Accuracy is defined as the proportion of successful matches per round, and efficiency as the average number of turns required and measures of lexical overlap between director and matcher utterances. The manuscript reports substantial differences, with human-human pairs showing higher accuracy and efficiency compared to conditions involving LVLMs. We will revise the results section to provide more explicit definitions of these metrics at the outset and to highlight the magnitude of the differences with appropriate statistical tests and visualizations.
revision: yes
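
The operationalization stated in the last simulated response (accuracy as the proportion of successful matches per round, efficiency in terms of the number of turns required) can be sketched in Python. This is an illustration under stated assumptions, not the authors' analysis code: the `RoundRecord` type, its field names, and the toy values are invented here and do not reflect the released corpus format or any reported result.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Sequence


@dataclass
class RoundRecord:
    """Hypothetical record of one round; not the released corpus schema."""
    target_order: Sequence[int]             # director's target order
    matcher_order: Sequence[Optional[int]]  # matcher's final sequence, None = unfilled
    num_turns: int                          # dialogue turns used in the round


def round_accuracy(record: RoundRecord) -> float:
    """Proportion of positions where the matcher's choice equals the target."""
    if len(record.target_order) != len(record.matcher_order):
        raise ValueError("target and matcher sequences must have equal length")
    hits = sum(t == m for t, m in zip(record.target_order, record.matcher_order))
    return hits / len(record.target_order)


def mean_turns(records: List[RoundRecord]) -> float:
    """Average number of turns per round (one possible efficiency measure)."""
    return mean(r.num_turns for r in records)


# Toy data with 4 positions instead of 12, invented purely for illustration.
rounds = [
    RoundRecord(target_order=[0, 1, 2, 3], matcher_order=[0, 1, 3, 2], num_turns=20),
    RoundRecord(target_order=[2, 0, 3, 1], matcher_order=[2, 0, 3, 1], num_turns=10),
]
for number, record in enumerate(rounds, start=1):
    print(f"round {number}: accuracy={round_accuracy(record):.2f}")
print(f"mean turns per round: {mean_turns(rounds)}")
```

An unfilled position (`None`) never equals a target index, so it counts as a miss in this sketch; whether the paper scores incomplete sequences the same way is not stated in the text above.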

Circularity Check

0 steps flagged

No significant circularity: purely empirical observational study

The paper reports results from a factorial referential communication experiment comparing human-human, human-AI, AI-human, and AI-AI dialogue pairs on non-lexicalized object matching across multiple turns. No equations, parameters, derivations, or predictive models are present; performance metrics (accuracy, efficiency, lexical overlap) are computed directly from collected dialogues without any fitting step that could reduce a claimed prediction to its own inputs. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim rests on observed performance gaps rather than any chain that collapses by construction to prior fitted quantities or self-referential definitions. This is a standard empirical comparison whose validity can be assessed against the released corpus and pipeline without internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the chosen referential task isolates common-ground building ability; no free parameters or invented entities are introduced.

axioms (1)

domain assumption The referential communication task with non-lexicalized objects measures the ability to build common ground interactively.
This premise is required to interpret performance differences as evidence of grounding deficits in LVLMs.

pith-pipeline@v0.9.0 · 5461 in / 1171 out tokens · 25071 ms · 2026-05-16T10:31:52.892510+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We present a referential communication experiment with a factorial design involving director-matcher pairs... to match pictures of objects not associated with any obvious lexicalized labels."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Lexical entrainment... RLO(b)_i = |Inters(Tok(RE(b)_{i-1}), Tok(RE(b)_i))| / |Tok(RE(b)_i)|"
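
The overlap formula quoted in the passage above, RLO(b)_i = |Inters(Tok(RE(b)_{i-1}), Tok(RE(b)_i))| / |Tok(RE(b)_i)|, can be read as the share of tokens in round i's referring expression for basket b that already appeared in round i-1's expression. A minimal Python sketch of that reading follows. Several choices are assumptions made here because the quoted passage does not specify them: `tok` lowercases and returns a set of word types, `Inters` is taken to be plain set intersection, and an empty current expression is scored as 0.0. The example expressions are invented, not taken from the corpus.

```python
import re
from typing import List, Set


def tok(expression: str) -> Set[str]:
    """Assumed tokenizer: lowercase word types (a set, so repeats count once)."""
    return set(re.findall(r"[a-z0-9']+", expression.lower()))


def rlo(previous_re: str, current_re: str) -> float:
    """Share of the current expression's tokens that also occur in the previous one."""
    current = tok(current_re)
    if not current:
        return 0.0  # assumption: overlap is taken as 0 for an empty expression
    return len(tok(previous_re) & current) / len(current)


def rlo_by_round(expressions: List[str]) -> List[float]:
    """RLO for rounds 2..n, given one referring expression per round for one basket."""
    return [rlo(prev, cur) for prev, cur in zip(expressions, expressions[1:])]


# Invented referring expressions for a single basket across four rounds.
basket = [
    "the tall basket with a looped handle",
    "the tall one with the loop handle",
    "tall loop",
    "tall loop",
]
print(rlo_by_round(basket))
```

Because the denominator is the current round's token count, a pair that shortens its description while reusing earlier words scores high, which fits the entrainment pattern described in the Figure 1 caption. A multiset-based or lemmatized variant would give different values.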

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue
cs.CL 2026-06 unverdicted novelty 6.0

Vision-language models overestimate common ground in asymmetric dialogues by treating map content as evidence of mutual understanding rather than tracking how grounding unfolds through interaction.