LVLMs and Humans Ground Differently in Referential Communication
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. We present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We show that LVLMs cannot interactively generate and resolve referring expressions in a way that enables smooth communication, a crucial skill that underlies human language use. We release our corpus of 356 dialogues (89 pairs over 4 rounds each) along with the online pipeline for data collection and the tools for analyzing accuracy, efficiency, and lexical overlap.
fields
cs.CL 1
years
2026 1
verdicts
UNVERDICTED 1
representative citing papers
Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue
Vision-language models overestimate common ground in asymmetric dialogues by treating map content as evidence of mutual understanding rather than tracking how grounding unfolds through interaction.