MeetUp! A Corpus of Joint Activity Dialogues in a Visual Environment
Pith reviewed 2026-05-24 23:25 UTC · model grok-4.3
The pith
MeetUp! is a two-player game that generates dialogues requiring both visual and conversational grounding to achieve mutual understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MeetUp! is a two-player coordination game in which players move through a visual environment with the objective of finding each other. To succeed, they must talk about what they see and reach mutual understanding; the resulting dialogues exhibit the targeted joint-activity phenomena while also challenging language-and-vision processing.
What carries the argument
The MeetUp! task, a two-player coordination game in a visual environment that requires players to describe observations and reach mutual understanding to locate each other.
If this is right
- Models trained on the corpus must handle discourse-level representations that track shared visual knowledge across turns.
- The task highlights limitations in current vision-language systems when coordination requires ongoing reference resolution.
- The corpus provides data for studying how speakers establish common ground in dynamic visual settings.
Where Pith is reading between the lines
- Similar game designs could be adapted to test whether models can maintain consistent spatial models across longer interactions.
- The approach connects to questions about how dialogue systems might scale to multi-agent embodied scenarios.
- If the dialogues prove rich, they could serve as a benchmark for measuring progress toward conversational agents that operate in shared physical spaces.
Load-bearing premise
The game mechanics will reliably force players into joint activity and grounding behaviors rather than allowing simpler or less interesting strategies.
What would settle it
If the collected dialogues turn out to consist mostly of short commands or location reports without evidence of mutual understanding or detailed visual descriptions, the claim that the task elicits the targeted phenomena would not hold.
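The check described above can be made concrete with a few summary statistics. The following is a minimal sketch, not the authors' analysis: the `Turn` record, the dialogue-act labels (`"describe"`, `"clarify"`, `"command"`), and the three-token cutoff for a "short" utterance are all assumptions made for illustration, and the example dialogue is invented rather than drawn from the corpus.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    text: str
    act: str  # hypothetical dialogue-act label, e.g. "describe", "clarify", "command"


def dialogue_stats(turns, short_cutoff=3):
    """Summarise one dialogue: mean utterance length in whitespace tokens,
    share of clarification requests, and share of very short utterances.
    The cutoff of 3 tokens is an arbitrary illustrative choice."""
    n = len(turns)
    if n == 0:
        return {"mean_tokens": 0.0, "clarify_rate": 0.0, "short_rate": 0.0}
    token_counts = [len(t.text.split()) for t in turns]
    return {
        "mean_tokens": sum(token_counts) / n,
        "clarify_rate": sum(t.act == "clarify" for t in turns) / n,
        "short_rate": sum(c <= short_cutoff for c in token_counts) / n,
    }


# Invented toy dialogue, used only to exercise the function.
example = [
    Turn("A", "I am in a kitchen with a white fridge", "describe"),
    Turn("B", "Is the fridge next to a window?", "clarify"),
    Turn("A", "Yes, on the left", "describe"),
    Turn("B", "go north", "command"),
]
stats = dialogue_stats(example)
print(stats)
```

Under these assumptions, a corpus dominated by a high `short_rate` and a near-zero `clarify_rate` would count against the claim, whereas longer descriptive turns with regular clarification requests would support it. Any real analysis would need the corpus's own annotation scheme in place of the hypothetical labels used here.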
Original abstract
Building computer systems that can converse about their visual environment is one of the oldest concerns of research in Artificial Intelligence and Computational Linguistics (see, for example, Winograd's 1972 SHRDLU system). Only recently, however, have methods from computer vision and natural language processing become powerful enough to make this vision seem more attainable. Pushed especially by developments in computer vision, many data sets and collection environments have recently been published that bring together verbal interaction and visual processing. Here, we argue that these datasets tend to oversimplify the dialogue part, and we propose a task, MeetUp!, that requires both visual and conversational grounding, and that makes stronger demands on representations of the discourse. MeetUp! is a two-player coordination game where players move in a visual environment, with the objective of finding each other. To do so, they must talk about what they see, and achieve mutual understanding. We describe a data collection and show that the resulting dialogues indeed exhibit the dialogue phenomena of interest, while also challenging the language & vision aspect.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MeetUp!, a two-player coordination game in a visual environment where participants must locate each other through dialogue about observed scenes, thereby requiring visual and conversational grounding. It describes the data collection setup and asserts that the collected dialogues exhibit joint activity, mutual understanding, and visual grounding phenomena while posing challenges for language-and-vision models, positioning the corpus as an advance over prior simplified datasets.
Significance. If the collected dialogues reliably demonstrate the claimed grounding and coordination behaviors, the corpus would offer a valuable resource for developing and evaluating models that handle mutual understanding in visually grounded settings, addressing limitations in existing datasets that oversimplify dialogue.
major comments (1)
- [Abstract / data collection] The central claim that 'the resulting dialogues indeed exhibit the dialogue phenomena of interest' is asserted without quantitative metrics (e.g., rates of clarification requests, successful grounding acts, or comparison to baseline non-interactive strategies), qualitative examples, or analysis of how the game rules block simpler one-sided descriptions. This leaves the weakest assumption untested and undermines evaluation of whether the task elicits the targeted behaviors.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and data collection description. We address the major comment below.
- Referee: [Abstract / data collection] The central claim that 'the resulting dialogues indeed exhibit the dialogue phenomena of interest' is asserted without quantitative metrics (e.g., rates of clarification requests, successful grounding acts, or comparison to baseline non-interactive strategies), qualitative examples, or analysis of how the game rules block simpler one-sided descriptions. This leaves the weakest assumption untested and undermines evaluation of whether the task elicits the targeted behaviors.
Authors: We agree that the abstract states the claim without accompanying quantitative support or explicit analysis of the game rules' role in eliciting joint activity. The body of the manuscript provides a description of the collection setup and some illustrative dialogue excerpts demonstrating the phenomena, but these are not summarized quantitatively in the abstract or data collection section, nor is there a direct comparison to non-interactive baselines. We will revise the abstract and data collection section to include quantitative metrics on phenomena such as clarification requests and successful grounding acts, along with an analysis of how the game rules require mutual understanding. This will strengthen the substantiation of the claim.
Revision: yes
Circularity Check
No circularity: empirical corpus claim with no derivations or self-referential reductions
This is a data-collection paper describing a two-player game and the resulting dialogues. The central claim is an empirical assertion that the collected data exhibits joint activity, mutual understanding, and visual grounding. No equations, fitted parameters, predictions, or derivation chains exist that could reduce to inputs by construction. No self-citations are load-bearing for any mathematical result. The analysis is self-contained as an observational report on human data.