MeetUp! A Corpus of Joint Activity Dialogues in a Visual Environment

David Schlangen; Nikolai Ilinykh; Sina Zarrieß

arxiv: 1907.05084 · v1 · pith:SWVFZ6ZKnew · submitted 2019-07-11 · 💻 cs.CL · cs.CV

Pith reviewed 2026-05-24 23:25 UTC · model grok-4.3

keywords: visual dialogue, joint activity, grounding, coordination game, corpus collection, mutual understanding, language and vision

The pith

MeetUp! is a two-player game that generates dialogues requiring both visual and conversational grounding to achieve mutual understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing visual dialogue datasets tend to oversimplify the dialogue component. The paper introduces MeetUp!, a coordination game in which two players move through a shared visual environment with the goal of finding each other. Players must describe what they see and reach mutual understanding through conversation. The collected dialogues are shown to exhibit joint activity phenomena while posing challenges for language and vision processing. This setup makes stronger demands on discourse representations than prior tasks.

Core claim

MeetUp! is a two-player coordination game in which players move through a visual environment with the objective of finding each other. To succeed, they must talk about what they see and achieve mutual understanding. The resulting dialogues exhibit the targeted joint activity phenomena while also challenging the language and vision side of the task.

What carries the argument

The MeetUp! task, a two-player coordination game in a visual environment that requires players to describe observations and reach mutual understanding to locate each other.
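The task structure described above can be made concrete with a toy model. The following is a minimal sketch under stated assumptions, not the authors' implementation: the room map, the direction names, and the `Game` class with its `move`, `say`, and `met` methods are all hypothetical, and rooms are plain labels rather than images. It only illustrates that each player knows their own position, that chat is the sole channel between players, and that success means ending up in the same place.

```python
from dataclasses import dataclass, field

# Hypothetical toy map: room name -> {direction: neighbouring room}.
# In the actual task the environment is visual; here a room is just a label.
ROOMS = {
    "kitchen": {"east": "hallway"},
    "hallway": {"west": "kitchen", "east": "bedroom"},
    "bedroom": {"west": "hallway"},
}


@dataclass
class Game:
    """Toy two-player coordination game (illustrative assumption only)."""

    positions: dict  # player name -> current room
    transcript: list = field(default_factory=list)

    def move(self, player, direction):
        """Move a player if the exit exists; otherwise stay put. Returns the room."""
        here = self.positions[player]
        self.positions[player] = ROOMS[here].get(direction, here)
        return self.positions[player]

    def say(self, player, text):
        """Record a chat utterance; chat is the only channel between players."""
        self.transcript.append((player, text))

    def met(self):
        """True once every player is in the same room."""
        return len(set(self.positions.values())) == 1


game = Game({"A": "kitchen", "B": "bedroom"})
game.say("A", "I see a stove and a window on the left")
game.move("A", "east")
game.move("B", "west")
print(game.met())
```

Because neither player can observe the other's position directly in this sketch, any agreement that they have met has to be reached through the transcript, which is the property the task relies on.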

If this is right

Models trained on the corpus must handle discourse-level representations that track shared visual knowledge across turns.
The task highlights limitations in current vision-language systems when coordination requires ongoing reference resolution.
The corpus provides data for studying how speakers establish common ground in dynamic visual settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar game designs could be adapted to test whether models can maintain consistent spatial models across longer interactions.
The approach connects to questions about how dialogue systems might scale to multi-agent embodied scenarios.
If the dialogues prove rich, they could serve as a benchmark for measuring progress toward conversational agents that operate in shared physical spaces.

Load-bearing premise

The game mechanics will reliably force players into joint activity and grounding behaviors rather than allowing simpler or less interesting strategies.

What would settle it

If the collected dialogues turn out to consist mostly of short commands or location reports without evidence of mutual understanding or detailed visual descriptions, the claim that the task elicits the targeted phenomena would not hold.
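One way such a check could be approached is with simple surface statistics over the transcripts. The sketch below is an assumption-laden illustration, not an analysis from the paper: the function name `dialogue_stats`, the cue list, and the rule "ends with a question mark and contains a cue" are all hypothetical, and a keyword heuristic of this kind would only give a rough first signal before manual annotation.

```python
import re

# Hypothetical surface cues for clarification requests (illustrative only).
CLARIFICATION_CUES = re.compile(
    r"\b(which|what do you mean|do you mean|sorry|pardon)\b", re.IGNORECASE
)


def dialogue_stats(utterances):
    """Return mean utterance length in tokens and a crude clarification rate."""
    if not utterances:
        return {"mean_tokens": 0.0, "clarification_rate": 0.0}
    lengths = [len(u.split()) for u in utterances]
    # Count utterances that are questions and contain one of the cue phrases.
    clarifications = sum(
        1
        for u in utterances
        if u.strip().endswith("?") and CLARIFICATION_CUES.search(u)
    )
    return {
        "mean_tokens": sum(lengths) / len(lengths),
        "clarification_rate": clarifications / len(utterances),
    }


example = [
    "I am in a kitchen with a red stove",
    "which kitchen do you mean?",
    "go north",
]
print(dialogue_stats(example))
```

Very low mean length together with a near-zero clarification rate would point toward the "short commands" outcome described above; higher values would not by themselves demonstrate grounding, only motivate a closer look.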

Original abstract

Building computer systems that can converse about their visual environment is one of the oldest concerns of research in Artificial Intelligence and Computational Linguistics (see, for example, Winograd's 1972 SHRDLU system). Only recently, however, have methods from computer vision and natural language processing become powerful enough to make this vision seem more attainable. Pushed especially by developments in computer vision, many data sets and collection environments have recently been published that bring together verbal interaction and visual processing. Here, we argue that these datasets tend to oversimplify the dialogue part, and we propose a task, MeetUp!, that requires both visual and conversational grounding, and that makes stronger demands on representations of the discourse. MeetUp! is a two-player coordination game where players move in a visual environment, with the objective of finding each other. To do so, they must talk about what they see, and achieve mutual understanding. We describe a data collection and show that the resulting dialogues indeed exhibit the dialogue phenomena of interest, while also challenging the language & vision aspect.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a coordination game corpus meant to elicit mutual visual grounding in dialogue, but the abstract provides no evidence or examples to back the central claim that the setup actually forces those behaviors.

The main thing to know is that this paper defines MeetUp!, a two-player game in which participants move through a visual environment and must locate each other by talking about what they see. The task is positioned as requiring both visual grounding and discourse coordination, which goes beyond standard visual QA or captioning setups in the cited prior work. That task definition is the clear new element here, and the collection protocol is described at a high level that seems workable for producing dialogue data in a shared visual setting. Credit is due for identifying that many existing datasets simplify the dialogue side and for trying to build something that demands more ongoing mutual understanding.

The soft spot is exactly the one flagged in the stress-test note. The abstract asserts that the collected dialogues show joint activity, mutual understanding, and visual grounding, yet it supplies no metrics on clarification requests, grounding success rates, or comparisons to simpler strategies, and no examples. Without those details it is impossible to tell whether the game rules actually block one-sided descriptions or non-interactive movement. The concern that the mechanics may permit less interesting play therefore stands on the basis of the abstract alone. If the full paper contains quantitative breakdowns or qualitative analysis demonstrating the targeted phenomena, that would address the gap.

This work is aimed at researchers building multimodal dialogue systems that handle coordination rather than single-turn queries. A reader focused on visual grounding in extended conversation could extract the task setup and any released data for their own experiments. The paper deserves a serious referee to evaluate the data analysis and check whether the claims about dialogue phenomena are supported once the full details are available.

Referee Report

1 major / 0 minor

Summary. The paper introduces MeetUp!, a two-player coordination game in a visual environment where participants must locate each other through dialogue about observed scenes, thereby requiring visual and conversational grounding. It describes the data collection setup and asserts that the collected dialogues exhibit joint activity, mutual understanding, and visual grounding phenomena while posing challenges for language-and-vision models, positioning the corpus as an advance over prior simplified datasets.

Significance. If the collected dialogues reliably demonstrate the claimed grounding and coordination behaviors, the corpus would offer a valuable resource for developing and evaluating models that handle mutual understanding in visually grounded settings, addressing limitations in existing datasets that oversimplify dialogue.

major comments (1)

[Abstract / data collection] Abstract and data collection description: the central claim that 'the resulting dialogues indeed exhibit the dialogue phenomena of interest' is asserted without any quantitative metrics (e.g., rates of clarification requests, successful grounding acts, or comparison to baseline non-interactive strategies), qualitative examples, or analysis of how game rules block simpler one-sided descriptions; this leaves the weakest assumption untested and undermines evaluation of whether the task elicits the targeted behaviors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract and data collection description. We address the major comment below.

Referee: [Abstract / data collection] Abstract and data collection description: the central claim that 'the resulting dialogues indeed exhibit the dialogue phenomena of interest' is asserted without any quantitative metrics (e.g., rates of clarification requests, successful grounding acts, or comparison to baseline non-interactive strategies), qualitative examples, or analysis of how game rules block simpler one-sided descriptions; this leaves the weakest assumption untested and undermines evaluation of whether the task elicits the targeted behaviors.

Authors: We agree that the abstract states the claim without accompanying quantitative support or explicit analysis of the game rules' role in eliciting joint activity. The body of the manuscript provides a description of the collection setup and some illustrative dialogue excerpts demonstrating the phenomena, but these are not summarized quantitatively in the abstract or data collection section, nor is there a direct comparison to non-interactive baselines. We will revise the abstract and data collection section to include quantitative metrics on phenomena such as clarification requests and successful grounding acts, along with an analysis of how the game rules require mutual understanding. This will strengthen the substantiation of the claim.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical corpus claim with no derivations or self-referential reductions

This is a data-collection paper describing a two-player game and the resulting dialogues. The central claim is an empirical assertion that the collected data exhibits joint activity, mutual understanding, and visual grounding. No equations, fitted parameters, predictions, or derivation chains exist that could reduce to inputs by construction. No self-citations are load-bearing for any mathematical result. The analysis is self-contained as an observational report on human data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a corpus introduction paper with no mathematical model or derivation. No free parameters, axioms, or invented entities are involved beyond the high-level task definition itself.
