arxiv: 1907.05084 · v1 · pith:SWVFZ6ZKnew · submitted 2019-07-11 · 💻 cs.CL · cs.CV

MeetUp! A Corpus of Joint Activity Dialogues in a Visual Environment

Pith reviewed 2026-05-24 23:25 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords visual dialogue · joint activity · grounding · coordination game · corpus collection · mutual understanding · language and vision

The pith

MeetUp! is a two-player game that generates dialogues requiring both visual and conversational grounding to achieve mutual understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing visual dialogue datasets tend to oversimplify the dialogue component. The paper introduces MeetUp!, a coordination game in which two players move through a shared visual environment with the goal of finding each other. Players must describe what they see and reach mutual understanding through conversation. The collected dialogues are shown to exhibit joint activity phenomena while posing challenges for language and vision processing. This setup makes stronger demands on discourse representations than prior tasks.

Core claim

MeetUp! is a two-player coordination game in which players move through a visual environment with the objective of finding each other. To succeed, they must talk about what they see and achieve mutual understanding. The resulting dialogues are claimed to exhibit the targeted joint activity phenomena while also challenging language and vision processing.

What carries the argument

The MeetUp! task, a two-player coordination game in a visual environment that requires players to describe observations and reach mutual understanding to locate each other.

If this is right

  • Models trained on the corpus must handle discourse-level representations that track shared visual knowledge across turns.
  • The task highlights limitations in current vision-language systems when coordination requires ongoing reference resolution.
  • The corpus provides data for studying how speakers establish common ground in dynamic visual settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar game designs could be adapted to test whether models can maintain consistent spatial models across longer interactions.
  • The approach connects to questions about how dialogue systems might scale to multi-agent embodied scenarios.
  • If the dialogues prove rich, they could serve as a benchmark for measuring progress toward conversational agents that operate in shared physical spaces.

Load-bearing premise

The game mechanics will reliably force players into joint activity and grounding behaviors rather than allowing simpler or less interesting strategies.

What would settle it

If the collected dialogues turn out to consist mostly of short commands or location reports without evidence of mutual understanding or detailed visual descriptions, the claim that the task elicits the targeted phenomena would not hold.
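The check described above can be made concrete with a rough sketch. The code below is not from the paper: the `Turn` type, the cue list, the three-token threshold, and the toy dialogue are all assumptions made for illustration, and a keyword heuristic is only a crude stand-in for annotated dialogue acts. It computes the share of very short turns and the share of turns that contain a clarification cue.

```python
from dataclasses import dataclass

# Hypothetical cue phrases; a real study would rely on annotated dialogue acts.
CLARIFICATION_CUES = ("do you mean", "which one", "what do you mean", "did you say")


@dataclass
class Turn:
    speaker: str
    text: str


def short_turn_share(turns, max_tokens=3):
    """Fraction of turns with at most `max_tokens` whitespace-separated tokens."""
    if not turns:
        return 0.0
    short = sum(1 for t in turns if len(t.text.split()) <= max_tokens)
    return short / len(turns)


def clarification_share(turns, cues=CLARIFICATION_CUES):
    """Fraction of turns containing at least one cue phrase (case-insensitive)."""
    if not turns:
        return 0.0
    hits = sum(1 for t in turns if any(c in t.text.lower() for c in cues))
    return hits / len(turns)


# Invented toy dialogue, not taken from the MeetUp! corpus.
dialogue = [
    Turn("A", "I am in a kitchen with a red table"),
    Turn("B", "Do you mean the one with two windows?"),
    Turn("A", "yes"),
    Turn("B", "ok going north"),
]

print(short_turn_share(dialogue))     # share of turns with three tokens or fewer
print(clarification_share(dialogue))  # share of turns with a clarification cue
```

A corpus dominated by short turns with almost no clarification cues would point toward the negative outcome described above. Both the threshold and the cue list would need tuning against hand-labelled data before either number could be trusted.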

Original abstract

Building computer systems that can converse about their visual environment is one of the oldest concerns of research in Artificial Intelligence and Computational Linguistics (see, for example, Winograd's 1972 SHRDLU system). Only recently, however, have methods from computer vision and natural language processing become powerful enough to make this vision seem more attainable. Pushed especially by developments in computer vision, many data sets and collection environments have recently been published that bring together verbal interaction and visual processing. Here, we argue that these datasets tend to oversimplify the dialogue part, and we propose a task---MeetUp!---that requires both visual and conversational grounding, and that makes stronger demands on representations of the discourse. MeetUp! is a two-player coordination game where players move in a visual environment, with the objective of finding each other. To do so, they must talk about what they see, and achieve mutual understanding. We describe a data collection and show that the resulting dialogues indeed exhibit the dialogue phenomena of interest, while also challenging the language & vision aspect.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces MeetUp!, a two-player coordination game in a visual environment where participants must locate each other through dialogue about observed scenes, thereby requiring visual and conversational grounding. It describes the data collection setup and asserts that the collected dialogues exhibit joint activity, mutual understanding, and visual grounding phenomena while posing challenges for language-and-vision models, positioning the corpus as an advance over prior simplified datasets.

Significance. If the collected dialogues reliably demonstrate the claimed grounding and coordination behaviors, the corpus would offer a valuable resource for developing and evaluating models that handle mutual understanding in visually grounded settings, addressing limitations in existing datasets that oversimplify dialogue.

major comments (1)
  1. [Abstract / data collection] Abstract and data collection description: the central claim that 'the resulting dialogues indeed exhibit the dialogue phenomena of interest' is asserted without any quantitative metrics (e.g., rates of clarification requests, successful grounding acts, or comparison to baseline non-interactive strategies), qualitative examples, or analysis of how game rules block simpler one-sided descriptions; this leaves the weakest assumption untested and undermines evaluation of whether the task elicits the targeted behaviors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract and data collection description. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract / data collection] Abstract and data collection description: the central claim that 'the resulting dialogues indeed exhibit the dialogue phenomena of interest' is asserted without any quantitative metrics (e.g., rates of clarification requests, successful grounding acts, or comparison to baseline non-interactive strategies), qualitative examples, or analysis of how game rules block simpler one-sided descriptions; this leaves the weakest assumption untested and undermines evaluation of whether the task elicits the targeted behaviors.

    Authors: We agree that the abstract states the claim without accompanying quantitative support or explicit analysis of the game rules' role in eliciting joint activity. The body of the manuscript provides a description of the collection setup and some illustrative dialogue excerpts demonstrating the phenomena, but these are not summarized quantitatively in the abstract or data collection section, nor is there a direct comparison to non-interactive baselines. We will revise the abstract and data collection section to include quantitative metrics on phenomena such as clarification requests and successful grounding acts, along with an analysis of how the game rules require mutual understanding. This will strengthen the substantiation of the claim.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical corpus claim with no derivations or self-referential reductions

Full rationale

This is a data-collection paper describing a two-player game and the resulting dialogues. The central claim is an empirical assertion that the collected data exhibits joint activity, mutual understanding, and visual grounding. No equations, fitted parameters, predictions, or derivation chains exist that could reduce to inputs by construction. No self-citations are load-bearing for any mathematical result. The analysis is self-contained as an observational report on human data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a corpus introduction paper with no mathematical model or derivation. No free parameters, axioms, or invented entities are involved beyond the high-level task definition itself.

pith-pipeline@v0.9.0 · 5717 in / 955 out tokens · 17909 ms · 2026-05-24T23:25:40.069643+00:00 · methodology
