Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

Chris Madge; Juexi Shao; Massimo Poesio; Siyou Li; Vanja Karan; Yujian Gan

arxiv: 2512.02791 · v2 · submitted 2025-12-02 · 💻 cs.CL

Juexi Shao, Siyou Li, Yujian Gan, Chris Madge, Vanja Karan, Massimo Poesio

Pith reviewed 2026-05-17 02:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords dialogue grounding · referring expression comprehension · data synthesis · coreference resolution · distribution shift · multimodal learning · generalized referring expressions

The pith

A three-tier data synthesis framework generates scalable dialogue grounding data that improves model performance on generalized referring expression comprehension under distribution shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the scarcity of annotated dialogue data for Generalized Referring Expression Comprehension, where models must ground expressions to multiple possible targets in images while tracking coreferences over conversation history. Existing approaches suffer when test dialogues differ from training ones. To fix this, the authors develop a three-tier synthesis process that mixes controllable generation steps with realism-preserving elements to create large volumes of training examples. Fine-tuning models on the resulting data produces steady gains on standard benchmarks for grounding accuracy and coreference handling.

Core claim

Dialogue-Based Generalized Referring Expression Comprehension requires models to ground expressions and unlimited targets in complex visual scenes while resolving coreference across long dialogue contexts. Existing systems struggle under distribution shift between training and evaluation domains because of scarce annotated dialogue grounding data. A three-tier data-synthesis method balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.

What carries the argument

Three-tier data-synthesis framework that produces large-scale annotated examples for dialogue-conditioned visual grounding by balancing realism and controllability.
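
To make the controllable end of the pipeline concrete, here is a minimal Python sketch of what a template-based generator might look like. It is not the authors' implementation. The `Block` record, the `template_expressions` helper, and the single "from the top" template are hypothetical, chosen only because the methodology excerpt quoted in the reference graph lists template-based short expression synthesis as the first tier, the phrase "the second green block from the top" appears in the excerpted text, and MDC-R blocks are said to carry identifiers such as A1. It handles one target per expression, whereas GREC allows several.

```python
from dataclasses import dataclass

ORDINALS = ["first", "second", "third", "fourth", "fifth"]


@dataclass(frozen=True)
class Block:
    block_id: str  # identifier in the style the excerpt mentions, e.g. "A1"
    color: str
    height: int    # vertical position in the column; larger means higher


def template_expressions(column):
    """Return (expression, target_id) pairs for one column of blocks.

    Hypothetical helper: fills the single template
    "the <ordinal> <color> block from the top".
    """
    seen_per_color = {}
    pairs = []
    # Walk the column from top to bottom, counting blocks of each colour.
    for block in sorted(column, key=lambda b: b.height, reverse=True):
        rank = seen_per_color.get(block.color, 0)
        seen_per_color[block.color] = rank + 1
        if rank < len(ORDINALS):
            text = f"the {ORDINALS[rank]} {block.color} block from the top"
            pairs.append((text, block.block_id))
    return pairs


# Illustrative scene, invented for this sketch.
column = [
    Block("A1", "green", 0),
    Block("A2", "red", 1),
    Block("A3", "green", 2),
    Block("A4", "green", 3),
]
pairs = template_expressions(column)
for text, target in pairs:
    print(f"{text} -> {target}")
```

Because the target identifier is produced together with the text, every generated expression arrives already annotated. That is presumably the kind of controllability the framework trades against the realism of its prompted and dialogue-level tiers.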

If this is right

Models gain robustness to domain shifts without requiring new human annotations for each target domain.
Performance rises on tasks that demand tracking referents across multiple dialogue turns.
Training becomes feasible for scenes containing many possible targets rather than single unique objects.
The same synthesis pipeline can support larger-scale experiments on longer or more complex dialogues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tiered synthesis approach could be tested on related tasks such as visual question answering over dialogue history.
If the generated dialogues contain detectable artifacts, downstream applications may need additional filtering steps before deployment.
Extending the framework to generate data for entirely new visual domains would test whether the improvements generalize beyond current benchmarks.
Combining the synthesized data with small amounts of real human dialogue might yield further gains while keeping annotation costs low.

Load-bearing premise

The three-tier synthesis process produces data whose distribution is close enough to real human dialogues that improvements on held-out test sets will transfer to genuinely unseen dialogue domains.

What would settle it

A controlled experiment showing no gains or degraded performance when models fine-tuned on the synthesized data are evaluated on dialogues drawn from a new domain with different conversational styles or visual setups would falsify the central claim.

Figures

Figures reproduced from arXiv: 2512.02791 by Chris Madge, Juexi Shao, Massimo Poesio, Siyou Li, Vanja Karan, Yujian Gan.

**Figure 1.** Multimodal Data Synthesis. As the data requirements of large-scale pre-training have grown, synthetic data has emerged as a prominent paradigm. A broad class of methods [9, 10, 19] relies on simulators to produce data with controllable distributions and programmatically generated annotations [11, 20, 21], whereas recent approaches leverage generative models to synthesize diverse textual [12, 13] and multimo…

read the original abstract

Dialogue-Based Generalized Referring Expression Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a three-tier synthesis pipeline for scaling up dialogue GREC training data and claims consistent gains from fine-tuning, but the evidence for those gains and for distributional match to real dialogues is still thin.

This paper introduces a three-tier data synthesis framework for generating training examples in dialogue-based generalized referring expression comprehension. The headline result is that fine-tuning on the synthesized data produces consistent improvements over prior approaches on standard metrics. The three-tier structure is presented as a way to balance realism with controllability so the generated dialogues can cover long contexts and coreference chains while remaining scalable. That combination is a reasonable practical response to the acknowledged scarcity of annotated multimodal dialogue data, and the focus on distribution shift between training and evaluation domains is a clear statement of the real bottleneck. The work does a decent job laying out why existing systems struggle and how synthesis can supply the missing supervision without requiring new human annotation at scale. The tiers themselves sound like a structured attempt to inject different kinds of control at different stages, which could be useful for other data-augmentation efforts in vision-language dialogue.

The soft spots are more noticeable. The abstract states the improvements without numbers, ablations, or implementation details, so it is impossible to tell how much each tier contributes or whether the gains are robust. The central assumption, that the synthetic distribution is close enough to real human dialogues for the improvements to survive on genuinely unseen data, is not directly tested with divergence measures or human-likeness ratings. If the controllability steps introduce systematic biases in coreference patterns or scene statistics, the metric gains could be artifacts of the synthetic regime rather than genuine robustness. That concern from the stress-test note still looks live on the basis of what is described.

This is the kind of paper that would interest researchers working on multimodal dialogue systems or practical data generation techniques. A reader looking for concrete ideas on structuring a synthesis pipeline might pick up some useful scaffolding, but anyone wanting to rely on the claimed gains will need the full experimental section and code to judge. I would send it for peer review so the methods and results can be examined in detail.

Referee Report

2 major / 2 minor

Summary. The paper introduces a three-tier data synthesis framework for generating scalable dialogue grounding data to support Dialogue-Based Generalized Referring Expression Comprehension (GREC). The approach aims to balance realism and controllability in synthetic data creation to mitigate scarcity of annotated data and distribution shift issues between training and evaluation domains. The central claim is that fine-tuning models on this synthesized data produces consistent, substantial improvements over prior methods on standard evaluation metrics for grounding expressions and resolving coreference in complex visual scenes with long dialogue contexts.

Significance. If the empirical gains hold under proper validation, the work would offer a practical, scalable method for augmenting limited dialogue grounding datasets, potentially improving model robustness in multi-turn visual grounding tasks. This could address a key bottleneck in dialogue-conditioned comprehension systems, though its impact depends on demonstrating that synthetic data generalizes beyond the generation process itself.

major comments (2)

[§3] The three-tier synthesis process is described as balancing realism and controllability, but the manuscript provides no direct quantitative assessment (e.g., KL divergence, human-likeness ratings, or coreference pattern statistics) comparing the synthetic distribution to real human dialogues from disjoint corpora. This measurement is load-bearing for the claim that observed metric gains reflect genuine robustness rather than synthesis artifacts.
[Abstract and §4] The assertion of 'consistent, substantial improvements' is stated without accompanying quantitative results, ablation studies on individual tiers, or details on evaluation metrics and baselines in the provided summary sections. This absence prevents verification that the central empirical claim is supported by the data.
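
One way the distributional comparison requested in the §3 comment could be run is sketched below in Python. This is an illustration only: the mention-type categories and all counts are invented and do not come from the paper, and `kl_divergence` is a hypothetical helper. It applies add-alpha smoothing so that a category missing from one corpus does not produce an infinite value.

```python
import math
from collections import Counter


def kl_divergence(real_counts, synth_counts, alpha=1.0):
    """KL(real || synthetic) in nats over a shared categorical support.

    Add-alpha smoothing keeps every probability strictly positive.
    """
    support = sorted(set(real_counts) | set(synth_counts))
    real_total = sum(real_counts.values()) + alpha * len(support)
    synth_total = sum(synth_counts.values()) + alpha * len(support)
    kl = 0.0
    for key in support:
        p = (real_counts.get(key, 0) + alpha) / real_total
        q = (synth_counts.get(key, 0) + alpha) / synth_total
        kl += p * math.log(p / q)
    return kl


# Invented mention-type counts, standing in for statistics that would be
# extracted from a real human dialogue corpus and from the synthetic data.
real = Counter({"pronoun": 40, "definite_np": 50, "demonstrative": 10})
synthetic = Counter({"pronoun": 10, "definite_np": 80, "demonstrative": 10})

same = kl_divergence(real, real)
shifted = kl_divergence(real, synthetic)
print(f"KL(real || real)      = {same:.4f}")
print(f"KL(real || synthetic) = {shifted:.4f}")
```

A value near zero would suggest the synthetic data reproduces the chosen feature distribution, while a large value would flag the kind of systematic bias the referee is worried about. The choice of features (mention types here) is itself an assumption and would need to be justified against the task.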

minor comments (2)

[§2] Notation for coreference chains and target grounding could be clarified with an example dialogue in §2 to aid readability.
[Figures] Figure captions for any synthesis pipeline diagrams should explicitly label the three tiers and their outputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We address each major comment point by point below, indicating planned revisions where they strengthen the manuscript without misrepresenting our existing results.

Referee: [§3] The three-tier synthesis process is described as balancing realism and controllability, but the manuscript provides no direct quantitative assessment (e.g., KL divergence, human-likeness ratings, or coreference pattern statistics) comparing the synthetic distribution to real human dialogues from disjoint corpora. This measurement is load-bearing for the claim that observed metric gains reflect genuine robustness rather than synthesis artifacts.

Authors: We agree that a direct quantitative comparison to real dialogues from disjoint corpora would provide stronger evidence against synthesis artifacts. In the revised manuscript we will add this analysis to §3, reporting KL divergence on coreference and dialogue features, coreference pattern statistics, and human-likeness ratings collected on a held-out sample. These additions will directly address the load-bearing concern while remaining within the scope of the existing synthesis framework.
Revision: yes

Referee: [Abstract and §4] The assertion of 'consistent, substantial improvements' is stated without accompanying quantitative results, ablation studies on individual tiers, or details on evaluation metrics and baselines in the provided summary sections. This absence prevents verification that the central empirical claim is supported by the data.

Authors: The full manuscript already contains the requested quantitative results, tier-wise ablations, metric definitions, and baseline comparisons in §4 and the associated tables. To improve accessibility from the summary sections, we will revise the abstract to include the most salient numerical gains and ensure §4 more explicitly cross-references the supporting evidence. This is a partial revision because the core empirical support exists in the body; the change mainly enhances visibility in the summary portions.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical synthesis and evaluation remain self-contained

The paper describes a three-tier data synthesis framework for generating dialogue grounding supervision and reports empirical gains from fine-tuning on the resulting data. No equations, fitted parameters, or first-principles derivations are present that could reduce to their own inputs by construction. The central claim rests on observed metric improvements across standard evaluations rather than any self-definitional mapping, renamed known result, or load-bearing self-citation chain. The distributional closeness assumption is an empirical premise open to external falsification and does not create circularity under the specified criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly assumes that synthetic data can stand in for real annotated dialogues without introducing unmeasured biases.

pith-pipeline@v0.9.0 · 5397 in / 1121 out tokens · 40098 ms · 2026-05-17T02:27:13.937848+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "a three-tier data augmentation framework for solving the data sparsity of MDC-R"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 2 internal anchors

[1]

INTRODUCTION. Referring Expression Comprehension (REC), the task of locating a target referred to by a natural language description [1, 2, 3], is a key vision-language research task. Recent advances have pushed the state of the art beyond simple surface matching toward richer use of semantic information, most notably, constructing compositional referr...

[2]

A three-tier data augmentation framework for solving the data sparsity of MDC-R [7], spanning short expressions to multi-turn dialogues

[3]

Experimental demonstration that the model trained on these synthetic data achieves notable improvements, with precision increasing by ≈20%

[4]

The finding that biases across data types influence model learning and generalization, motivating the importance of distribution-aware training

[5]

Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

RELATED WORK. REC and GREC. REC has advanced rapidly in recent years. The early two-stage paradigm [14] couples off-the-shelf object detectors with linguistic cues to compute region–expression matching scores. The field has progressed from specialist to generalist grounding frameworks [15] that are pre-trained at scal...

[6]

training signals. Both lines of work aim to expand coverage and reduce manual labelling effort. Based on previous works, we propose three distinct tiers of data synthesis methods

[7]

the second green block from the top

METHODOLOGY. To balance data realism and controllability, we introduce a three-tier synthesis framework comprising: (i) template-based short expression synthesis, (ii) prompted short expression synthesis, and (iii) full dialogue with coreference information synthesis. We detail the construction procedures and explain how the components integrate to prod...

[8]

indicates that off-the-shelf LLMs exhibit limited coreference tracking. We therefore fine-tune a Qwen2-VL [25] on external coreference-aware dialogue corpora [16], enabling coherent generation of dialogues with explicit coreference chains in Minecraft scenes. The outputs contain (i) coreference-consistent dialogues and (ii) structured expressions,...

[9]

EXPERIMENT. 4.1. Dataset. We adopted the MDC-R benchmark [7] for evaluation. The MDC-R test split comprises 423 instances; each instance includes a scene image, a multi-turn dialogue, and an associated target mention. As shown in Figure 3, given an input of dialogue and image, the model predicts multiple bounding boxes that should refer to the ground trut...

[10]

to learn to generate dialogues containing coreference information. 4.2. Bounding Boxes Reading. MDC-R [7] has assigned a unique identifier to each block, composed of letters and Arabic numerals, e.g., A1. This ensures that all entity mentions within a dialogue refer to distinct combinations of blocks. Minecraft allows obtaining the pixel locations of the...

[11]

CONCLUSION. This paper addresses GREC data scarcity stemming from the high cost of annotation. We achieved substantial performance gains via a three-tier data synthesis method, followed by model fine-tuning. The method is generalizable to other vision-language tasks facing limited supervision. Future work could adopt distribution-aware training to mitiga...

[12]

Referitgame: Referring to objects in photographs of natural scenes

S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “Referitgame: Referring to objects in photographs of natural scenes,” in EMNLP, 2014, pp. 787–798

[13]

Modeling context in referring expressions

L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, “Modeling context in referring expressions,” in European conference on computer vision. Springer, 2016, pp. 69–85

[14]

Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2641–2649

[15]

Cops-ref: A new dataset and task on compositional referring expression comprehension

Z. Chen, P. Wang, L. Ma, K.-Y. K. Wong, and Q. Wu, “Cops-ref: A new dataset and task on compositional referring expression comprehension,” in CVPR, 2020, pp. 10086–10095

[16]

Give me something to eat: Referring expression comprehension with commonsense knowledge

P. Wang, D. Liu, H. Li, and Q. Wu, “Give me something to eat: Referring expression comprehension with commonsense knowledge,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 28–36

[17]

Advancing visual grounding with scene knowledge: Benchmark and method

Z. Chen, R. Zhang, Y. Song, X. Wan, and G. Li, “Advancing visual grounding with scene knowledge: Benchmark and method,” in CVPR, 2023, pp. 15039–15049

[18]

Mdc-r: The minecraft dialogue corpus with reference

C. Madge, M. Camilleri, P. C. Garcia, V. Karan, J. Shao, P. Jayannavar, J. Hough, B. Roth, and M. Poesio, “Mdc-r: The minecraft dialogue corpus with reference,” arXiv preprint arXiv:2506.22062, 2025

[19]

GREC: Generalized Referring Expression Comprehension, 2023

S. He, H. Ding, C. Liu, and X. Jiang, “Grec: Generalized referring expression comprehension,” arXiv preprint arXiv:2308.16182, 2023

[20]

Collaborative dialogue in minecraft

A. Narayan-Chen, P. Jayannavar, and J. Hockenmaier, “Collaborative dialogue in minecraft,” in ACL, 2019, pp. 5405–5415

[21]

Interactive grounded language understanding in a collaborative environment: Iglu 2021

J. Kiseleva, Z. Li, M. Aliannejadi, S. Mohanty, M. ter Hoeve, M. Burtsev, A. Skrynnik, A. Zholus, A. Panov, K. Srinet et al., “Interactive grounded language understanding in a collaborative environment: Iglu 2021,” in NeurIPS 2021 Competitions and Demonstrations Track. PMLR, 2022, pp. 146–161

[22]

Caesar: An embodied simulator for generating multimodal referring expression datasets

M. M. Islam, R. Mirzaiee, A. Gladstone, H. Green, and T. Iqbal, “Caesar: An embodied simulator for generating multimodal referring expression datasets,” Advances in Neural Information Processing Systems, vol. 35, pp. 21001–21015, 2022

[23]

Metamath: Bootstrap your own mathematical questions for large language models

L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. Kwok, Z. Li, A. Weller, and W. Liu, “Metamath: Bootstrap your own mathematical questions for large language models,” in The Twelfth International Conference on Learning Representations

[24]

MathGenie: Generating synthetic data with question back-translation for enhancing mathematical reasoning of LLMs

Z. Lu, A. Zhou, H. Ren, K. Wang, W. Shi, J. Pan, M. Zhan, and H. Li, “MathGenie: Generating synthetic data with question back-translation for enhancing mathematical reasoning of LLMs,” in ACL, 2024, pp. 2732–2747

[25]

Mattnet: Modular attention network for referring expression comprehension

L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg, “Mattnet: Modular attention network for referring expression comprehension,” in CVPR, 2018, pp. 1307–1315

[26]

Mdetr-modulated detection for end-to-end multi-modal understanding

A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, “Mdetr-modulated detection for end-to-end multi-modal understanding,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1780–1790

[27]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024

[28]

Recantformer: Referring expression comprehension with varying numbers of targets

B. Hemanthage, H. Bilen, P. Bartie, C. Dondrup, and O. Lemon, “Recantformer: Referring expression comprehension with varying numbers of targets,” in EMNLP, 2024, pp. 21784–21798

[29]

Jack of all tasks master of many: Designing general-purpose coarse-to-fine vision-language model

S. Pramanick, G. Han, R. Hou, S. Nag, S.-N. Lim, N. Ballas, Q. Wang, R. Chellappa, and A. Almahairi, “Jack of all tasks master of many: Designing general-purpose coarse-to-fine vision-language model,” in CVPR, 2024, pp. 14076–14088

[30]

Alfred: A benchmark for interpreting grounded instructions for everyday tasks

M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “Alfred: A benchmark for interpreting grounded instructions for everyday tasks,” in CVPR, 2020, pp. 10740–10749

[31]

Generating easy-to-understand referring expressions for target identifications

M. Tanaka, T. Itamochi, K. Narioka, I. Sato, Y. Ushiku, and T. Harada, “Generating easy-to-understand referring expressions for target identifications,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5793–5802

[32]

Clevr-ref+: Diagnosing visual reasoning with referring expressions

R. Liu, C. Liu, Y. Bai, and A. L. Yuille, “Clevr-ref+: Diagnosing visual reasoning with referring expressions,” in CVPR, 2019, pp. 4185–4194

[33]

Harlequin: Color-driven generation of synthetic data for referring expression comprehension

L. Parolari, E. Izzo, and L. Ballan, “Harlequin: Color-driven generation of synthetic data for referring expression comprehension,” in International Conference on Pattern Recognition. Springer, 2025, pp. 292–307

[34]

Introducing GPT-4.1 in the API

OpenAI, “Introducing GPT-4.1 in the API,” https://openai.com/index/gpt-4-1/, Apr. 2025, accessed: Sep. 17, 2025

[35]

Assessing the capabilities of large language models in coreference: An evaluation

Y. Gan, J. Yu, and M. Poesio, “Assessing the capabilities of large language models in coreference: An evaluation,” in LREC-COLING 2024, 2024, pp. 1645–1665

[36]

What you see is what you get: Visual pronoun coreference resolution in dialogues

X. Yu, H. Zhang, Y. Song, Y. Song, and C. Zhang, “What you see is what you get: Visual pronoun coreference resolution in dialogues,” arXiv preprint arXiv:1909.00421, 2019

[1] [1]

[1, 2, 3] - is a key vision-language research task

INTRODUCTION Referring Expression Comprehension (REC) - the task of locating a target referred to by a natural language descrip- tion. [1, 2, 3] - is a key vision-language research task. Recent advances have pushed the state of the art beyond simple surface matching toward richer use of semantic information —most notably, constructing compositional referr...

work page

[2] [2]

A three-tier data augmentation framework for solving the data sparsity of MDC-R [7], spanning short expressions to multi-turn dialogues

work page

[3] [3]

Experimental demonstration that the model trained on these synthetic data achieves notable improvements, with precision increasing by≈20%

work page

[4] [4]

The finding that biases across data types influence model learning and generalization, motivating the importance of distribution-aware training

work page

[5] [5]

Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

RELA TED WORK REC and GRECREC has advanced rapidly in recent years. The early two-stage paradigm [14], couples off- arXiv:2512.02791v1 [cs.CL] 2 Dec 2025 the-shelf object detectors with linguistic cues to compute region–expression matching scores. The field has progressed from specialist to generalist grounding frameworks [15] that are pre-trained at scal...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Both lines of work aim to expand coverage and reduce manual labelling effort

training signals. Both lines of work aim to expand coverage and reduce manual labelling effort. Based on pre- vious works, we propose three distinct tiers of data synthesis methods

work page

[7] [7]

the second green block from the top

METHODOLOGY To balance data realism and controllability, we introduce a three-tier synthesis framework comprising: (i) template- based short expression synthesis, (ii) prompted short ex- pression synthesis, and (iii) full dialogue with coreference information synthesis. We detail the construction procedures and explain how the components integrate to prod...

work page

[8] [8]

indicates that off-the-shelf LLMs exhibit limited coref- erence tracking. We therefore fine-tune a Qwen2-VL [25] on external coreference-aware dialogue corpora [16], en- abling coherent generation of dialogues with explicit coref- erence chains in Minecraft scenes. The outputs contain (i) coreference-consistent dialogues and (ii) structured expres- sions,...

work page

[9] [9]

Dataset We adopted the MDC–R benchmark [7] for evaluation

EXPERIMENT 4.1. Dataset We adopted the MDC–R benchmark [7] for evaluation. The MDC–R test split comprises 423 instances; each instance includes a scene image, a multi–turn dialogue, and an as- sociated target mention. As shown in Figure 3, given an input of dialogue and image, the model predicts multiple bounding boxes that should refer to the ground trut...

work page

[10] [10]

to learn to generate dialogues containing coreference information. 4.2. Bounding Boxes Reading MDC-R [7] has assigned a unique identifier to each block, composed of letters and Arabic numerals, e.g., A1. This en- sures that all entity mentions within a dialogue refer to distinct combinations of blocks. Minecraft allows obtaining the pixel locations of the...

work page

[11] [11]

We achieved substantial perfor- mance gains via a three-tier data synthesis method, followed by model fine-tuning

CONCLUSION This paper addresses GREC data scarcity stemming from the high cost of annotation. We achieved substantial perfor- mance gains via a three-tier data synthesis method, followed by model fine-tuning. The method is generalizable to other vision-language tasks facing limited supervision. Future work could adopt distribution-aware training to mitiga...

work page

[12] [12]

Refer- itgame: Referring to objects in photographs of natural scenes,

S. Kazemzadeh, V . Ordonez, M. Matten, and T. Berg, “Refer- itgame: Referring to objects in photographs of natural scenes,” inEMNLP, 2014, pp. 787–798

work page 2014

[13] [13]

Model- ing context in referring expressions,

L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, “Model- ing context in referring expressions,” inEuropean conference on computer vision. Springer, 2016, pp. 69–85

work page 2016

[14] [14]

Flickr30k entities: Col- lecting region-to-phrase correspondences for richer image-to- sentence models,

B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Col- lecting region-to-phrase correspondences for richer image-to- sentence models,” inProceedings of the IEEE international conference on computer vision, 2015, pp. 2641–2649

work page 2015

[15] [15]

Cops- ref: A new dataset and task on compositional referring expres- sion comprehension,

Z. Chen, P. Wang, L. Ma, K.-Y . K. Wong, and Q. Wu, “Cops- ref: A new dataset and task on compositional referring expres- sion comprehension,” inCVPR, 2020, pp. 10 086–10 095

work page 2020

[16] P. Wang, D. Liu, H. Li, and Q. Wu, “Give me something to eat: Referring expression comprehension with commonsense knowledge,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 28–36.

[17] Z. Chen, R. Zhang, Y. Song, X. Wan, and G. Li, “Advancing visual grounding with scene knowledge: Benchmark and method,” in CVPR, 2023, pp. 15 039–15 049.

[18] C. Madge, M. Camilleri, P. C. Garcia, V. Karan, J. Shao, P. Jayannavar, J. Hough, B. Roth, and M. Poesio, “Mdc-r: The minecraft dialogue corpus with reference,” arXiv preprint arXiv:2506.22062, 2025.

[19] S. He, H. Ding, C. Liu, and X. Jiang, “Grec: Generalized referring expression comprehension,” arXiv preprint arXiv:2308.16182, 2023.

[20] A. Narayan-Chen, P. Jayannavar, and J. Hockenmaier, “Collaborative dialogue in minecraft,” in ACL, 2019, pp. 5405–5415.

[21] J. Kiseleva, Z. Li, M. Aliannejadi, S. Mohanty, M. ter Hoeve, M. Burtsev, A. Skrynnik, A. Zholus, A. Panov, K. Srinet et al., “Interactive grounded language understanding in a collaborative environment: Iglu 2021,” in NeurIPS 2021 Competitions and Demonstrations Track. PMLR, 2022, pp. 146–161.

[22] M. M. Islam, R. Mirzaiee, A. Gladstone, H. Green, and T. Iqbal, “Caesar: An embodied simulator for generating multimodal referring expression datasets,” Advances in Neural Information Processing Systems, vol. 35, pp. 21 001–21 015, 2022.

[23] L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. Kwok, Z. Li, A. Weller, and W. Liu, “Metamath: Bootstrap your own mathematical questions for large language models,” in The Twelfth International Conference on Learning Representations.

[24] Z. Lu, A. Zhou, H. Ren, K. Wang, W. Shi, J. Pan, M. Zhan, and H. Li, “MathGenie: Generating synthetic data with question back-translation for enhancing mathematical reasoning of LLMs,” in ACL, 2024, pp. 2732–2747.

[25] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg, “Mattnet: Modular attention network for referring expression comprehension,” in CVPR, 2018, pp. 1307–1315.

[26] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, “Mdetr-modulated detection for end-to-end multi-modal understanding,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1780–1790.

[27] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024.

[28] B. Hemanthage, H. Bilen, P. Bartie, C. Dondrup, and O. Lemon, “Recantformer: Referring expression comprehension with varying numbers of targets,” in EMNLP, 2024, pp. 21 784–21 798.

[29] S. Pramanick, G. Han, R. Hou, S. Nag, S.-N. Lim, N. Ballas, Q. Wang, R. Chellappa, and A. Almahairi, “Jack of all tasks master of many: Designing general-purpose coarse-to-fine vision-language model,” in CVPR, 2024, pp. 14 076–14 088.

[30] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “Alfred: A benchmark for interpreting grounded instructions for everyday tasks,” in CVPR, 2020, pp. 10 740–10 749.

[31] M. Tanaka, T. Itamochi, K. Narioka, I. Sato, Y. Ushiku, and T. Harada, “Generating easy-to-understand referring expressions for target identifications,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5793–5802.

[32] R. Liu, C. Liu, Y. Bai, and A. L. Yuille, “Clevr-ref+: Diagnosing visual reasoning with referring expressions,” in CVPR, 2019, pp. 4185–4194.

[33] L. Parolari, E. Izzo, and L. Ballan, “Harlequin: Color-driven generation of synthetic data for referring expression comprehension,” in International Conference on Pattern Recognition. Springer, 2025, pp. 292–307.

[34] OpenAI, “Introducing GPT-4.1 in the API,” https://openai.com/index/gpt-4-1/, Apr. 2025, accessed: Sep. 17, 2025.

[35] Y. Gan, J. Yu, and M. Poesio, “Assessing the capabilities of large language models in coreference: An evaluation,” in LREC-COLING 2024, 2024, pp. 1645–1665.

[36] X. Yu, H. Zhang, Y. Song, Y. Song, and C. Zhang, “What you see is what you get: Visual pronoun coreference resolution in dialogues,” arXiv preprint arXiv:1909.00421, 2019.