ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
Pith reviewed 2026-05-23 18:26 UTC · model grok-4.3
The pith
A generative model retrieves open-domain images by reasoning over multi-turn multimodal conversations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ChatSearcher is trained end-to-end to accept and produce interleaved image-text inputs and outputs; once trained, the model reasons over the accumulated multimodal conversational context and draws on world knowledge to return the single correct target image from the database.
What carries the argument
The generative retrieval model trained to map interleaved multimodal conversation histories directly to target images.
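To make "interleaved multimodal conversation history" concrete, here is a minimal sketch of how one such query might be represented and linearized into a single sequence. The abstract does not describe the dataset schema or the model's input format, so every name here (`Segment`, `Turn`, `RetrievalExample`, `flatten`, the role labels, and the `<img:...>` placeholder) is a hypothetical illustration, not ChatSearch's actual format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One piece of a turn: either text or a reference to an image."""
    text: Optional[str] = None
    image_id: Optional[str] = None  # hypothetical key into an image store


@dataclass
class Turn:
    role: str  # "user" or "assistant" (assumed role labels)
    segments: List[Segment]


@dataclass
class RetrievalExample:
    history: List[Turn]    # multi-round multimodal context
    target_image_id: str   # the single image the system should return


def flatten(history: List[Turn]) -> List[str]:
    """Linearize interleaved turns into one ordered sequence of items."""
    out: List[str] = []
    for turn in history:
        out.append(f"<{turn.role}>")
        for seg in turn.segments:
            if seg.image_id is not None:
                out.append(f"<img:{seg.image_id}>")  # placeholder for image features
            if seg.text is not None:
                out.append(seg.text)
    return out


# Toy conversation, invented for illustration only.
example = RetrievalExample(
    history=[
        Turn("user", [Segment(text="What breed is this dog?"),
                      Segment(image_id="img_001")]),
        Turn("assistant", [Segment(text="It looks like a corgi.")]),
        Turn("user", [Segment(text="Show me one playing on a beach.")]),
    ],
    target_image_id="img_042",
)
sequence = flatten(example.history)
```

The point of the sketch is that the final request ("one playing on a beach") only makes sense given the earlier image and answer, which is why the whole history, not just the last turn, has to reach the retriever.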
If this is right
- The model achieves superior retrieval accuracy on the ChatSearch dataset.
- The same model obtains competitive results on other image retrieval benchmarks.
- The same model obtains competitive results on visual conversation tasks.
- The model can leverage both multimodal context and world knowledge to produce visual retrieval results.
Where Pith is reading between the lines
- Interactive retrieval interfaces that accept ongoing dialogue could replace single-shot search bars in consumer applications.
- The interleaved input-output format may be reusable for other context-dependent retrieval problems such as video or document search.
- Scaling the number of conversation turns or adding further modalities could be tested directly on the released dataset.
Load-bearing premise
The curated ChatSearch dataset accurately represents the distribution and difficulty of real-world conversational queries for open-domain images.
What would settle it
A controlled test set of new multi-round conversations that require external knowledge in which ChatSearcher retrieves no more correct images than a non-conversational baseline would falsify the claim of effective multimodal reasoning.
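The falsification test above can be sketched as a comparison of retrieval accuracy between two systems on the same held-out conversations. The page reports no metric, so Recall@K is an assumption here (a common choice for image retrieval), and the function names, the two rankers, and the toy data are all hypothetical.

```python
from typing import Callable, Dict, List

# A ranker maps a query id to a ranked list of image ids (best first).
Ranker = Callable[[str], List[str]]


def recall_at_k(ranker: Ranker, targets: Dict[str, str], k: int) -> float:
    """Fraction of queries whose target image appears in the top-k results."""
    if not targets:
        raise ValueError("targets must not be empty")
    hits = sum(1 for query, target in targets.items() if target in ranker(query)[:k])
    return hits / len(targets)


def falsifies_claim(conversational: Ranker, baseline: Ranker,
                    targets: Dict[str, str], k: int = 1) -> bool:
    """True if the conversational model is no better than the baseline."""
    return recall_at_k(conversational, targets, k) <= recall_at_k(baseline, targets, k)


# Toy data, invented for illustration: four queries and their target images.
targets = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
conv_rankings = {"q1": ["a", "x"], "q2": ["b", "y"], "q3": ["z", "c"], "q4": ["d", "w"]}
base_rankings = {"q1": ["a", "x"], "q2": ["y", "b"], "q3": ["z", "c"], "q4": ["w", "d"]}

conv_r1 = recall_at_k(conv_rankings.__getitem__, targets, k=1)
base_r1 = recall_at_k(base_rankings.__getitem__, targets, k=1)
refuted = falsifies_claim(conv_rankings.__getitem__, base_rankings.__getitem__, targets)
```

A real version of this test would also need a significance check over many conversations; a raw comparison like this one only illustrates the shape of the experiment.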
Original abstract
In this paper, we investigate the task of general conversational image retrieval on open-domain images. The objective is to search for images based on interactive conversations between humans and computers. To advance this task, we curate a dataset called ChatSearch. This dataset includes a multi-round multimodal conversational context query for each target image, thereby requiring the retrieval system to find the correct image in the database. Simultaneously, we propose a generative retrieval model named ChatSearcher, which is trained end-to-end to accept and produce interleaved image-text inputs and outputs. ChatSearcher exhibits strong capability in reasoning with multimodal context and can leverage world knowledge to yield visual retrieval results. It demonstrates superior performance on the ChatSearch dataset and also achieves competitive results on other image retrieval tasks and visual conversation tasks. We anticipate that this work will inspire further research on interactive multimodal retrieval systems. Our dataset will be available at https://github.com/joez17/ChatSearch.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the ChatSearch dataset for general conversational image retrieval on open-domain images. Each entry consists of a multi-round multimodal conversational context query paired with a target image, requiring the system to retrieve the correct image from a database. It also proposes ChatSearcher, an end-to-end generative retrieval model that accepts and produces interleaved image-text inputs and outputs. The manuscript claims that ChatSearcher exhibits strong multimodal reasoning, leverages world knowledge for retrieval, achieves superior performance on ChatSearch, and obtains competitive results on other image retrieval and visual conversation tasks.
Significance. If the empirical claims are substantiated, the work would provide a useful benchmark dataset and model architecture for interactive multimodal retrieval, highlighting the potential of generative models that integrate world knowledge. The dataset construction and end-to-end training approach address a plausible gap in conversational image search. However, the absence of any reported metrics, baselines, ablations, or dataset statistics prevents a concrete assessment of whether these contributions advance the state of the art.
major comments (1)
- [Abstract] Abstract: the central claim that ChatSearcher 'demonstrates superior performance on the ChatSearch dataset' and 'achieves competitive results on other image retrieval tasks' is unsupported by any quantitative results, baselines, ablation studies, error analysis, or details on dataset construction and model training. Without these elements the empirical contribution cannot be evaluated.
Simulated Author's Rebuttal
We thank the referee for the detailed review and for highlighting the need for empirical support. We agree that the current manuscript version does not include the quantitative results, baselines, ablations, or dataset statistics necessary to substantiate the claims, and we will revise accordingly to enable proper evaluation.
- Referee: [Abstract] Abstract: the central claim that ChatSearcher 'demonstrates superior performance on the ChatSearch dataset' and 'achieves competitive results on other image retrieval tasks' is unsupported by any quantitative results, baselines, ablation studies, error analysis, or details on dataset construction and model training. Without these elements the empirical contribution cannot be evaluated.
  Authors: We agree that the abstract claims require supporting quantitative evidence. The revised manuscript will add: (1) performance metrics and comparisons against baselines on the ChatSearch dataset, (2) results on other image retrieval and visual conversation benchmarks, (3) ablation studies on model components, (4) error analysis, and (5) full details on dataset construction (including statistics) and the end-to-end training procedure. These additions will directly address the lack of substantiation noted.
  Revision: yes
Circularity Check
No significant circularity; derivation chain is self-contained
The paper introduces a new dataset (ChatSearch) and an end-to-end trained generative retrieval model (ChatSearcher) without any equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations. All performance claims rest on standard empirical training and evaluation on held-out or external tasks, with no reduction of outputs to inputs by construction. This matches the default expectation for a dataset-plus-model contribution paper.