arxiv: 2410.18715 · v1 · submitted 2024-10-24 · 💻 cs.CV

ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval

Pith reviewed 2026-05-23 18:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords: conversational image retrieval, generative retrieval, multimodal conversation, ChatSearch dataset, open-domain images, interleaved image-text, image search model

The pith

A generative model retrieves open-domain images by reasoning over multi-turn multimodal conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the task of general conversational image retrieval, in which a system must identify a specific target image from a large database after receiving a sequence of user messages that may include both text and images. To make progress on this task the authors release the ChatSearch dataset, each entry of which pairs one target image with a multi-round conversational context that mixes text and pictures. They introduce ChatSearcher, a single end-to-end model that ingests and emits interleaved image-text sequences and is trained to output the correct image when given the preceding dialogue. The model is shown to exploit both the supplied multimodal history and external world knowledge, yielding higher retrieval accuracy on ChatSearch than prior methods and remaining competitive on standard image-retrieval and visual-dialogue benchmarks. Readers should care because the work replaces isolated keyword queries with natural, ongoing dialogue as the interface for finding pictures.
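
As a rough illustration of the data shape described above, one dataset entry can be pictured as a list of turns, each carrying text, an image, or both, paired with a single target image. This is only a sketch: the class names, field names, and file names below are hypothetical and are not taken from the released ChatSearch format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One message in the dialogue; it may carry text, an image, or both."""
    speaker: str                      # hypothetical labels: "user" or "assistant"
    text: Optional[str] = None
    image_path: Optional[str] = None  # hypothetical: path or ID of an attached image


@dataclass
class ConversationQuery:
    """A multi-round multimodal context paired with exactly one target image."""
    turns: List[Turn] = field(default_factory=list)
    target_image_id: str = ""

    def num_images_in_context(self) -> int:
        # Count the turns that include an image.
        return sum(1 for t in self.turns if t.image_path is not None)


# Hypothetical example entry: three turns, one of which includes a picture.
example = ConversationQuery(
    turns=[
        Turn("user", text="I saw this building on a trip.", image_path="img_001.jpg"),
        Turn("assistant", text="It looks like a Gothic cathedral."),
        Turn("user", text="Find me a photo of a similar one at night."),
    ],
    target_image_id="img_042",
)
```

The retrieval system only sees `turns`; `target_image_id` is the label it must recover from the database.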

Core claim

ChatSearcher is trained end-to-end to accept and produce interleaved image-text inputs and outputs; once trained, the model reasons over the accumulated multimodal conversational context and draws on world knowledge to return the single correct target image from the database.

What carries the argument

The generative retrieval model trained to map interleaved multimodal conversation histories directly to target images.
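
The summary above does not spell out how the model's output is matched against the image database. A common pattern in retrieval systems, assumed here purely for illustration, is to compare a query vector produced from the conversation with precomputed image vectors and rank by cosine similarity. The function names, vectors, and image IDs below are hypothetical and do not describe ChatSearcher's actual mechanism.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(
    query_vec: Sequence[float],
    database: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[str]:
    """Return the IDs of the top_k database images most similar to the query."""
    scored = sorted(
        database.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in scored[:top_k]]


# Toy database of 2-D vectors; real systems would use learned embeddings.
database = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [0.7, 0.7]}
ranked = rank_images([0.9, 0.1], database, top_k=2)
print(ranked)  # ['img_a', 'img_c']
```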

If this is right

  • The model achieves superior retrieval accuracy on the ChatSearch dataset.
  • The same model obtains competitive results on other image retrieval benchmarks.
  • The same model obtains competitive results on visual conversation tasks.
  • The model can leverage both multimodal context and world knowledge to produce visual retrieval results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interactive retrieval interfaces that accept ongoing dialogue could replace single-shot search bars in consumer applications.
  • The interleaved input-output format may be reusable for other context-dependent retrieval problems such as video or document search.
  • Scaling the number of conversation turns or adding further modalities could be tested directly on the released dataset.

Load-bearing premise

The curated ChatSearch dataset accurately represents the distribution and difficulty of real-world conversational queries for open-domain images.

What would settle it

The claim of effective multimodal reasoning would be falsified if, on a controlled test set of new multi-round conversations that require external knowledge, ChatSearcher retrieved no more correct images than a non-conversational baseline.
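
A comparison of this kind is typically scored with Recall@K, the fraction of queries whose target image appears among the top K results. The sketch below shows how the two systems could be compared on the same queries; the rankings and image IDs are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def recall_at_k(
    rankings: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose target image appears in the top-k results."""
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(targets)


# Hypothetical ranked outputs for four queries (not data from the paper).
targets = ["t1", "t2", "t3", "t4"]
conversational = [["t1", "x"], ["y", "t2"], ["t3", "z"], ["w", "v"]]
baseline = [["x", "t1"], ["y", "z"], ["t3", "z"], ["w", "v"]]

for k in (1, 2):
    print(k, recall_at_k(conversational, targets, k), recall_at_k(baseline, targets, k))
```

If the conversational model's Recall@K were no higher than the baseline's across such a test set, the premise that the dialogue context helps would be in doubt.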

Figures

Figures reproduced from arXiv: 2410.18715 by Erdong Hu, Hua Huang, Jing Liu, Longteng Guo, Shuai Shao, Tongtian Yue, Zehuan Yuan, Zijia Zhao.

Figure 1: Our generative retrieval model ChatSearcher can accept multimodal inputs and …
Figure 2: Illustration of automatic data construction pipeline for general conversational image …
Figure 3: Multimodal dialogue construction. The whole pipeline is combined with a text …
Figure 4: Architecture of our generative retrieval model ChatSearcher. Interleaved documents …
Figure 5: Ablation study on feature queue size. We show the average recall on three retrieval …
Figure 6: Ablation study on instruction data scale.
Figure 7: Qualitative results of ChatSearcher. We show ChatSearcher’s conversational image …
Figure 8: Qualitative results of combining grounding and retrieval: using retrieval result to …
Figure 9: Interaction on result choosing. We show that different choices on image results in previous round can influence the results in following round. In these samples, user choose different image returned by ChatSearcher and input same instruction to interact with model. ChatSearcher return different results based on user’s choice and instruction
Figure 10: Interaction on instruction choosing. We show that different text instructions on same image can influence the results. In these samples, user input different instructions with a same image. ChatSearcher return different results based on user’s instruction and given image.

Original abstract

In this paper, we investigate the task of general conversational image retrieval on open-domain images. The objective is to search for images based on interactive conversations between humans and computers. To advance this task, we curate a dataset called ChatSearch. This dataset includes a multi-round multimodal conversational context query for each target image, thereby requiring the retrieval system to find the accurate image from database. Simultaneously, we propose a generative retrieval model named ChatSearcher, which is trained end-to-end to accept/produce interleaved image-text inputs/outputs. ChatSearcher exhibits strong capability in reasoning with multimodal context and can leverage world knowledge to yield visual retrieval results. It demonstrates superior performance on the ChatSearch dataset and also achieves competitive results on other image retrieval tasks and visual conversation tasks. We anticipate that this work will inspire further research on interactive multimodal retrieval systems. Our dataset will be available at https://github.com/joez17/ChatSearch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces the ChatSearch dataset for general conversational image retrieval on open-domain images. Each entry consists of a multi-round multimodal conversational context query paired with a target image, requiring the system to retrieve the correct image from a database. It also proposes ChatSearcher, an end-to-end generative retrieval model that accepts and produces interleaved image-text inputs and outputs. The manuscript claims that ChatSearcher exhibits strong multimodal reasoning, leverages world knowledge for retrieval, achieves superior performance on ChatSearch, and obtains competitive results on other image retrieval and visual conversation tasks.

Significance. If the empirical claims are substantiated, the work would provide a useful benchmark dataset and model architecture for interactive multimodal retrieval, highlighting the potential of generative models that integrate world knowledge. The dataset construction and end-to-end training approach address a plausible gap in conversational image search. However, the absence of any reported metrics, baselines, ablations, or dataset statistics prevents a concrete assessment of whether these contributions advance the state of the art.

major comments (1)
  1. [Abstract] The central claim that ChatSearcher 'demonstrates superior performance on the ChatSearch dataset' and 'achieves competitive results on other image retrieval tasks' is unsupported by any quantitative results, baselines, ablation studies, error analysis, or details on dataset construction and model training. Without these elements the empirical contribution cannot be evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for highlighting the need for empirical support. We agree that the current manuscript version does not include the quantitative results, baselines, ablations, or dataset statistics necessary to substantiate the claims, and we will revise accordingly to enable proper evaluation.

Point-by-point responses
  1. Referee: [Abstract] The central claim that ChatSearcher 'demonstrates superior performance on the ChatSearch dataset' and 'achieves competitive results on other image retrieval tasks' is unsupported by any quantitative results, baselines, ablation studies, error analysis, or details on dataset construction and model training. Without these elements the empirical contribution cannot be evaluated.

    Authors: We agree that the abstract claims require supporting quantitative evidence. The revised manuscript will add: (1) performance metrics and comparisons against baselines on the ChatSearch dataset, (2) results on other image retrieval and visual conversation benchmarks, (3) ablation studies on model components, (4) error analysis, and (5) full details on dataset construction (including statistics) and end-to-end training procedure. These additions will directly address the lack of substantiation noted.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

The paper introduces a new dataset (ChatSearch) and an end-to-end trained generative retrieval model (ChatSearcher) without any equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations. All performance claims rest on standard empirical training and evaluation on held-out or external tasks, with no reduction of outputs to inputs by construction. This matches the default expectation for a dataset-plus-model contribution paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described beyond standard assumptions of supervised training on a curated dataset and the existence of world knowledge in the underlying language model backbone.
