ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval

Erdong Hu; Hua Huang; Jing Liu; Longteng Guo; Shuai Shao; Tongtian Yue; Zehuan Yuan; Zijia Zhao

arxiv: 2410.18715 · v1 · submitted 2024-10-24 · 💻 cs.CV

Zijia Zhao, Longteng Guo, Tongtian Yue, Erdong Hu, Shuai Shao, Zehuan Yuan, Hua Huang, Jing Liu

Pith reviewed 2026-05-23 18:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords: conversational image retrieval, generative retrieval, multimodal conversation, ChatSearch dataset, open-domain images, interleaved image-text, image search model

The pith

A generative model retrieves open-domain images by reasoning over multi-turn multimodal conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the task of general conversational image retrieval, in which a system must identify a specific target image from a large database after receiving a sequence of user messages that may include both text and images. To make progress on this task the authors release the ChatSearch dataset, each entry of which pairs one target image with a multi-round conversational context that mixes text and pictures. They introduce ChatSearcher, a single end-to-end model that ingests and emits interleaved image-text sequences and is trained to output the correct image when given the preceding dialogue. The model is shown to exploit both the supplied multimodal history and external world knowledge, yielding higher retrieval accuracy on ChatSearch than prior methods and remaining competitive on standard image-retrieval and visual-dialogue benchmarks. Readers should care because the work replaces isolated keyword queries with natural, ongoing dialogue as the interface for finding pictures.
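To make the shape of such an entry concrete, here is a minimal Python sketch of one multi-round, mixed text-and-image query paired with a target image. The class names, field names, and image identifiers are hypothetical; the actual ChatSearch schema is not described in the material above.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    image_id: str  # hypothetical key into an image store


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    parts: List[Union[TextPart, ImagePart]]


@dataclass
class ConversationQuery:
    turns: List[Turn]
    target_image_id: str  # the one image the system should retrieve


def flatten(query: ConversationQuery) -> List[Tuple[str, str]]:
    """Linearize the dialogue into one interleaved (kind, value) sequence."""
    sequence: List[Tuple[str, str]] = []
    for turn in query.turns:
        sequence.append(("role", turn.role))
        for part in turn.parts:
            if isinstance(part, TextPart):
                sequence.append(("text", part.text))
            else:
                sequence.append(("image", part.image_id))
    return sequence


# Illustrative entry only; the content is invented for this sketch.
example = ConversationQuery(
    turns=[
        Turn("user", [TextPart("What landmark is in this photo?"), ImagePart("img_001")]),
        Turn("assistant", [TextPart("It looks like a lighthouse.")]),
        Turn("user", [TextPart("Show me one at sunset.")]),
    ],
    target_image_id="img_042",
)
```

Calling `flatten(example)` yields a single sequence in which text and image references keep their original order, which is the kind of interleaved input the model is said to ingest.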

Core claim

ChatSearcher is trained end-to-end to accept and produce interleaved image-text inputs and outputs; once trained, the model reasons over the accumulated multimodal conversational context and draws on world knowledge to return the single correct target image from the database.

What carries the argument

The generative retrieval model trained to map interleaved multimodal conversation histories directly to target images.
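One common way to turn a model output into a retrieval result is to embed the query and every database image in a shared vector space and rank by cosine similarity. The sketch below illustrates that generic step only; whether ChatSearcher scores candidates this way is an assumption, and the vectors and image identifiers are made up.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(
    query_vec: Sequence[float],
    database: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[str]:
    """Return the ids of the top_k database images most similar to the query."""
    scored = sorted(
        database.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in scored[:top_k]]


# Toy 2-D embeddings, purely illustrative.
toy_database = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_images([1.0, 0.1], toy_database, top_k=3)
```

In a real system the exhaustive sort would usually be replaced by an approximate nearest-neighbour index, but the ranking principle is the same.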

If this is right

The model achieves superior retrieval accuracy on the ChatSearch dataset.
The same model obtains competitive results on other image retrieval benchmarks.
The same model obtains competitive results on visual conversation tasks.
The model can leverage both multimodal context and world knowledge to produce visual retrieval results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interactive retrieval interfaces that accept ongoing dialogue could replace single-shot search bars in consumer applications.
The interleaved input-output format may be reusable for other context-dependent retrieval problems such as video or document search.
Scaling the number of conversation turns or adding further modalities could be tested directly on the released dataset.

Load-bearing premise

The curated ChatSearch dataset accurately represents the distribution and difficulty of real-world conversational queries for open-domain images.

What would settle it

A controlled test set of new multi-round conversations that require external knowledge in which ChatSearcher retrieves no more correct images than a non-conversational baseline would falsify the claim of effective multimodal reasoning.
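Such a comparison is typically scored with Recall@K, the fraction of queries whose target image appears among the top K returned results. The following sketch shows the computation on invented rankings; the system names and image ids are hypothetical and no numbers here come from the paper.

```python
from typing import List, Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Fraction of queries whose target id is within the first k ranked ids."""
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(targets)


# Invented test set of four queries, one target image each.
targets: List[str] = ["img1", "img2", "img3", "img4"]

# Hypothetical ranked outputs of a conversational system and a baseline.
conversational = [["img1", "x"], ["y", "img2"], ["img3", "z"], ["q", "r"]]
baseline = [["x", "img1"], ["y", "z"], ["q", "img3"], ["r", "s"]]

conv_r1 = recall_at_k(conversational, targets, 1)
conv_r2 = recall_at_k(conversational, targets, 2)
base_r1 = recall_at_k(baseline, targets, 1)
base_r2 = recall_at_k(baseline, targets, 2)
```

Under the falsification test described above, the claim would be in trouble if the conversational system's recall did not exceed the baseline's on the new test set.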

Figures

Figures reproduced from arXiv: 2410.18715 by Erdong Hu, Hua Huang, Jing Liu, Longteng Guo, Shuai Shao, Tongtian Yue, Zehuan Yuan, Zijia Zhao.

**Figure 2.** Illustration of automatic data construction pipeline for general conversational image …

**Figure 3.** Multimodal dialogue construction. The whole pipeline is combined with a text …

**Figure 4.** Architecture of our generative retrieval model ChatSearcher. Interleaved documents …

**Figure 5.** Ablation study on feature queue size. We show the average recall on three retrieval …

**Figure 6.** Ablation study on instruction data scale.

**Figure 7.** Qualitative results of ChatSearcher. We show ChatSearcher’s conversational image …

**Figure 8.** Qualitative results of combining grounding and retrieval: using retrieval result to …

**Figure 9.** Interaction on result choosing. We show that different choices on image results in previous round can influence the results in following round. In these samples, user choose different image returned by ChatSearcher and input same instruction to interact with model. ChatSearcher return different results based on user’s choice and instruction

**Figure 10.** Interaction on instruction choosing. We show that different text instructions on same image can influence the results. In these samples, user input different instructions with a same image. ChatSearcher return different results based on user’s instruction and given image.

Original abstract

In this paper, we investigate the task of general conversational image retrieval on open-domain images. The objective is to search for images based on interactive conversations between humans and computers. To advance this task, we curate a dataset called ChatSearch. This dataset includes a multi-round multimodal conversational context query for each target image, thereby requiring the retrieval system to find the accurate image from database. Simultaneously, we propose a generative retrieval model named ChatSearcher, which is trained end-to-end to accept/produce interleaved image-text inputs/outputs. ChatSearcher exhibits strong capability in reasoning with multimodal context and can leverage world knowledge to yield visual retrieval results. It demonstrates superior performance on the ChatSearch dataset and also achieves competitive results on other image retrieval tasks and visual conversation tasks. We anticipate that this work will inspire further research on interactive multimodal retrieval systems. Our dataset will be available at https://github.com/joez17/ChatSearch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a new dataset and generative model for multi-turn conversational image retrieval on open-domain images, but the abstract provides no metrics or details so the performance claims stay untested.

The main point is a dataset called ChatSearch that pairs multi-round multimodal conversations with target images for retrieval, plus an end-to-end model called ChatSearcher that takes and produces interleaved image-text sequences. This setup targets interactive search where the system must reason over conversation history to pick the right open-domain image. The combination is not directly covered by prior image retrieval or visual dialogue papers, so the dataset fills a specific gap and the generative formulation is a reasonable way to handle the context. The model is also said to pull in world knowledge for retrieval, which fits the task. That part is useful for anyone working on multimodal dialogue systems or retrieval benchmarks. The dataset could become a standard testbed if the construction process holds up.

The soft spots are clear from the abstract alone. No numbers, baselines, ablations, or error analysis appear, and there is no description of how the conversations were collected or how the model was trained. This leaves the claims of superior performance and strong multimodal reasoning unsupported for now. The usual risks with new datasets apply here too: it is not obvious whether gains would come from the architecture or from artifacts in how the data was built.

This work is for researchers focused on conversational multimodal retrieval or visual dialogue. A reader who needs a new benchmark in that niche would get value from the dataset and the model description. It deserves a serious referee because the task definition is clean and the contribution is concrete, even though the evaluation section will need close checking. I would send it to peer review for feedback on the data and results rather than desk reject.

Referee Report

1 major / 0 minor

Summary. The paper introduces the ChatSearch dataset for general conversational image retrieval on open-domain images. Each entry consists of a multi-round multimodal conversational context query paired with a target image, requiring the system to retrieve the correct image from a database. It also proposes ChatSearcher, an end-to-end generative retrieval model that accepts and produces interleaved image-text inputs and outputs. The manuscript claims that ChatSearcher exhibits strong multimodal reasoning, leverages world knowledge for retrieval, achieves superior performance on ChatSearch, and obtains competitive results on other image retrieval and visual conversation tasks.

Significance. If the empirical claims are substantiated, the work would provide a useful benchmark dataset and model architecture for interactive multimodal retrieval, highlighting the potential of generative models that integrate world knowledge. The dataset construction and end-to-end training approach address a plausible gap in conversational image search. However, the absence of any reported metrics, baselines, ablations, or dataset statistics prevents a concrete assessment of whether these contributions advance the state of the art.

major comments (1)

[Abstract] The central claim that ChatSearcher 'demonstrates superior performance on the ChatSearch dataset' and 'achieves competitive results on other image retrieval tasks' is unsupported by any quantitative results, baselines, ablation studies, error analysis, or details on dataset construction and model training. Without these elements the empirical contribution cannot be evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for highlighting the need for empirical support. We agree that the current manuscript version does not include the quantitative results, baselines, ablations, or dataset statistics necessary to substantiate the claims, and we will revise accordingly to enable proper evaluation.

Referee: [Abstract] The central claim that ChatSearcher 'demonstrates superior performance on the ChatSearch dataset' and 'achieves competitive results on other image retrieval tasks' is unsupported by any quantitative results, baselines, ablation studies, error analysis, or details on dataset construction and model training. Without these elements the empirical contribution cannot be evaluated.

Authors: We agree that the abstract claims require supporting quantitative evidence. The revised manuscript will add: (1) performance metrics and comparisons against baselines on the ChatSearch dataset, (2) results on other image retrieval and visual conversation benchmarks, (3) ablation studies on model components, (4) error analysis, and (5) full details on dataset construction (including statistics) and end-to-end training procedure. These additions will directly address the lack of substantiation noted.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

The paper introduces a new dataset (ChatSearch) and an end-to-end trained generative retrieval model (ChatSearcher) without any equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations. All performance claims rest on standard empirical training and evaluation on held-out or external tasks, with no reduction of outputs to inputs by construction. This matches the default expectation for a dataset-plus-model contribution paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described beyond standard assumptions of supervised training on a curated dataset and the existence of world knowledge in the underlying language model backbone.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 8 internal anchors

[1] S. Tong, E. Chang, Support vector machine active learning for image retrieval, in: Proceedings of the ninth ACM international conference on Multimedia, 2001, pp. 107–118

[2] X. Y. Felix, R. Ji, M.-H. Tsai, G. Ye, S.-F. Chang, Weak attributes for large-scale image retrieval, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 2949–2956

[3] W. Li, L. Duan, D. Xu, I. W.-H. Tsang, Text-based image retrieval using progressive multi-instance learning, in: 2011 international conference on computer vision, IEEE, 2011, pp. 2049–2055

[4] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763

[5] Y. Rui, T. S. Huang, M. Ortega, S. Mehrotra, Relevance feedback: a power tool for interactive content-based image retrieval, IEEE Transactions on circuits and systems for video technology 8 (5) (1998) 644–655

[6] Z. Liu, C. Rodriguez-Opazo, D. Teney, S. Gould, Image retrieval on real-life images with pre-trained vision-and-language models, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2125–2134

[7] X. Guo, H. Wu, Y. Cheng, S. Rennie, G. Tesauro, R. Feris, Dialog-based interactive image retrieval, Advances in neural information processing systems 31 (2018)

[8] Y. Yuan, W. Lam, Conversational fashion image retrieval via multiturn natural language feedback, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 839–848

[9] G.-H. Liu, Z.-Y. Li, L. Zhang, Y. Xu, Image retrieval based on micro-structure descriptor, Pattern Recognition 44 (9) (2011) 2123–2133

[10] P.-W. Huang, S. Dai, Image retrieval by texture similarity, Pattern recognition 36 (3) (2003) 665–679

[11] F. Mahmoudi, J. Shanbehzadeh, A.-M. Eftekhari-Moghadam, H. Soltanian-Zadeh, Image retrieval based on shape similarity by edge orientation autocorrelogram, Pattern recognition 36 (8) (2003) 1725–1736

[12] G.-H. Liu, J.-Y. Yang, Content-based image retrieval using color difference histogram, Pattern recognition 46 (1) (2013) 188–198

[13] L. Wang, Y. Li, S. Lazebnik, Learning deep structure-preserving image-text embeddings, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5005–5013

[14] P. Sangkloy, N. Burnell, C. Ham, J. Hays, The sketchy database: learning to retrieve badly drawn bunnies, ACM Transactions on Graphics (TOG) 35 (4) (2016) 1–12

[15] K. E. Ak, A. A. Kassim, J. H. Lim, J. Y. Tham, Learning attribute representations with localization for flexible fashion search, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7708–7717

[16] J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, arXiv preprint arXiv:2301.12597 (2023)

[17] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, arXiv preprint arXiv:2304.08485 (2023)

[18] H. Liu, C. Li, Y. Li, Y. J. Lee, Improved baselines with visual instruction tuning, arXiv preprint arXiv:2310.03744 (2023)

[19] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, S. Hoi, Instructblip: Towards general-purpose vision-language models with instruction tuning, arXiv preprint arXiv:2305.06500 (2023)

[20] H. Laurençon, D. van Strien, S. Bekman, L. Tronchon, L. Saulnier, T. Wang, S. Karamcheti, A. Singh, G. Pistilli, Y. Jernite, et al., Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023, URL https://huggingface.co/blog/idefics. Accessed (2023) 09–18

[21] J. Y. Koh, R. Salakhutdinov, D. Fried, Grounding language models to images for multimodal inputs and outputs (2023)

[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European conference on computer vision, Springer, 2014, pp. 740–755

[23] OpenAI, Gpt-4 technical report, Tech. rep., https://arxiv.org/abs/2303.08774 (2023)

[24] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al., Laion-5b: An open large-scale dataset for training next generation image-text models, Advances in Neural Information Processing Systems 35 (2022) 25278–25294

[25] A. Karpathy, L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3128–3137

[26] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality (March 2023). URL https://lmsys.org/blog/2023-03-30-vicuna/

[27] Z. Wu, Y. Xiong, S. X. Yu, D. Lin, Unsupervised feature learning via non-parametric instance discrimination, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3733–3742

[28] P. Sharma, N. Ding, S. Goodman, R. Soricut, Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565

[29] W. Zhu, J. Hessel, A. Awadalla, S. Y. Gadre, J. Dodge, A. Fang, Y. Yu, L. Schmidt, W. Y. Wang, Y. Choi, Multimodal c4: An open, billion-scale corpus of images interleaved with text, arXiv preprint arXiv:2304.06939 (2023)

[30] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations, 2018

[31] T. Brooks, A. Holynski, A. A. Efros, Instructpix2pix: Learning to follow image editing instructions, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18392–18402

[32] G. Gu, S. Chun, W. Kim, H. Jun, Y. Kang, S. Yun, Compodiff: Versatile composed image retrieval with latent diffusion, arXiv preprint arXiv:2303.11916 (2023)

[33] A. Baldrati, M. Bertini, T. Uricchio, A. Del Bimbo, Conditioned and composed image retrieval combining and partially fine-tuning clip-based features, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4959–4968

[34] K. Saito, K. Sohn, X. Zhang, C.-L. Li, C.-Y. Lee, K. Saenko, T. Pfister, Pic2word: Mapping pictures to words for zero-shot composed image retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19305–19314

[35] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, S. Lazebnik, Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2641–2649

[36] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh, Making the v in vqa matter: Elevating the role of image understanding in visual question answering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6904–6913

[37] D. A. Hudson, C. D. Manning, Gqa: A new dataset for real-world visual reasoning and compositional question answering, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6700–6709

[38] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al., Mmbench: Is your multi-modal model an all-around player?, arXiv preprint arXiv:2307.06281 (2023)

[39] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, Y. Shan, Seed-bench: Benchmarking multimodal llms with generative comprehension, arXiv preprint arXiv:2307.16125 (2023)

[40] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, J. Zhou, Qwen-vl: A frontier large vision-language model with versatile abilities, arXiv preprint arXiv:2308.12966 (2023)
